Julia’s flagship package for regression is GLM.jl. While this package handles standard regression tasks well, it does not have the breadth of R or Stata. Another disadvantage is its rather sparse documentation.
The usual syntax of GLM.jl is glm(formula, data, family, link), where formula is the regression equation, data is typically a DataFrame containing all the variables used in the formula, family specifies the distribution of the dependent variable (Bernoulli(), Binomial(), Gamma(), Normal(), Poisson(), or NegativeBinomial(θ)), and link specifies how the left- and right-hand side of the formula are connected. Typical combinations of family and link are:
- Bernoulli (LogitLink)
- Binomial (LogitLink)
- Gamma (InverseLink)
- InverseGaussian (InverseSquareLink)
- NegativeBinomial (LogLink)
- Normal (IdentityLink)
- Poisson (LogLink)
Some more details: the usual standard OLS model can be written as Y = Xβ + ε with a normally distributed error term ε, which corresponds to the Normal family combined with an IdentityLink.
The chosen family/link combination depends on the range of the dependent variable. If the dependent variable is continuous and unbounded, Normal/IdentityLink is the natural choice; if it is binary, Bernoulli (or Binomial) with a LogitLink or ProbitLink; if it is a non-negative count, Poisson (or NegativeBinomial) with a LogLink.
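For instance, for count data one would typically combine the Poisson family with a LogLink. A minimal sketch with made-up data (the variable names are purely illustrative):

using DataFrames, GLM
df = DataFrame(Visits = [0, 2, 1, 4, 3, 7], Age = [23., 35, 41, 52, 60, 68], Income = [4., 3, 5, 2, 1, 1])
poisson = glm(@formula(Visits ~ Age + Income), df, Poisson(), LogLink()) #count-data regression
coef(poisson) #coefficients on the scale of the linear predictor (log of the expected count)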
using DataFrames, GLM
df = DataFrame(Y = [1., 2, 3, 4], X1 = [10., 15, 15,20], X2=[2., 5, 1, 0])
reg = glm(@formula(Y~X1+X2),df,Normal(),IdentityLink())
stderror(reg) #standard errors of regression coefficients
coef(reg) #regression coefficients from regression
predict(reg) #predicted y values for observations
deviance(reg) # sum of squared residuals
4×3 DataFrame
│ Row │ Y       │ X1      │ X2      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 10.0    │ 2.0     │
│ 2   │ 2.0     │ 15.0    │ 5.0     │
│ 3   │ 3.0     │ 15.0    │ 1.0     │
│ 4   │ 4.0     │ 20.0    │ 0.0     │

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
────────────────────────────────────────────────────────────────────────────
                Estimate  Std. Error   z value  Pr(>|z|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  -1.16667      1.06719   -1.09322    0.2743   -3.25832   0.924982
X1            0.266667     0.062361   4.27618    <1e-4     0.144441  0.388892
X2           -0.166667     0.117851  -1.41421    0.1573   -0.397651  0.0643173
────────────────────────────────────────────────────────────────────────────

stderror(reg):
3-element Array{Float64,1}:
 1.0671873729054733
 0.062360956446232275
 0.11785113019775778

coef(reg):
3-element Array{Float64,1}:
 -1.166666666666667
  0.2666666666666667
 -0.1666666666666667

predict(reg):
4-element Array{Float64,1}:
 1.1666666666666665
 2.0000000000000004
 2.6666666666666674
 4.166666666666667

deviance(reg):
0.1666666666666662
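Other standard accessors from the StatsBase interface also work on the fitted model, for example (reusing reg from above):

confint(reg) #95% confidence intervals for the coefficients
nobs(reg) #number of observations
dof_residual(reg) #residual degrees of freedom
loglikelihood(reg) #log-likelihood of the fitted model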
It should be noted that the Normal/IdentityLink combination is also available via lm(formula, data).
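A minimal sketch using the same data as above (lm gives the same point estimates as the Normal/IdentityLink call):

using DataFrames, GLM
df = DataFrame(Y = [1., 2, 3, 4], X1 = [10., 15, 15, 20], X2 = [2., 5, 1, 0])
ols = lm(@formula(Y~X1+X2),df) #equivalent to glm(@formula(Y~X1+X2),df,Normal(),IdentityLink())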
It is straightforward to plot a regression line using Plots.jl:
using DataFrames, GLM, Plots
df = DataFrame(Y = rand(15)+0.2*collect(1:15), X1 = rand(15)+0.2*collect(1:15))
reg = glm(@formula(Y~X1),df,Normal(),IdentityLink())
yhat = predict(reg) #predicted y values for observations
scatter(df.X1,df.Y,label="data")
plot!(df.X1,yhat,label="regression line")
savefig("./olsPlot.png")
If the dependent variable is binary, the appropriate models are Probit and Logit. (For the time being, the GLM documentation does not state whether the reported coefficients are marginal effects or not.)
using DataFrames, GLM
df = DataFrame(Y = [0, 1, 1, 0], X1 = [10., 15, 15,20], X2=[2., 5, 1, 0])
probit = glm(@formula(Y ~ X1+X2), df, Binomial(), ProbitLink())
logit = glm(@formula(Y ~ X1+X2), df, Binomial(), LogitLink())
4×3 DataFrame
│ Row │ Y     │ X1      │ X2      │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 0     │ 10.0    │ 2.0     │
│ 2   │ 1     │ 15.0    │ 5.0     │
│ 3   │ 1     │ 15.0    │ 1.0     │
│ 4   │ 0     │ 20.0    │ 0.0     │

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},ProbitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
──────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error     z value  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  -5.68065     146.987    -0.0386472    0.9692  -293.771   282.409
X1            0.262496      7.35009   0.0357133    0.9715   -14.1434   14.6684
X2            1.31248      36.739     0.0357244    0.9715   -70.6946   73.3196
──────────────────────────────────────────────────────────────────────────────

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
────────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error    z value  Pr(>|z|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept)  -15.9524      882.758   -0.018071    0.9856  -1746.13    1714.22
X1             0.762961     44.1382   0.0172857   0.9862    -85.7464    87.2723
X2             3.8148      220.686    0.0172861   0.9862   -428.722    436.352
────────────────────────────────────────────────────────────────────────────────
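Note that predict applied to a fitted probit or logit model returns fitted values on the response scale, i.e. predicted probabilities rather than values of the linear predictor:

predict(probit) #fitted probabilities for each observation
predict(logit)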
The package CovarianceMatrices.jl provides Newey-West (and other) estimates of the variance/covariance matrix that are robust to autocorrelation and heteroskedasticity (work on cluster-robust, heteroskedasticity-consistent estimates appears to be in progress).
using DataFrames, GLM
using CovarianceMatrices
df = DataFrame(Y = [1., 2, 3, 4, 5, 6], X1 = [10., 15, 15,20,5,3], X2=[2., 5, 1, 0,9,7])
reg = glm(@formula(Y~X1+X2),df,Normal(),IdentityLink())
vcov(reg, BartlettKernel(NeweyWest)) #consistent estimate of the long-run covariance matrix of the coefficients as in Newey and West (1987)
stderror(reg, BartlettKernel(NeweyWest)) #robust standard errors
6×3 DataFrame
│ Row │ Y       │ X1      │ X2      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 10.0    │ 2.0     │
│ 2   │ 2.0     │ 15.0    │ 5.0     │
│ 3   │ 3.0     │ 15.0    │ 1.0     │
│ 4   │ 4.0     │ 20.0    │ 0.0     │
│ 5   │ 5.0     │ 5.0     │ 9.0     │
│ 6   │ 6.0     │ 3.0     │ 7.0     │

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                Estimate  Std. Error    z value  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   3.125         4.41443    0.707906    0.4790  -5.52712   11.7771
X1           -0.0432692     0.243252  -0.177878    0.8588  -0.520035   0.433496
X2            0.216346      0.444116   0.487139    0.6262  -0.654104   1.0868
──────────────────────────────────────────────────────────────────────────────

vcov(reg, BartlettKernel(NeweyWest)):
3×3 CovarianceMatrix{LinearAlgebra.Cholesky{Float64,Array{Float64,2}},BartlettKernel{CovarianceMatrices.Optimal{NeweyWest},Float64},Float64,Array{Float64,2}}:
 28.269    -1.49566    -2.43813
 -1.49566   0.0848069   0.124586
 -2.43813   0.124586    0.227041

stderror(reg, BartlettKernel(NeweyWest)):
3-element Array{Float64,1}:
 5.316855729374855
 0.29121624941233637
 0.4764881548232426
The package RegressionTables.jl can generate tables that summarize the results of several regressions in ASCII, LaTeX, or HTML, for regression output from GLM.jl or FixedEffectModels.jl. The syntax is “regtable(model1, model2, …; renderSettings = …)”, where renderSettings can be asciiOutput(), latexOutput(), or htmlOutput().
using DataFrames, GLM
using RegressionTables
df = DataFrame(Y = [0, 1, 1, 0], X1 = [10., 15, 15,20], X2=[2., 5, 1, 0])
ols = lm(@formula(Y ~ X1+X2), df)
ols1 = lm(@formula(Y ~ X1), df)
probit = glm(@formula(Y ~ X1+X2), df, Binomial(), ProbitLink())
regtable(ols,ols1,probit; renderSettings=asciiOutput())
4×3 DataFrame
│ Row │ Y     │ X1      │ X2      │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 0     │ 10.0    │ 2.0     │
│ 2   │ 1     │ 15.0    │ 5.0     │
│ 3   │ 1     │ 15.0    │ 1.0     │
│ 4   │ 0     │ 20.0    │ 0.0     │

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
──────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.333333    2.13437    -0.156174    0.9014  -27.4531   26.7865
X1            0.0333333   0.124722    0.267261    0.8337   -1.55141   1.61808
X2            0.166667    0.235702    0.707107    0.6082   -2.82821   3.16155
──────────────────────────────────────────────────────────────────────────────

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1

Coefficients:
───────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error   t value  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)       0.5      1.5411   0.324443    0.7764   -6.13083   7.13083
X1                0.0      0.1      0.0         1.0000   -0.430265  0.430265
───────────────────────────────────────────────────────────────────────────

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},ProbitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
──────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error     z value  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  -5.68065     146.987    -0.0386472    0.9692  -293.771   282.409
X1            0.262496      7.35009   0.0357133    0.9715   -14.1434   14.6684
X2            1.31248      36.739     0.0357244    0.9715   -70.6946   73.3196
──────────────────────────────────────────────────────────────────────────────

-------------------------------------------
                           Y
              -----------------------------
                   (1)       (2)        (3)
-------------------------------------------
(Intercept)     -0.333     0.500     -5.681
               (2.134)   (1.541)  (146.987)
X1               0.033     0.000      0.262
               (0.125)   (0.100)    (7.350)
X2               0.167                1.312
               (0.236)             (36.739)
-------------------------------------------
Estimator          OLS       OLS         NL
-------------------------------------------
N                    4         4          4
R2               0.333     0.000
-------------------------------------------
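To write the table to a file instead of printing it to the screen, a file name can be passed to the output function (the file names below are just placeholders):

regtable(ols,ols1,probit; renderSettings=latexOutput("regressionTable.tex")) #writes a LaTeX table
regtable(ols,ols1,probit; renderSettings=htmlOutput("regressionTable.html")) #writes an HTML table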
The package FixedEffectModels.jl provides a fast implementation of fixed effect models. The syntax is reg(df,formula,Vcov,keywords) where
- df is the DataFrame,
- formula has the structure dependent variable ~ exogenous variables + (endogenous variables ~ instrumental variables) + fe(fixedeffect variable). Several fixed effect variables can be added with “+”. Interaction of fixed effects can be added by “fe(var1)&fe(var2)” and interaction between a fixed effect and a continuous variable can be added as “fe(var1)&var2”.
- Vcov determines how the variance should be estimated. Options include “Vcov.robust()”, “Vcov.cluster(:State)”, and “Vcov.cluster(:State, :Year)”.
- keywords:
- “weights = :Pop” would weight observations according to the variable Pop,
- “subset = df.State .>= 30” would only use observations for which the variable State is greater than or equal to 30,
- “method” can be set to either “:cpu” or “:gpu”. For GPU usage, it is recommended to use another keyword, namely “double_precision = false”, to use Float32 instead of Float64,
- “save” can be set to “:residuals” to save the residuals, to “:fe” to save the fixed effects, or to “true” to save both,
- “contrasts = Dict(:YearC => DummyCoding(base = 80))” can specify contrasts for categorical variables.
using DataFrames
using FixedEffectModels
data = DataFrame(Y=rand(20),T=repeat([1,2,3,4],5),State=vcat(ones(4),2*ones(4),3*ones(4),4*ones(4),5*ones(4)),X=rand(20))
reg(data,@formula(Y~X+fe(State)+fe(T)),Vcov.robust()) #state and time fixed effects with heteroskedasticity-robust standard errors
20×4 DataFrame
│ Row │ Y        │ T     │ State   │ X           │
│     │ Float64  │ Int64 │ Float64 │ Float64     │
├─────┼──────────┼───────┼─────────┼─────────────┤
│ 1   │ 0.42193  │ 1     │ 1.0     │ 0.925758    │
│ 2   │ 0.971518 │ 2     │ 1.0     │ 0.173109    │
│ 3   │ 0.607865 │ 3     │ 1.0     │ 0.000122176 │
│ 4   │ 0.477846 │ 4     │ 1.0     │ 0.262952    │
│ 5   │ 0.088192 │ 1     │ 2.0     │ 0.570897    │
│ 6   │ 0.444799 │ 2     │ 2.0     │ 0.687402    │
│ 7   │ 0.878585 │ 3     │ 2.0     │ 0.811851    │
│ 8   │ 0.339433 │ 4     │ 2.0     │ 0.308256    │
│ 9   │ 0.31987  │ 1     │ 3.0     │ 0.677966    │
│ 10  │ 0.045794 │ 2     │ 3.0     │ 0.868849    │
│ 11  │ 0.863658 │ 3     │ 3.0     │ 0.0596709   │
│ 12  │ 0.480076 │ 4     │ 3.0     │ 0.135417    │
│ 13  │ 0.83249  │ 1     │ 4.0     │ 0.288515    │
│ 14  │ 0.766804 │ 2     │ 4.0     │ 0.0833432   │
│ 15  │ 0.163652 │ 3     │ 4.0     │ 0.328403    │
│ 16  │ 0.175665 │ 4     │ 4.0     │ 0.321047    │
│ 17  │ 0.120774 │ 1     │ 5.0     │ 0.348646    │
│ 18  │ 0.208773 │ 2     │ 5.0     │ 0.62603     │
│ 19  │ 0.82246  │ 3     │ 5.0     │ 0.0147338   │
│ 20  │ 0.629823 │ 4     │ 5.0     │ 0.93501     │

                      Fixed Effect Model
===============================================================
Number of obs:            20   Degrees of freedom:          10
R2:                    0.263   R2 Adjusted:             -0.400
F Statistic:        0.522297   p-value:                  0.488
R2 within:             0.058   Iterations:                   2
Converged:              true
===============================================================
     Estimate  Std.Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
---------------------------------------------------------------
X   -0.240669   0.333013  -0.722701     0.486  -0.982667    0.50133
===============================================================
Note that FixedEffectModels.jl also provides the option to estimate an IV regression in a non-panel data set, i.e. everything works even if we do not specify any “fe()” term.
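For instance, a minimal IV sketch with clustered standard errors (made-up data; Z serves as an instrument for the endogenous regressor X, and no fe() term is used):

using DataFrames, FixedEffectModels
data = DataFrame(Y=rand(20), X=rand(20), Z=rand(20), State=repeat(1:5, inner=4))
reg(data, @formula(Y ~ (X ~ Z)), Vcov.cluster(:State)) #X instrumented by Z, standard errors clustered by State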
The Econometrics.jl package provides an alternative to the standard packages mentioned above. It also offers routines for OLS, limited dependent variables, panel data (fixed effects, random effects, between estimator), models for nominal and ordinal dependent variables, and heteroskedasticity-consistent standard errors. The only reason why I do not fully endorse it is that, for the time being, it is written and maintained by a single person, so there is an elevated risk that it may no longer be maintained at some point in the future. In fact, I cannot install it right now due to unsatisfiable dependency requirements. However, it might become a nice option in the future if it continues to be maintained.