Regression Analysis

1 The GLM package

Julia’s flagship package for regression is GLM.jl. While this package handles standard regression tasks well, it does not have the breadth of R or Stata. Another disadvantage is its rather sparse documentation.

The usual syntax of GLM.jl is glm(formula, data, family, link), where formula is the regression equation, data is typically a DataFrame containing all the variables used in the formula, family specifies the distribution (Bernoulli(), Binomial(), Gamma(), Normal(), Poisson(), or NegativeBinomial(θ)), and link specifies how the left- and right-hand sides of the formula are connected. Typical combinations of family and link are: Bernoulli (LogitLink), Binomial (LogitLink), Gamma (InverseLink), InverseGaussian (InverseSquareLink), NegativeBinomial (LogLink), Normal (IdentityLink), and Poisson (LogLink).

Some more details: The usual standard OLS model can be written as $y = Xβ + ε$ with a normally distributed error term $ε$; that is, $y|X$ is normally distributed with expectation $Xβ$. Generalized linear models extend this framework by stating that $y|X$ is distributed according to family with expected value $g^{-1}(Xβ)$, where $g$ is the link. Put differently, it is now $g(E[y|X])$ that follows the linear model $Xβ$.
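
As a concrete example, with the Binomial family and LogitLink the link is $g(p) = \log(p/(1-p))$, so the model states that $E[y|X] = 1/(1+e^{-Xβ})$, which keeps the fitted mean between 0 and 1.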

The chosen family/link combination depends on the range of the dependent variable. If $y$ is measured on a continuous scale, the Normal/IdentityLink combination is usually used (sometimes also a LogLink). If $y$ is positive and continuous, the Gamma family might be more appropriate. If $y$ is binary, the Binomial family with either LogitLink or ProbitLink is appropriate. For count data, the Poisson or NegativeBinomial distribution can be used.
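
For instance, a count-data regression with the Poisson family and a LogLink could look as follows (a minimal sketch with made-up data):

using DataFrames, GLM

dfCount = DataFrame(Y = [0, 1, 3, 2, 5], X1 = [1., 2, 3, 4, 5]) #Y contains counts

poisson = glm(@formula(Y ~ X1), dfCount, Poisson(), LogLink())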

1.1 OLS

using DataFrames, GLM

df = DataFrame(Y = [1., 2, 3, 4], X1 = [10., 15, 15, 20], X2 = [2., 5, 1, 0])

reg = glm(@formula(Y~X1+X2),df,Normal(),IdentityLink())

stderror(reg) #standard errors of regression coefficients

coef(reg) #regression coefficients from regression

predict(reg) #predicted y values for observations

deviance(reg) # sum of squared residuals 
4×3 DataFrame
│ Row │ Y       │ X1      │ X2      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 10.0    │ 2.0     │
│ 2   │ 2.0     │ 15.0    │ 5.0     │
│ 3   │ 3.0     │ 15.0    │ 1.0     │
│ 4   │ 4.0     │ 20.0    │ 0.0     │
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error   z value  Pr(>|z|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  -1.16667     1.06719   -1.09322    0.2743  -3.25832   0.924982 
X1            0.266667    0.062361   4.27618    <1e-4    0.144441  0.388892 
X2           -0.166667    0.117851  -1.41421    0.1573  -0.397651  0.0643173
────────────────────────────────────────────────────────────────────────────
3-element Array{Float64,1}:
 1.0671873729054733  
 0.062360956446232275
 0.11785113019775778 
3-element Array{Float64,1}:
 -1.166666666666667 
  0.2666666666666667
 -0.1666666666666667
4-element Array{Float64,1}:
 1.1666666666666665
 2.0000000000000004
 2.6666666666666674
 4.166666666666667 
0.1666666666666662

Note that the Normal/IdentityLink combination is also available via the command lm(formula, data).
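
For the example above this reads:

ols = lm(@formula(Y ~ X1 + X2), df) #identical to glm(@formula(Y ~ X1 + X2), df, Normal(), IdentityLink())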

It is straightforward to plot a regression line using Plots.jl:

using DataFrames, GLM, Plots

df = DataFrame(Y = rand(15)+0.2*collect(1:15), X1 = rand(15)+0.2*collect(1:15))

reg = glm(@formula(Y~X1),df,Normal(),IdentityLink())

yhat = predict(reg) #predicted y values for observations

scatter(df.X1,df.Y,label="data")
plot!(df.X1,yhat,label="regression line")
savefig("./olsPlot.png")

[Figure olsPlot.png: scatter plot of the data together with the fitted regression line]

1.2 Logit/Probit

If the dependent variable is binary, the appropriate models are Probit and Logit. Note that the reported coefficients are the coefficients of the linear predictor $Xβ$, i.e. on the link scale; they are not marginal effects.

using DataFrames, GLM

df = DataFrame(Y = [0, 1, 1, 0], X1 = [10., 15, 15, 20], X2 = [2., 5, 1, 0])

probit = glm(@formula(Y ~ X1+X2), df, Binomial(), ProbitLink())

logit = glm(@formula(Y ~ X1+X2), df, Binomial(), LogitLink())
4×3 DataFrame
│ Row │ Y     │ X1      │ X2      │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 0     │ 10.0    │ 2.0     │
│ 2   │ 1     │ 15.0    │ 5.0     │
│ 3   │ 1     │ 15.0    │ 1.0     │
│ 4   │ 0     │ 20.0    │ 0.0     │
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},ProbitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
──────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error     z value  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  -5.68065    146.987    -0.0386472    0.9692  -293.771    282.409 
X1            0.262496     7.35009   0.0357133    0.9715   -14.1434    14.6684
X2            1.31248     36.739     0.0357244    0.9715   -70.6946    73.3196
──────────────────────────────────────────────────────────────────────────────
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
────────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error     z value  Pr(>|z|)   Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept)  -15.9524      882.758   -0.018071     0.9856  -1746.13    1714.22  
X1             0.762961     44.1382   0.0172857    0.9862    -85.7464    87.2723
X2             3.8148      220.686    0.0172861    0.9862   -428.722    436.352 
────────────────────────────────────────────────────────────────────────────────
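
Predicted probabilities, i.e. the fitted values of $E[y|X]$ rather than the linear predictor, can be obtained with predict:

predict(logit) #fitted probability of Y=1 for each observation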

1.3 Robust standard errors

The package CovarianceMatrices.jl provides Newey-West (and other) estimates of the variance/covariance matrix that are robust to autocorrelation and heteroskedasticity (work on cluster-robust, heteroskedasticity-consistent estimates appears to be in progress).

using DataFrames, GLM
using CovarianceMatrices

df = DataFrame(Y = [1., 2, 3, 4, 5, 6], X1 = [10., 15, 15, 20, 5, 3], X2 = [2., 5, 1, 0, 9, 7])

reg = glm(@formula(Y~X1+X2),df,Normal(),IdentityLink())

vcov(reg, BartlettKernel(NeweyWest)) #consistent estimate of the long-run covariance matrix of the coefficients as in Newey and West (1987)

stderror(reg, BartlettKernel(NeweyWest)) #robust standard errors
6×3 DataFrame
│ Row │ Y       │ X1      │ X2      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 10.0    │ 2.0     │
│ 2   │ 2.0     │ 15.0    │ 5.0     │
│ 3   │ 3.0     │ 15.0    │ 1.0     │
│ 4   │ 4.0     │ 20.0    │ 0.0     │
│ 5   │ 5.0     │ 5.0     │ 9.0     │
│ 6   │ 6.0     │ 3.0     │ 7.0     │
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
──────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error    z value  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   3.125        4.41443    0.707906    0.4790  -5.52712   11.7771  
X1           -0.0432692    0.243252  -0.177878    0.8588  -0.520035   0.433496
X2            0.216346     0.444116   0.487139    0.6262  -0.654104   1.0868  
──────────────────────────────────────────────────────────────────────────────
3×3 CovarianceMatrix{LinearAlgebra.Cholesky{Float64,Array{Float64,2}},BartlettKernel{CovarianceMatrices.Optimal{NeweyWest},Float64},Float64,Array{Float64,2}}:
 28.269    -1.49566    -2.43813 
 -1.49566   0.0848069   0.124586
 -2.43813   0.124586    0.227041
3-element Array{Float64,1}:
 5.316855729374855  
 0.29121624941233637
 0.4764881548232426 
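
The robust standard errors can be combined with the usual point estimates, e.g. to compute z-statistics by hand (a small sketch using only the functions shown above):

coef(reg) ./ stderror(reg, BartlettKernel(NeweyWest)) #z-statistics based on Newey-West standard errors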

1.4 LaTeX code for regression tables

The package RegressionTables.jl provides the functionality to generate tables summarizing the results of several regressions in LaTeX (or HTML) for regression output from GLM.jl or FixedEffectModels.jl. The syntax is “regtable(model1, model2, …; renderSettings = …)”, where renderSettings can be asciiOutput(), latexOutput(), or htmlOutput().

using DataFrames, GLM
using RegressionTables

df = DataFrame(Y = [0, 1, 1, 0], X1 = [10., 15, 15, 20], X2 = [2., 5, 1, 0])

ols = lm(@formula(Y ~ X1+X2), df)
ols1 = lm(@formula(Y ~ X1), df)
probit = glm(@formula(Y ~ X1+X2), df, Binomial(), ProbitLink())

regtable(ols,ols1,probit; renderSettings=asciiOutput())
4×3 DataFrame
│ Row │ Y     │ X1      │ X2      │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 0     │ 10.0    │ 2.0     │
│ 2   │ 1     │ 15.0    │ 5.0     │
│ 3   │ 1     │ 15.0    │ 1.0     │
│ 4   │ 0     │ 20.0    │ 0.0     │
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
──────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.333333     2.13437   -0.156174    0.9014  -27.4531    26.7865 
X1            0.0333333    0.124722   0.267261    0.8337   -1.55141    1.61808
X2            0.166667     0.235702   0.707107    0.6082   -2.82821    3.16155
──────────────────────────────────────────────────────────────────────────────
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1

Coefficients:
───────────────────────────────────────────────────────────────────────────
             Estimate  Std. Error   t value  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)       0.5      1.5411  0.324443    0.7764  -6.13083    7.13083 
X1                0.0      0.1     0.0         1.0000  -0.430265   0.430265
───────────────────────────────────────────────────────────────────────────
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},ProbitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Y ~ 1 + X1 + X2

Coefficients:
──────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error     z value  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  -5.68065    146.987    -0.0386472    0.9692  -293.771    282.409 
X1            0.262496     7.35009   0.0357133    0.9715   -14.1434    14.6684
X2            1.31248     36.739     0.0357244    0.9715   -70.6946    73.3196
──────────────────────────────────────────────────────────────────────────────

-------------------------------------------
                            Y              
              -----------------------------
                  (1)       (2)         (3)
-------------------------------------------
(Intercept)    -0.333     0.500      -5.681
              (2.134)   (1.541)   (146.987)
X1              0.033     0.000       0.262
              (0.125)   (0.100)     (7.350)
X2              0.167                 1.312
              (0.236)              (36.739)
-------------------------------------------
Estimator         OLS       OLS          NL
-------------------------------------------
N                   4         4           4
R2              0.333     0.000            
-------------------------------------------
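
To write the LaTeX code to a file rather than to the screen, a file name can be passed to the output function (the file-name variant of latexOutput is taken from the RegressionTables.jl documentation):

regtable(ols, ols1, probit; renderSettings = latexOutput("regressionTable.tex"))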


2 Panel data: Fixed effects (+IV)

The package FixedEffectModels.jl provides a fast implementation of fixed effect models. The syntax is reg(df, formula, vcov; keywords...) where

  • df is the DataFrame,
  • formula has the structure dependent variable ~ exogenous variables + (endogenous variables ~ instrumental variables) + fe(fixed effect variable). Several fixed effect variables can be added with “+”. An interaction of fixed effects can be added as “fe(var1)&fe(var2)”, and an interaction between a fixed effect and a continuous variable as “fe(var1)&var2”.
  • vcov determines how the variance should be estimated. Options are “Vcov.robust()” and “Vcov.cluster(:State)” or “Vcov.cluster(:State, :Year)” (an example with clustered standard errors is shown below).
  • keywords:
    • “weights = :Pop” would weight observations according to the variable Pop,
    • “subset = df.State .>= 30” would only use observations for which the variable State is at least 30,
    • “method” can be set to either “:cpu” or “:gpu”. For GPU usage, it is recommended to also set the keyword “double_precision = false” in order to use Float32 instead of Float64.
    • “save” can be set to “:residuals” to save those, to “:fe” to save the fixed effects or to “true” to save both
    • “contrasts = Dict(:YearC => DummyCoding(base = 80))” can specify contrasts for categorical variables.
using DataFrames
using FixedEffectModels

data = DataFrame(Y = rand(20), T = repeat([1, 2, 3, 4], 5), State = repeat(1.0:5.0, inner = 4), X = rand(20))

reg(data,@formula(Y~X+fe(State)+fe(T)),Vcov.robust())

20×4 DataFrame
│ Row │ Y        │ T     │ State   │ X           │
│     │ Float64  │ Int64 │ Float64 │ Float64     │
├─────┼──────────┼───────┼─────────┼─────────────┤
│ 1   │ 0.42193  │ 1     │ 1.0     │ 0.925758    │
│ 2   │ 0.971518 │ 2     │ 1.0     │ 0.173109    │
│ 3   │ 0.607865 │ 3     │ 1.0     │ 0.000122176 │
│ 4   │ 0.477846 │ 4     │ 1.0     │ 0.262952    │
│ 5   │ 0.088192 │ 1     │ 2.0     │ 0.570897    │
│ 6   │ 0.444799 │ 2     │ 2.0     │ 0.687402    │
│ 7   │ 0.878585 │ 3     │ 2.0     │ 0.811851    │
│ 8   │ 0.339433 │ 4     │ 2.0     │ 0.308256    │
│ 9   │ 0.31987  │ 1     │ 3.0     │ 0.677966    │
│ 10  │ 0.045794 │ 2     │ 3.0     │ 0.868849    │
│ 11  │ 0.863658 │ 3     │ 3.0     │ 0.0596709   │
│ 12  │ 0.480076 │ 4     │ 3.0     │ 0.135417    │
│ 13  │ 0.83249  │ 1     │ 4.0     │ 0.288515    │
│ 14  │ 0.766804 │ 2     │ 4.0     │ 0.0833432   │
│ 15  │ 0.163652 │ 3     │ 4.0     │ 0.328403    │
│ 16  │ 0.175665 │ 4     │ 4.0     │ 0.321047    │
│ 17  │ 0.120774 │ 1     │ 5.0     │ 0.348646    │
│ 18  │ 0.208773 │ 2     │ 5.0     │ 0.62603     │
│ 19  │ 0.82246  │ 3     │ 5.0     │ 0.0147338   │
│ 20  │ 0.629823 │ 4     │ 5.0     │ 0.93501     │
                      Fixed Effect Model                      
===============================================================
Number of obs:              20   Degrees of freedom:         10
R2:                      0.263   R2 Adjusted:            -0.400
F Statistic:          0.522297   p-value:                 0.488
R2 within:               0.058   Iterations:                  2
Converged:                true   
===============================================================
      Estimate Std.Error   t value Pr(>|t|) Lower 95% Upper 95%
---------------------------------------------------------------
X    -0.240669  0.333013 -0.722701    0.486 -0.982667   0.50133
===============================================================
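
Clustered standard errors follow the same pattern; for instance, clustering at the State level:

reg(data, @formula(Y ~ X + fe(State) + fe(T)), Vcov.cluster(:State))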

Note that FixedEffectModels.jl can also estimate an IV regression in a non-panel data set, i.e. everything also works if we do not specify any “fe()”.
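
A minimal sketch of such an IV regression, with made-up data in which Z serves as an instrument for X:

using DataFrames, FixedEffectModels

ivData = DataFrame(Y = rand(20), X = rand(20), Z = rand(20))

reg(ivData, @formula(Y ~ (X ~ Z)), Vcov.robust())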

3 Econometrics.jl

The Econometrics.jl package provides an alternative to the standard packages mentioned above. It provides routines for OLS, limited dependent variables, panel data (fixed effects, random effects, between estimator), models for nominal and ordinal dependent variables, and heteroskedasticity-consistent standard errors. The only reason why I do not fully endorse it is that – for the time being – it is written and maintained by a single person, so there is an elevated risk that it may no longer be maintained at some point in the future. In fact, I cannot install it right now due to unsatisfiable dependency requirements. It might nevertheless be a nice option in the future if it continues to be maintained.