We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I would effectively like to replicate Stata's reghdfe command, which is based on Correia (2016) estimator.
reghdfe
This estimator allows (fast) converging with greater-than-2 levels of fixed-effect, as well as multi-way clustering.
I am unsure how this would look like — either a new IV estimation module, or somehow piggybacking on AbsorbingLS…
I can achieve most of it via a naive statsmodels implementation, as shown in the dummy example below:
statsmodels
webuse nlswork reghdfe ln_wage hours tenure race, absorb(age union) vce(cluster ind_code)
(dropped 1 singleton observations) (MWFE estimator converged in 3 iterations) HDFE Linear regression Number of obs = 18,890 Absorbing 2 HDFE groups F( 3, 11) = 46.81 Statistics robust to heteroskedasticity Prob > F = 0.0000 R-squared = 0.1766 Adj R-squared = 0.1752 Within R-sq. = 0.1083 Number of clusters (ind_code) = 12 Root MSE = 0.4240 (Std. err. adjusted for 12 clusters in ind_code) ------------------------------------------------------------------------------ | Robust ln_wage | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- hours | .0028301 .0015474 1.83 0.095 -.0005757 .0062359 tenure | .0327574 .0034951 9.37 0.000 .0250647 .04045 race | -.1454669 .0250923 -5.80 0.000 -.2006945 -.0902392 _cons | 1.712627 .1019591 16.80 0.000 1.488217 1.937038 ------------------------------------------------------------------------------ Absorbed degrees of freedom: -----------------------------------------------------+ Absorbed FE | Categories - Redundant = Num. Coefs | -------------+---------------------------------------| age | 30 0 30 | union | 2 1 1 | -----------------------------------------------------+
import numpy as np import statsmodels.datasets as sd import statsmodels.formula.api as sm df = sd.webuse("nlswork") dataset = df.dropna( subset=[ "ln_wage", "hours", "tenure", "race", "age", "union", "ind_code", ], how="any", ) model = sm.ols( formula="ln_wage ~ hours + tenure + race + C(age) + C(union)", data=dataset, ) results = model.fit( cov_type="cluster", cov_kwds={"groups": np.array(dataset["ind_code"])}, use_t=True, ) results.summary().as_text()
OLS Regression Results ============================================================================== Dep. Variable: ln_wage R-squared: 0.177 Model: OLS Adj. R-squared: 0.175 Method: Least Squares F-statistic: 2.026e+11 Date: Fri, 08 Sep 2023 Prob (F-statistic): 5.39e-61 Time: 11:06:12 Log-Likelihood: -10579. No. Observations: 18891 AIC: 2.123e+04 Df Residuals: 18856 BIC: 2.150e+04 Df Model: 34 Covariance Type: cluster =================================================================================== coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------- Intercept 1.1509 0.035 32.655 0.000 1.073 1.228 C(age)[T.17.0] 0.2170 0.168 1.295 0.222 -0.152 0.586 C(age)[T.18.0] 0.2016 0.087 2.312 0.041 0.010 0.393 C(age)[T.19.0] 0.2886 0.081 3.581 0.004 0.111 0.466 C(age)[T.20.0] 0.3644 0.073 4.965 0.000 0.203 0.526 C(age)[T.21.0] 0.4269 0.080 5.361 0.000 0.252 0.602 C(age)[T.22.0] 0.4735 0.089 5.312 0.000 0.277 0.670 C(age)[T.23.0] 0.4947 0.087 5.685 0.000 0.303 0.686 C(age)[T.24.0] 0.4986 0.087 5.709 0.000 0.306 0.691 C(age)[T.25.0] 0.4958 0.088 5.662 0.000 0.303 0.688 C(age)[T.26.0] 0.5486 0.090 6.092 0.000 0.350 0.747 C(age)[T.27.0] 0.5226 0.076 6.872 0.000 0.355 0.690 C(age)[T.28.0] 0.5165 0.080 6.463 0.000 0.341 0.692 C(age)[T.29.0] 0.5147 0.087 5.946 0.000 0.324 0.705 C(age)[T.30.0] 0.5126 0.084 6.131 0.000 0.329 0.697 C(age)[T.31.0] 0.5692 0.075 7.635 0.000 0.405 0.733 C(age)[T.32.0] 0.5452 0.083 6.535 0.000 0.362 0.729 C(age)[T.33.0] 0.5457 0.075 7.245 0.000 0.380 0.712 C(age)[T.34.0] 0.5255 0.073 7.227 0.000 0.365 0.685 C(age)[T.35.0] 0.5492 0.074 7.463 0.000 0.387 0.711 C(age)[T.36.0] 0.5488 0.081 6.775 0.000 0.371 0.727 C(age)[T.37.0] 0.5506 0.077 7.184 0.000 0.382 0.719 C(age)[T.38.0] 0.5283 0.079 6.705 0.000 0.355 0.702 C(age)[T.39.0] 0.5477 0.083 6.625 0.000 0.366 0.730 C(age)[T.40.0] 0.5431 0.088 6.147 0.000 0.349 0.738 C(age)[T.41.0] 0.5340 0.063 8.520 0.000 0.396 0.672 C(age)[T.42.0] 0.5426 0.094 5.800 0.000 0.337 0.748 C(age)[T.43.0] 0.4841 0.092 5.277 0.000 0.282 0.686 C(age)[T.44.0] 0.4711 0.090 5.243 0.000 0.273 0.669 C(age)[T.45.0] 0.6426 0.098 6.553 0.000 0.427 0.858 C(age)[T.46.0] 0.9095 0.174 5.237 0.000 0.527 1.292 C(union)[T.1.0] 0.1809 0.038 4.810 0.001 0.098 0.264 hours 0.0028 0.002 1.829 0.095 -0.001 0.006 tenure 0.0328 0.003 9.372 0.000 0.025 0.040 race -0.1455 0.025 -5.797 0.000 -0.201 -0.090 ============================================================================== Omnibus: 1064.590 Durbin-Watson: 0.868 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2692.841 Skew: 0.329 Prob(JB): 0.00 Kurtosis: 4.729 Cond. No. 2.93e+04 ============================================================================== Notes: [1] Standard Errors are robust to cluster correlation (cluster) [2] The condition number is large, 2.93e+04. This might indicate that there are strong multicollinearity or other numerical problems.
However, this approach falls short for 2 reasons:
(Note: One may argue that the latter point hints at a faulty analysis design… but I'd prefer to not discuss the “whether it is correct to do so”.)
A model showing contrast in computation time is ln_wage ~ hours + tenure + race + C(age) + C(idcode), clustering SE by ind_code:
ln_wage ~ hours + tenure + race + C(age) + C(idcode)
ind_code
import numpy as np import statsmodels.datasets as sd import statsmodels.formula.api as sm import time df = sd.webuse("nlswork") dataset = df.dropna( subset=[ "ln_wage", "hours", "tenure", "race", "age", "idcode", "ind_code", ], how="any", ) start_time = time.perf_counter() model = sm.ols( formula="ln_wage ~ hours + tenure + race + C(age) + C(idcode)", data=dataset, ) results = model.fit( cov_type="cluster", cov_kwds={"groups": np.array(dataset["ind_code"])}, use_t=True, ) end_time = time.perf_counter() print(f"Model fitting duration: {end_time - start_time:.2f} seconds.") results.summary().as_text()
webuse nlswork set rmsg on reghdfe ln_wage hours tenure race, absorb(age idcode) vce(cluster ind_code)
A model showing Python unable to converge is ln_wage ~ hours + tenure + race + C(age) + C(idcode) + C(grade):
ln_wage ~ hours + tenure + race + C(age) + C(idcode) + C(grade)
I'd be happy to raise or contribute to a PR for it, although I might fall short on the theory of the stats.
Linked to #259
The text was updated successfully, but these errors were encountered:
No branches or pull requests
Context
I would effectively like to replicate Stata's
reghdfe
command, which is based on Correia (2016) estimator.This estimator allows (fast) converging with greater-than-2 levels of fixed-effect, as well as multi-way clustering.
Expected solution
I am unsure how this would look like — either a new IV estimation module, or somehow piggybacking on AbsorbingLS…
Alternatives
I can achieve most of it via a naive
statsmodels
implementation, as shown in the dummy example below:stdout
stdout
However, this approach falls short for 2 reasons:
reghdfe
(cf. below),(Note: One may argue that the latter point hints at a faulty analysis design… but I'd prefer to not discuss the “whether it is correct to do so”.)
A model showing contrast in computation time is
ln_wage ~ hours + tenure + race + C(age) + C(idcode)
, clustering SE byind_code
:Reproducible code
A model showing Python unable to converge is
ln_wage ~ hours + tenure + race + C(age) + C(idcode) + C(grade)
:Other
I'd be happy to raise or contribute to a PR for it, although I might fall short on the theory of the stats.
Linked to #259
The text was updated successfully, but these errors were encountered: