PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels #477

yuz0101 · 2022-10-12T16:05:42Z

Hi there, it seems that we can only use one cov_type. While comparing results based on a model clustered by a variable between linearmodels and Stata, I found that Stata's std. errors are robust adjusted while linearmodels are not (All coefficients are the same).

Please may I know how I can get the same results as Stata by using linearmodels?

bashtage · 2022-10-12T16:08:02Z

You need to change the cov_type in a call to fit.

https://bashtage.github.io/linearmodels/panel/panel/linearmodels.panel.model.PanelOLS.fit.html#linearmodels.panel.model.PanelOLS.fit

yuz0101 · 2022-10-12T16:18:09Z

Hi bashtage, thanks for your quick reply. I follow that. For example,

mod = PanelOLS.from_formula(formula, data) reg = mod.fit(cov_type='clustered', clusters=data['var'])

The standard errors from the reg are not robust adjusted.

If I change it to reg = mod.fit(cov_type='robust'), then results are not based on the clustered by var one.

Could you please help me with it? How can I get both robust- and clustered-results? I appreciate your time.

bashtage · 2022-10-12T16:36:18Z

What is formula? The definition of robust depends on whether entity effects are included. Clustered std errors are robust to heteroskedasticity.

yuz0101 · 2022-10-12T17:18:09Z

Thanks for that. y ~ 1 + a*b*c + a + b + c is the formula I'm using. There is no entity effects in my regression model. : )

bashtage · 2022-10-12T17:21:03Z

What Stata command are you using?

yuz0101 · 2022-10-12T17:33:02Z

gen abc = a*b*c
reg y abc a b c, vce (cluster var)

bashtage · 2022-10-12T17:51:27Z

What you have looks correct to me

mod = PanelOLS.from_formula(formula, data)
reg = mod.fit(cov_type='clustered', clusters=data['var'])

You could also use

mod = PooledOLS.from_formula(formula, data)
reg = mod.fit(cov_type='clustered', clusters=data['var'])

and the results should be the same.

yuz0101 · 2022-10-12T18:00:58Z

Thank you Kevin, for all your efforts.

I still got different results. I noticed that the std. error from Stata is robust (pls have a look at the scrnshot below). Sorry for the mess data shown below, they are shown in different orders. Would you please look at the last line of pic 1 (black background) and the first line of pic 2 (white background). They are the same variable.

The results from linearmodels:

The results from Stata:

bashtage · 2022-10-12T18:05:44Z

What happends when you take the ratio of the parameter variance from stata to that from linear models? Stata has a log of magic small sample adjustments it makes. If this ratio is the same for all parameters, this indicates that it is a scalar adjustment.

I did a quick check and this looks to be the issue. What does changing the value of debiased do?

yuz0101 · 2022-10-12T18:35:29Z

I just did a check, and most of the std. errors from Sata is 70% of the std. errors from linearmodels. For the debiased, I did the check and found nothing affecting the results.

bashtage · 2022-10-12T18:54:06Z

How large are your clusters and how many do you have? The ratio of the variances looks a lot like 2. If use use xtreg in Stata, do you get the same as reg? IME the adjustments can differ across the different estimators in Stata.

…

On Wed, Oct 12, 2022, 19:35 yuz0101 ***@***.***> wrote: I just did a check, and most of the std. errors from Sata is 70% of the std. errors from linearmodels. For the debiased, I did the check and found nothing affecting the results. — Reply to this email directly, view it on GitHub <#477 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABKTSRNET7M7XEYRXJPVKB3WC4AH3ANCNFSM6AAAAAARDOA3WE> . You are receiving this because you commented.Message ID: ***@***.***>

yuz0101 · 2022-10-13T10:31:34Z

Hi Kevin, you are right. Massive thanks for your time. The std, errors in my model are clustered by only one dummy variable. However, I tried that with xtreg and found that it produced errors in xtreg, mainly due to the usage of clusters. Then I tried to cluster the id, which completely worked in both linearmodels and Stata (using a command of reg), and they produced the same results. However, the same setting with xtreg in Stata produces different results from using reg and using linearmodels.

Q: Is it reasonable to cluster the std. errors with a dummy variable rather than a variable consisting of a number of clusters?

vgreg · 2024-01-19T11:55:04Z

Hi Kevin, adding my issue here because it is most likely related. By the way, thanks for all your work on linearmodels (and the awesome documentation!), it was the missing piece to my Python workflow.

I'm preparing notes for my students in which I was hoping to replicate Mitch Petersen's results using his sample dataset, which he obtained with Stata using the following command:
regress dependent_variable independent_variables, robust cluster(cluster_variable)

For reference, his Stata code is here, and the sample data and associated estimations are here.

His sample dataset contains 500 firms but only 10 years, so that dimension is small. I'm able to replicate his results using statsmodels, but not with linearmodels. More specifically, I get the same results when clustering by firm, but not when clustering by year or by both firm and year. Are you aware of a small sample adjustment that is implemented in statsmodels but not in linearmodels? debiased does not affect my results.

Here is the code to replicate the difference:

import pandas as pd
import statsmodels.formula.api as smf
from linearmodels import PanelOLS

df = pd.read_table(
    "http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.txt",
    names=["firmid", "year", "x", "y"],
    delim_whitespace=True,
)

smf.ols(formula="y ~ x", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["year"]}, use_t=True
).summary()

With statsmodels:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.208
Method:                 Least Squares   F-statistic:                     960.6
Date:                Fri, 19 Jan 2024   Prob (F-statistic):           1.86e-10
Time:                        06:51:42   Log-Likelihood:                -10573.
No. Observations:                5000   AIC:                         2.115e+04
Df Residuals:                    4998   BIC:                         2.116e+04
Df Model:                           1                                         
Covariance Type:              cluster                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0297      0.023      1.269      0.236      -0.023       0.083
x              1.0348      0.033     30.993      0.000       0.959       1.110
==============================================================================
Omnibus:                        4.912   Durbin-Watson:                   1.096
Prob(Omnibus):                  0.086   Jarque-Bera (JB):                4.862
Skew:                           0.070   Prob(JB):                       0.0880
Kurtosis:                       3.063   Cond. No.                         1.01
==============================================================================

Notes:
[1] Standard Errors are robust to cluster correlation (cluster)

df = df.set_index(["firmid", "year"])
mod = PanelOLS.from_formula("y ~ 1 + x", df)
res = mod.fit(cov_type="clustered", cluster_entity=False, cluster_time=True)
res.summary

With linearmodels:

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                      y   R-squared:                        0.2078
Estimator:                   PanelOLS   R-squared (Between):              0.2208
No. Observations:                5000   R-squared (Within):               0.1907
Date:                Fri, Jan 19 2024   R-squared (Overall):              0.2078
Time:                        06:51:42   Log-likelihood                -1.057e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      1310.7
Entities:                         500   P-value                           0.0000
Avg Obs:                      10.0000   Distribution:                  F(1,4998)
Min Obs:                      10.0000                                           
Max Obs:                      10.0000   F-statistic (robust):             1067.1
                                        P-value                           0.0000
Time periods:                      10   Distribution:                  F(1,4998)
Avg Obs:                       500.00                                           
Min Obs:                       500.00                                           
Max Obs:                       500.00                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      0.0297     0.0222     1.3376     0.1811     -0.0138      0.0732
x              1.0348     0.0317     32.667     0.0000      0.9727      1.0969
==============================================================================

bashtage · 2024-01-19T12:01:53Z

You probably need group_debias=True.

mod = PanelOLS.from_formula("y ~ 1 + x", df)
res = mod.fit(cov_type="clustered", cluster_entity=False, cluster_time=True, group_debias=True)
res.summary

                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                      y   R-squared:                        0.2078
Estimator:                   PanelOLS   R-squared (Between):              0.2208
No. Observations:                5000   R-squared (Within):               0.1907
Date:                Fri, Jan 19 2024   R-squared (Overall):              0.2078
Time:                        12:00:52   Log-likelihood                -1.057e+04
Cov. Estimator:             Clustered
                                        F-statistic:                      1310.7
Entities:                         500   P-value                           0.0000
Avg Obs:                      10.0000   Distribution:                  F(1,4998)
Min Obs:                      10.0000
Max Obs:                      10.0000   F-statistic (robust):             960.59
                                        P-value                           0.0000
Time periods:                      10   Distribution:                  F(1,4998)
Avg Obs:                       500.00
Min Obs:                       500.00
Max Obs:                       500.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      0.0297     0.0234     1.2691     0.2045     -0.0162      0.0755
x              1.0348     0.0334     30.993     0.0000      0.9694      1.1003
==============================================================================

vgreg · 2024-01-19T12:12:29Z

Thanks, that's what I was missing! I get the same results now.

yuz0101 changed the title ~~PanelOLS: How can i get results based on both the heteroskedasity adjusted standard errors and clustered model~~ PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels Oct 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels #477

PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels #477

yuz0101 commented Oct 12, 2022 •

edited

Loading

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022 •

edited

Loading

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022 •

edited

Loading

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022

bashtage commented Oct 12, 2022 via email

yuz0101 commented Oct 13, 2022 •

edited

Loading

vgreg commented Jan 19, 2024

bashtage commented Jan 19, 2024

vgreg commented Jan 19, 2024

PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels #477

PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels #477

Comments

yuz0101 commented Oct 12, 2022 • edited Loading

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022 • edited Loading

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022 • edited Loading

bashtage commented Oct 12, 2022

yuz0101 commented Oct 12, 2022

bashtage commented Oct 12, 2022 via email

yuz0101 commented Oct 13, 2022 • edited Loading

vgreg commented Jan 19, 2024

bashtage commented Jan 19, 2024

vgreg commented Jan 19, 2024

yuz0101 commented Oct 12, 2022 •

edited

Loading

yuz0101 commented Oct 12, 2022 •

edited

Loading

yuz0101 commented Oct 12, 2022 •

edited

Loading

yuz0101 commented Oct 13, 2022 •

edited

Loading