Fitted results from linearmodels can be pickled with pickle.dump. These pickled files contain the estimated parameters alongside all the data used to estimate them. Storing that data is generally (always?) undesirable: keeping it in the results substantially increases the size of the pickled files, and once estimated, the parameters no longer need these potentially large datasets in order to be displayed or processed.
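For instance, a fitted result round-trips through pickle like this (a minimal sketch; res is assumed to be any fitted linearmodels results object):

import pickle

# Persist the fitted results to disk; the pickle currently also contains
# the full estimation data, which is what makes the files large.
with open("results.pkl", "wb") as f:
    pickle.dump(res, f)

# Restore it later; the estimated parameters survive the round trip.
with open("results.pkl", "rb") as f:
    res = pickle.load(f)
print(res.params)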
Example
My use case is as follows, with a large (N = 500,000, T = 123) panel dataset:
Create a list of all desired model specifications and comparisons
Estimate all the different models
Save different comparisons of these results with compare
In pseudocode:
specifications = pd.DataFrame({"formulas": formulas, "criterium": criteria})
results = []
for formula in specifications["formulas"]:
    model = PanelOLS.from_formula(formula, data=panel_data)
    res = model.fit()
    results.append(res)
specifications["results"] = results
for criterium in specifications["criterium"].unique():
    subset = specifications.query("criterium == @criterium")["results"]
    comparison = compare(list(subset))
    comparison.summary.as_latex()
As my dataset is very large, pickling results or the specifications DataFrame takes up multiple GB just to store a small number of estimated parameters. Ideally, I would be able to store/pickle the results alone. That way, I can separate estimating the models from comparing them. For example, this would allow someone to run the estimations overnight and kill the process once done.
Workaround
I created this hacky workaround to remove many of the attributes on the model and results objects that aren't required if you're only interested in storing the results. With this, I can reduce the size of the pickled objects from ~50 GB to around 250 MB.
import functools

def fake_cov(*args, _deferred_cov=None, **kwargs):
    # Stand-in for the deferred covariance callable: always return the
    # covariance that was already computed.
    return _deferred_cov

def shrink_mod_and_res(mod, res):
    """
    Remove the DataFrames and other large objects that are unnecessarily
    stored in the model and results objects.
    """
    # Keep a single row so the frames stay structurally valid, but drop
    # the bulk of the data.
    mod.dependent._frame = mod.dependent._frame.head(1)
    mod.dependent._original = None
    mod.dependent._panel = None
    mod.exog._frame = mod.exog._frame.head(1)
    mod.exog._original = None
    mod.exog._panel = None
    mod.weights._frame = mod.weights._frame.head(1)
    mod.weights._original = None
    mod.weights._panel = None
    mod._cov_estimators = None
    mod._x = None
    mod._y = None
    mod._w = None
    mod._not_null = None
    mod._original_index = None
    res._resids = None
    res._wresids = None
    res._original_index = None
    res._effects = None
    res._index = None
    res._fitted = None
    res._idiosyncratic = None
    res._not_null = None
    # Evaluate the deferred covariance once, then replace the callable with
    # one that simply returns the precomputed value, dropping its references
    # to the estimation data.
    _deferred_cov = res._deferred_cov()
    res._deferred_cov = functools.partial(fake_cov, _deferred_cov=_deferred_cov)
    return mod, res

model = PanelOLS(y, x)
res = model.fit()
model, res = shrink_mod_and_res(model, res)
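After shrinking, the result pickles as before; a quick way to verify the size reduction (pickle.dumps serializes to bytes in memory, so len() gives the pickle size):

import pickle

shrunk = pickle.dumps(res)
print(f"shrunk pickle: {len(shrunk) / 1e6:.1f} MB")

# The restored object still exposes the estimates that compare() needs.
restored = pickle.loads(shrunk)
print(restored.params)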
It's not clear to me why the calculation of the covariance is deferred. I suppose it allows changing the covariance estimator after estimation; if so, this hacky method only preserves the covariance estimate that has already been computed.
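My best guess is that the size problem here is generic Python behaviour rather than anything linearmodels-specific: a deferred callable keeps references to its inputs, and pickle follows those references. A minimal sketch (the identity helper is hypothetical; shapes chosen to match my panel):

import functools
import numpy as np

def identity(value):
    return value

big = np.ones((500_000, 123))
deferred = functools.partial(np.dot, big.T, big)  # keeps big alive inside the partial
cov = deferred()                                  # evaluate once: a small 123 x 123 array
deferred = functools.partial(identity, cov)       # pickling this no longer drags big along

This is exactly what the fake_cov trick above does for res._deferred_cov.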
Suggestion
Implement a (cleaner) method to remove the large datasets contained in the results, similar to the remove_data flag of the .save() method on statsmodels' models.
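For reference, this is roughly what that looks like in statsmodels (results here is assumed to be any fitted statsmodels results object):

import statsmodels.api as sm

# remove_data=True strips the data arrays (exog, resid, fittedvalues, ...)
# from the results before pickling.
results.save("ols_results.pkl", remove_data=True)

restored = sm.load("ols_results.pkl")
print(restored.params)  # estimates remain available without the data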