Python package to run stratified/saturated regressions out-of-memory with duckdb. The package wraps the duckdb package and provides a simple interface for running regressions on very large datasets that do not fit in memory: it reduces the data to a set of summary statistics and runs weighted least squares with frequency weights. Robust standard errors are computed from sufficient statistics, while clustered standard errors are computed using the cluster bootstrap. Methodological details and benchmarks are provided in the paper cited below. See examples in `notebooks/introduction.ipynb`.
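To fix ideas, here is a minimal sketch of the compression trick itself (an illustration of the idea, not the package's internals): collapse the table to one row per unique regressor cell with duckdb, then run weighted least squares with the cell counts as frequency weights. The database, table, and column names below are hypothetical.

```python
import duckdb
import numpy as np

con = duckdb.connect("large_dataset.db")  # hypothetical duckdb database file

# Compress: one row per unique value of the regressor d, with the cell
# count n and the cell mean of the outcome y. This runs out-of-core.
compressed = con.execute(
    "SELECT d, COUNT(*) AS n, AVG(y) AS ybar FROM data GROUP BY d"
).fetch_df()

# WLS with frequency weights n: solve (X'WX) b = X'W ybar. Because the
# regressors are constant within each cell, this reproduces the point
# estimates of OLS on the full microdata.
X = np.column_stack([np.ones(len(compressed)), compressed["d"].to_numpy()])
w = compressed["n"].to_numpy().astype(float)
ybar = compressed["ybar"].to_numpy()
beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * ybar))
print(beta)
```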
- install: `pip install duckreg`
- dev install (preferably in a venv) with `(uv) pip install git+https://github.com/apoorvalal/duckreg.git`, or git clone this repository and install in editable mode (`pip install -e .`)
Currently supports the following regression specifications:
- `DuckRegression`: general linear regression, which compresses the data to $y$ averages stratified by all unique values of the $x$ variables
- `DuckMundlak`: one- or two-way Mundlak regression, which replaces unit (and time) FEs with unit averages $\bar{w}_i$ (and time averages $\bar{w}_t$) of the covariates on the RHS, and compresses the data accordingly
- `DuckDoubleDemeaning`: double-demeaning regression, which compresses the data to $y$ averages by all values of $w$ after demeaning; this also eliminates unit and time FEs
- `DuckMundlakEventStudy`: two-way Mundlak with dynamic treatment effects, which incorporates treatment-cohort FEs ($\psi_i$), time-period FEs ($\gamma_t$), and dynamic treatment effects $\tau_k$ given by cohort $\times$ time interactions

All of the above regressions are run in compressed fashion with duckdb; a usage sketch follows.
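As a rough illustration, typical usage looks something like the sketch below. The constructor arguments and method names here are assumptions inferred from the introduction notebook, so treat this as a sketch rather than a reference; consult `notebooks/introduction.ipynb` for the canonical interface.

```python
# Hedged usage sketch: DuckRegression is the package's estimator class, but
# the argument names (db_name, table_name, formula, cluster_col,
# n_bootstraps, seed) and the fit()/summary() calls are assumptions --
# see notebooks/introduction.ipynb for the authoritative interface.
from duckreg.estimators import DuckRegression

m = DuckRegression(
    db_name="large_dataset.db",  # hypothetical duckdb database file
    table_name="data",           # hypothetical table inside it
    formula="y ~ d",             # hypothetical outcome and regressor
    cluster_col="",              # empty: robust (non-clustered) SEs
    n_bootstraps=100,            # bootstrap reps for clustered SEs
    seed=42,
)
m.fit()
print(m.summary())
```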
Please cite the following paper if you use `duckreg` in your research:
```bibtex
@misc{lal2024largescalelongitudinalexperiments,
  title={Large Scale Longitudinal Experiments: Estimation and Inference},
  author={Apoorva Lal and Alexander Fischer and Matthew Wardrop},
  year={2024},
  eprint={2410.09952},
  archivePrefix={arXiv},
  primaryClass={econ.EM},
  url={https://arxiv.org/abs/2410.09952},
}
```