This repository has been archived by the owner on Nov 1, 2023. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 678bfd8, commit 0df85dd
Showing 1 changed file with 2 additions and 71 deletions.
@@ -1,72 +1,3 @@
# owid-repack-py
This library has been folded into OWID's ETL monorepo.

![build status](https://github.com/owid/owid-repack-py/actions/workflows/python-package.yml/badge.svg)
[![PyPI version](https://badge.fury.io/py/owid-repack.svg)](https://badge.fury.io/py/owid-repack)
![](https://img.shields.io/badge/python-3.8|3.9|3.10|3.11-blue.svg)

_Pack Pandas DataFrames into smaller, more memory-efficient types._

## Overview

When you load data into Pandas, it uses standard types by default:

- `object` for strings
- `int64` for integers
- `float64` for floating point numbers

However, many datasets fit in a much more compact representation than these defaults. A more compact representation leads to lower memory usage and to smaller binary files on disk when using formats such as Feather and Parquet.

This library does just one thing: it shrinks your data frames to use smaller types.

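To make the memory claim concrete, here is a minimal sketch that uses only plain pandas (not `owid-repack`). The example data is hypothetical: 100,000 integers in the range 0 to 99, which fit in `uint8` but are stored as `int64` here to mirror the default.

```python
import pandas as pd

# Hypothetical example data: values 0..99 stored with the default 64-bit integer type.
s = pd.Series([i % 100 for i in range(100_000)], dtype="int64")

# Same values, stored in one byte each instead of eight.
compact = s.astype("uint8")

# Compare the memory used by the values themselves (index excluded).
default_bytes = s.memory_usage(index=False)
compact_bytes = compact.memory_usage(index=False)
print(default_bytes, compact_bytes)
```

The values are unchanged by the cast; only the storage width differs, which is the kind of saving the library aims for.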
## Installing

`pip install owid-repack`

## Usage

The `owid.repack` module exposes two functions, `repack_series()` and `repack_frame()`.

`repack_series()` detects the smallest type that can accurately fit the existing data in the series.

```ipython
In [1]: import pandas as pd
In [2]: from owid import repack
In [3]: pd.Series([1, 2, 3])
Out[3]:
0    1
1    2
2    3
dtype: int64
In [4]: repack.repack_series(pd.Series([1.5, 2, 3]))
Out[4]:
0    1.5
1    2.0
2    3.0
dtype: float32
In [5]: repack.repack_series(pd.Series([1, None, 3]))
Out[5]:
0       1
1    <NA>
2       3
dtype: UInt8
In [6]: repack.repack_series(pd.Series([-1, None, 3]))
Out[6]:
0      -1
1    <NA>
2       3
dtype: Int8
```

The `repack_frame()` function simply does this across every column in your DataFrame, returning a new DataFrame.

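The per-column idea can be sketched with plain pandas. This is not the library's implementation: `shrink_series` and `shrink_frame` are hypothetical names, the sketch relies on `pd.to_numeric(..., downcast=...)`, and unlike the `repack_series()` examples it does not convert columns with missing values to nullable integer types.

```python
import pandas as pd


def shrink_series(s: pd.Series) -> pd.Series:
    """Hypothetical helper: downcast one numeric column, leave others untouched."""
    if pd.api.types.is_integer_dtype(s):
        # Non-negative data can use an unsigned type; otherwise use a signed one.
        kind = "unsigned" if (s >= 0).all() else "signed"
        return pd.to_numeric(s, downcast=kind)
    if pd.api.types.is_float_dtype(s):
        # pandas downcasts floats to float32 at the smallest.
        return pd.to_numeric(s, downcast="float")
    return s


def shrink_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply shrink_series to every column, returning a new frame."""
    return pd.DataFrame(
        {col: shrink_series(df[col]) for col in df.columns}, index=df.index
    )


df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [-1, 2, 3], "c": [1.5, 2.0, 3.0], "d": ["x", "y", "z"]}
)
small = shrink_frame(df)
print(small.dtypes)
```

Building the result from a dict of columns keeps each column's dtype independent, so shrinking one column does not affect the others.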
## Releases

- `0.1.2`:
  - Shrink columns with all NaNs to Int8
- `0.1.1`:
  - Fix Python support in package metadata to support 3.8.1 onwards
- `0.1.0`:
  - Migrate first version from `owid-catalog-py` repo

Find it here: https://github.com/owid/etl/tree/master/lib/repack