A Pythonic API for working with OWID's data catalog.
Status: experimental, APIs likely to change
Install with `pip install owid-catalog`. Then you can begin exploring the experimental data catalog:
from owid import catalog
# look for Covid-19 data, return a data frame of matches
catalog.find('covid')
# load Covid-19 data from the Our World In Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()
# load data from a channel other than the default `garden`
lung_cancer_tables = catalog.find('lung_cancer_deaths_per_100000_men', channels=['open_numbers'])
df = lung_cancer_tables.iloc[0].load()
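Since `catalog.find` returns an ordinary data frame of matches, you can filter it with pandas before loading anything. A minimal sketch; the exact match columns vary between catalog versions, so the ones inspected here are illustrative:
from owid import catalog

# fetch all matches, then inspect them before loading anything
matches = catalog.find('covid')
print(matches.head())

# each row of the match frame can be loaded as a data frame
df = matches.iloc[0].load()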
You need Python 3.8+, `poetry` and `make` installed. Clone the repo, then you can simply run:
# run all unit tests and CI checks
make test
# watch for changes, then run all checks
make watch
A catalog is an arbitrarily deep folder structure containing datasets. It can be local on disk, or remote.
from owid.catalog import RemoteCatalog

# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()
# get a list of matching tables in different datasets
matches = cat.find('population')
# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')
# load channels other than the default `garden`
cat = RemoteCatalog(channels=('garden', 'meadow', 'open_numbers'))
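A catalog on disk can be browsed the same way. A minimal sketch, assuming you have a catalog folder locally; the path below is hypothetical, and `LocalCatalog` is the on-disk counterpart of `RemoteCatalog`:
from owid.catalog import LocalCatalog

# open a catalog folder on disk (hypothetical path) and search it like a remote one
cat = LocalCatalog('/path/to/catalog')
matches = cat.find('population')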
A dataset is a folder of tables, together with metadata about the overall collection.
- Metadata about the dataset lives in `index.json`
- All tables in the folder must share a common format (CSV or Feather)
from owid.catalog import Dataset

# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')
# choose CSV instead of feather for files
ds = Dataset.create('/tmp/my_data', format='csv')
# serialize a table using the table's name and the dataset's default format (feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(table)
# remove a table from the dataset
ds.remove('table_name')
# load a table including metadata into memory
t = ds['my_table']
# the length is the number of tables discovered on disk
assert len(ds) > 0
# iterate over the tables discovered on disk
for table in ds:
    do_something(table)
# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()
# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')
import shutil

# copying a dataset is identical to copying its folder, so this works too
shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')
Tables are essentially pandas DataFrames, but with metadata. All operations on them occur in memory, except for loading from and saving to disk. On disk, they are represented by a tabular file (Feather or CSV) and a JSON metadata file.
Columns of `Table` have a `VariableMeta` attribute, including their type, description, and unit. Be careful when manipulating them: not all operations are currently supported. Supported: adding a column, renaming columns. Not supported: direct assignment to `t.columns = ...` or to index names `t.columns.index = ...`.
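For example, a short sketch of the supported operations, assuming (as described above) that renaming carries each column's metadata along:
from owid.catalog import Table

t = Table({'gdp': [1, 2, 3]})
t.gdp.description = 'GDP measured in 2011 international $'

# supported: adding a new column
t['population'] = [10, 20, 30]

# supported: renaming columns, which keeps per-column metadata
t = t.rename(columns={'gdp': 'gdp_2011'})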
from owid.catalog import Table, Source, License

# same API as DataFrames
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
}).set_index('country')
t.title = 'Very important data'
t.gdp.description = 'GDP measured in 2011 international $'
# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='https://creativecommons.org/licenses/by-sa/4.0/')
]
# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')
# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')
These work like normal pandas DataFrames, but if there is also a `my_table.meta.json` file, then metadata will also be read. Otherwise, the data is assumed to have no metadata:
t = Table.read_feather('/tmp/my_table.feather')
t = Table.read_csv('/tmp/my_table.csv')
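To check that metadata survives the round trip, a minimal sketch with illustrative paths and values:
from owid.catalog import Table

t = Table({'gdp': [1, 2, 3]})
t.title = 'Very important data'

# writes /tmp/my_table.feather and /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')

# reading it back should restore the metadata
t2 = Table.read_feather('/tmp/my_table.feather')
assert t2.title == 'Very important data'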
master
- Optional `repack` argument when adding tables to dataset
- Underscore `|`
- Get `version` field from `DatasetMeta` init
- Resolve collisions of `underscore_table` function
v0.2.9
- Allow multiple channels in `catalog.find` function
v0.2.8
- Update `OWID_CATALOG_VERSION` to 2
v0.2.7
- Split datasets into channels (`garden`, `meadow`, `open_numbers`, ...) and make `garden` the default one
- Add `.find_latest` method to Catalog
v0.2.6
- Add flag `is_public` for public/private datasets
- Enforce snake_case for table, dataset and variable short names
- Add fields `published_by` and `published_at` to Source
- Added a list of supported and unsupported operations on columns
- Updated `pyarrow`
v0.2.5
- Fix ability to load remote CSV tables
v0.2.4
- Update the default catalog URL to use a CDN
v0.2.3
- Fix methods for finding and loading data from a `LocalCatalog`
v0.2.2
- Repack frames to compact dtypes on `Table.to_feather()`
v0.2.1
- Fix key typo used in version check
v0.2.0
- Copy dataset metadata into tables, to make tables more traceable
- Add API versioning, and a requirement to update if your version of this library is too old
v0.1.1
- Add support for Python 3.8
v0.1.0
- Initial release, including searching and fetching data from a remote catalog