Core Python data types for working with OWID's data catalog.

nossobrasilemdados/owid-catalog-py

owid-catalog

A Pythonic API for working with OWID's data catalog.

Status: experimental, APIs likely to change

Quickstart

Install with pip install owid-catalog. Then you can begin exploring the experimental data catalog:

from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World In Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

# load data from a channel other than the default `garden`
lung_cancer_tables = catalog.find('lung_cancer_deaths_per_100000_men', channels=['open_numbers'])
df = lung_cancer_tables.iloc[0].load()

Development

You need Python 3.8+, poetry and make installed. Clone the repo, then you can simply run:

# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch

Data types

Catalog

A catalog is an arbitrarily deep folder structure containing datasets. It can be local on disk, or remote.

Load the remote catalog

from owid.catalog import RemoteCatalog

# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()

# get a list of matching tables in different datasets
matches = cat.find('population')

# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')

# load channels other than just `garden`
cat = RemoteCatalog(channels=('garden', 'meadow', 'open_numbers'))

Datasets

A dataset is a folder of tables, together with metadata about the overall collection.

  • Metadata about the dataset lives in index.json
  • All tables in the folder must share a common format (CSV or Feather)
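
To make this layout concrete, here is a minimal sketch of how one might discover datasets in an arbitrarily deep folder tree by looking for index.json. It uses only the standard library, not owid-catalog itself. The helper name `find_datasets` and the rule that table files are recognised by a `.csv` or `.feather` extension are assumptions for illustration, not the library's implementation.

```python
import json
import tempfile
from pathlib import Path

# Assumption: table files are recognised purely by these extensions.
TABLE_SUFFIXES = {".csv", ".feather"}


def find_datasets(root):
    """Map each folder under `root` that holds an index.json to its table names."""
    found = {}
    for index_file in sorted(Path(root).rglob("index.json")):
        folder = index_file.parent
        tables = sorted(
            p.stem for p in folder.iterdir() if p.suffix in TABLE_SUFFIXES
        )
        found[folder.relative_to(root).as_posix()] = tables
    return found


# Build a throwaway catalog two levels deep and scan it.
with tempfile.TemporaryDirectory() as tmp:
    ds_dir = Path(tmp) / "gapminder" / "population"
    ds_dir.mkdir(parents=True)
    (ds_dir / "index.json").write_text(json.dumps({"title": "Population"}))
    (ds_dir / "population.csv").write_text("country,population\nAU,25\n")
    datasets = find_datasets(tmp)

print(datasets)
```

The folder names and CSV contents above are made up for the example. The real library also reads the metadata stored in index.json, which this sketch ignores.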

Create a new dataset

from owid.catalog import Dataset

# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')
# choose CSV instead of feather for files
ds = Dataset.create('/tmp/my_data', format='csv')

Add a table to a dataset

# serialize a table using the table's name and the dataset's default format (feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(table)

Remove a table from a dataset

ds.remove('table_name')

Access a table

# load a table including metadata into memory
t = ds['my_table']

List tables

# the length is the number of tables discovered on disk
assert len(ds) > 0
# iterate over the tables discovered on disk
for table in ds:
    do_something(table)

Add metadata

# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()

Copy a dataset

# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')

# copying a dataset is identical to copying its folder, so this works too
import shutil

shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')

Tables

Tables are essentially pandas DataFrames with metadata attached. All operations on them occur in memory, except for loading from and saving to disk. On disk, each table is represented by a tabular file (Feather or CSV) and a JSON metadata file.

Each column of a Table carries a VariableMeta attribute holding its type, description, and unit. Be careful when manipulating columns: not all operations are currently supported. Supported: adding a column, renaming columns. Not supported: direct assignment to t.columns = ... or to index names t.columns.index = ....

Make a new table

from owid.catalog import Table

# same API as DataFrames
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
}).set_index('country')

Add metadata about the whole table

t.title = 'Very important data'

Add metadata about a field

t.gdp.description = 'GDP measured in 2011 international $'
# set sources on the field itself, not on the whole table
t.gdp.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]

Add metadata about all fields at once

# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='https://creativecommons.org/licenses/by-sa/4.0/')
]

Save a table to disk

# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')

# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')
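
As the comments above indicate, both formats write their metadata to the same sidecar name next to the data file. A minimal sketch of that naming rule follows. The helper `meta_path_for` is hypothetical and is not part of owid-catalog; it only illustrates how the sidecar path relates to the data path.

```python
from pathlib import PurePosixPath


def meta_path_for(data_path):
    """Return the <name>.meta.json path that sits next to a table file."""
    p = PurePosixPath(data_path)
    # Swap the data extension (.feather or .csv) for .meta.json.
    return p.with_name(p.stem + ".meta.json")


feather_meta = meta_path_for("/tmp/my_table.feather")
csv_meta = meta_path_for("/tmp/my_table.csv")
print(feather_meta, csv_meta)
```

Because the metadata file name depends only on the table name, saving the same table as both Feather and CSV in one folder shares a single metadata file.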

Load a table from disk

These work like the normal pandas readers, but if a my_table.meta.json file sits next to the data file, its metadata is read as well. Otherwise the data is assumed to have no metadata:

t = Table.read_feather('/tmp/my_table.feather')

t = Table.read_csv('/tmp/my_table.csv')
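
The optional-metadata behaviour can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `read_csv_with_meta` is a hypothetical helper, rows are returned as plain dicts rather than a Table, and the `title` key in the JSON file is made up for the example.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_csv_with_meta(path):
    """Read CSV rows, plus metadata from <name>.meta.json when that file exists."""
    path = Path(path)
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    meta_file = path.with_name(path.stem + ".meta.json")
    # No sidecar file means the data is treated as having no metadata.
    meta = json.loads(meta_file.read_text()) if meta_file.exists() else {}
    return rows, meta


with tempfile.TemporaryDirectory() as tmp:
    table_file = Path(tmp) / "my_table.csv"
    table_file.write_text("country,gdp\nAU,1\nSE,2\n")

    # First read: no metadata file is present yet.
    rows, meta_before = read_csv_with_meta(table_file)

    # Second read: a sidecar file now sits next to the data.
    (Path(tmp) / "my_table.meta.json").write_text(
        json.dumps({"title": "Very important data"})
    )
    _, meta_after = read_csv_with_meta(table_file)
```

Here `meta_before` is empty and `meta_after` holds the contents of the sidecar file, mirroring the two cases described above.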

Changelog

  • master
    • Optional repack argument when adding tables to dataset
    • Underscore |
    • Get version field from DatasetMeta init
    • Resolve collisions of underscore_table function
  • v0.2.9
    • Allow multiple channels in catalog.find function
  • v0.2.8
    • Update OWID_CATALOG_VERSION to 2
  • v0.2.7
    • Split datasets into channels (garden, meadow, open_numbers, ...) and make garden default one
    • Add .find_latest method to Catalog
  • v0.2.6
    • Add flag is_public for public/private datasets
    • Enforce snake_case for table, dataset and variable short names
    • Add fields published_by and published_at to Source
    • Added a list of supported and unsupported operations on columns
    • Updated pyarrow
  • v0.2.5
    • Fix ability to load remote CSV tables
  • v0.2.4
    • Update the default catalog URL to use a CDN
  • v0.2.3
    • Fix methods for finding and loading data from a LocalCatalog
  • v0.2.2
    • Repack frames to compact dtypes on Table.to_feather()
  • v0.2.1
    • Fix key typo used in version check
  • v0.2.0
    • Copy dataset metadata into tables, to make tables more traceable
    • Add API versioning, and a requirement to update if your version of this library is too old
  • v0.1.1
    • Add support for Python 3.8
  • v0.1.0
    • Initial release, including searching and fetching data from a remote catalog
