feat(RFC): Adds `altair.datasets` #3631

dangotbanned · 2024-10-04T18:57:00Z

Description

Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

No datasets are included in the package
- Instead, included is only a single 18.7KB file metadata.parquet
- The file describes all versions of all datasets
  - provided they are accessible via both npm and github
Strong support for typing
- Annotations are generated from the metadata itself
- https://github.com/vega/altair/blob/9e9deeb95668d2c4e7d30311e85a8f9f6acdc88c/altair/datasets/_typing.py
So far, 5 backends have been implemented, instead of only pandas
- These provide precise IDE completions, with a lot of help from https://github.com/narwhals-dev/narwhals
Users can opt-in to caching remote dataset requests
- With the "polars" backend, the slowest I've had on a cache-hit is 0.1s to load
  - https://cdn.jsdelivr.net/npm/[email protected]/data/flights-200k.json

Examples

These all come from the docstrings of:

Loader
Loader.from_backend
Loader.__call__

import altair as alt
from altair.datasets import Loader

data = Loader.from_backend("polars")
>>> data
Loader[polars]

cars = data("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame

data = Loader.from_backend("pandas")
cars = data("cars")

>>> type(cars)
pandas.core.frame.DataFrame

data = Loader.from_backend("pandas[pyarrow]")
cars = data("cars", tag="v1.29.0")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                string[pyarrow]
Miles_per_Gallon    double[pyarrow]
Cylinders            int64[pyarrow]
Displacement        double[pyarrow]
Horsepower           int64[pyarrow]
Weight_in_lbs        int64[pyarrow]
Acceleration        double[pyarrow]
Year                string[pyarrow]
Origin              string[pyarrow]
dtype: object

data = Loader.from_backend("pandas")
source = data("stocks", tag="v2.10.0")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')

data = Loader.from_backend("pyarrow")
source = data("stocks", tag="v2.10.0")

>>> source.column_names
['symbol', 'date', 'price']

Tasks

Resolved

Investigate bundling metadata

Investigating bundling metadata (22a5039), (1792340)
- Depending on how well the compression scales, it might be reasonable to include this for some number of versions
- Deliberately including redundant info early on - can always chip away at it later

npm does not have every version available GitHub

Sources
- npm/vega-datasets
  - Fixed with: https://data.jsdelivr.com/v1/packages/npm/vega-datasets
- https://github.com/vega/vega-datasets/tags
Known missing
feat(DRAFT): Add a source for available npm versions
Need to add some handling to invalidate these entries returned from list-repository-tags once confirmed they cannot be requested from npm
- Can technically request from github, but during testing this was much slower
- Also, these versions would not have been available from https://github.com/altair-viz/vega_datasets, since that only used npm

Plan strategy for user-configurable dataset cache

Everything so far has been building the tools for a compact bundled index
- 1, 2, 3, 4, 5
- Refreshing the index would not be included in altair, each release would simply ship with changes baked in
Trying to avoid bloating altair package size with datasets
User-facing
- Goal of requesting each unique dataset version once
  - The user cache would not need to be updated between altair versions
- Some kind of opt-in config to say store the datasets in this directory please
  - Basic solution would be defining an env variable like ALTAIR_DATASETS_DIR
  - When not provided, always perform remote requests
    - User motivation would be that it would be faster to enable caching

Deferred

Reducing cache footprint

e.g. storing the .(csv|tsv|json) files as .parquet
Need to do more testing on this though to ensure
- the shape of each dataset is preserved
- where relevant - intentional errors remain intact

Investigate providing a decorator to add a backend

Will be trivial for the user-side, since they don't need to be concerned about imports
Just need to provide these attributes:
- _name: LiteralString
- _read_fn: dict[Extension, Callable[..., IntoDataFrameT]]
- _scan_fn: dict[_ExtensionScan, Callable[..., IntoFrameT]]

Investigate ways to utilize (https://github.com/vega/vega-datasets/blob/main/SOURCES.md)

Can use a similar approach to feat: Generate expr method signatures, docs #3600
Explored a little locally, finishing it off in this PR would add quite a lot of markdown parsing code

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`

How many datasets, size (per & total)?
What version range does a given sha cover?
Blocked: Running into issues with
- pandas/pyarrow group_by warnings
- min and max return all nulls in pl.Enum pola-rs/polars#18394
- Missing nw.Expr.(first|last)
- nw.Expr.(head|tail)(1) not equivalent in a group_by().agg(...) context
  - pandas -> scalar
  - polars -> list
- pl.Enum translating to non-ordered pd.Categorical

polars-native solution

from __future__ import annotations

from pathlib import Path

import polars as pl
from altair.datasets import Loader, _readers

data = Loader.from_backend("polars")

# NOTE: Enable caching, populate with some responses
data.cache_dir = Path.home() / ".altair_cache"
data("cars")
data("cars", tag="v1.5.0")
data("movies")
data("movies", tag="v1.24.0")
data("jobs")


if cache_dir := data.cache_dir:
    cached_stems: tuple[str, ...] = tuple(fp.stem for fp in cache_dir.iterdir())
else:
    msg = "Datasets cache unset"
    raise TypeError(msg)

# NOTE: Lots of redundancies, many urls point to the same data (sha)
>>> pl.read_parquet(_readers._METADATA).shape
# (2879, 9)

# NOTE: Version range per sha
tag_sort: pl.Expr = pl.col("tag").sort()
tag_range: pl.Expr = pl.concat_str(tag_sort.first(), tag_sort.last(), separator=" - ")

# NOTE: Producing a name only when the file is already in the cache
file_name: pl.Expr = pl.when(pl.col("sha").is_in(cached_stems)).then(
    pl.concat_str("sha", "suffix")
)

cache_summary: pl.DataFrame = (
    pl.scan_parquet(_readers._METADATA)
    .group_by("dataset_name", "suffix", "size", "sha")
    .agg(tag_range=tag_range)
    .select(pl.exclude("sha"), file_name=file_name)
    .sort("dataset_name", "size")
    .collect()
)

>>> cache_summary.shape
# (116, 5)

>>> cache_summary.head(10)

shape: (10, 5)
┌───────────────┬────────┬─────────┬───────────────────┬─────────────────────────────────┐
│ dataset_name  ┆ suffix ┆ size    ┆ tag_range         ┆ file_name                       │
│ ---           ┆ ---    ┆ ---     ┆ ---               ┆ ---                             │
│ str           ┆ str    ┆ i64     ┆ str               ┆ str                             │
╞═══════════════╪════════╪═════════╪═══════════════════╪═════════════════════════════════╡
│ 7zip          ┆ .png   ┆ 3969    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ airports      ┆ .csv   ┆ 210365  ┆ v1.5.0 - v2.10.0  ┆ 608ba6d51fa70584c3fa1d31eb9453… │
│ annual-precip ┆ .json  ┆ 266265  ┆ v1.29.0 - v2.10.0 ┆ null                            │
│ anscombe      ┆ .json  ┆ 1703    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ barley        ┆ .json  ┆ 8487    ┆ v1.5.0 - v2.10.0  ┆ 8dc50de2509b6e197ce95c24c98f90… │
│ birdstrikes   ┆ .csv   ┆ 1223329 ┆ v2.0.0 - v2.10.0  ┆ null                            │
│ birdstrikes   ┆ .json  ┆ 4183924 ┆ v1.5.0 - v1.31.1  ┆ null                            │
│ budget        ┆ .json  ┆ 374289  ┆ v1.5.0 - v2.8.1   ┆ null                            │
│ budget        ┆ .json  ┆ 391353  ┆ v2.9.0 - v2.10.0  ┆ null                            │
│ budgets       ┆ .json  ┆ 18079   ┆ v1.5.0 - v2.10.0  ┆ 8a909e24f698a3b0f6c637c30ec95e… │
└───────────────┴────────┴─────────┴───────────────────┴─────────────────────────────────┘

- Allow quickly switching between version tags #3150 (comment)

To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)

Not required for these requests, but may be helpful to avoid limits

As an example, for comparing against the most recent I've added the 5 most recent

See https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags

- Basic mechanism for discovering new versions - Tries to minimise number of and total size of requests

Experimenting with querying the url cache w/ expressions

- `metadata_full.parquet` stores **all known** file metadata - `GitHub.refresh()` to maintain integrity in a safe manner - Roughly 3000 rows - Single release: **9kb** vs 46 releases: **21kb**

https://github.com/vega/altair/actions/runs/11495437283/job/31994955413

- Still undecided exactly how this functionality should work - Need to resolve `npm` tags != `gh` tags issue as well

Doesn't happen in CI, still unclear why the import within `pandas` breaks under these conditions. Have tried multiple combinations of `pytest.MonkeyPatch`, hard imports, but had no luck in fixing the bug

mattijn · 2024-11-22T07:35:44Z

I'm reviewing as an average user of Altair and for this use-case it is probably an associate professor who will need to update all her/his lecture materials at the evening before the semester starts.

What would it be great if we could say:

# old way (this is deprecated)
from vega_datasets import data

# new way (this will be awesome)
from altair.datasets import data

And everything else is still functioning. So this still works:

source_url = data.cars.url
source_pandas = data.cars()

But, the awesome thing that we provide with this PR is:

source_polars = data.cars(backend="polars")  # or `engine=`

Or polars with pyarrow dtypes:

source_pl_pa = data.cars(backend="polars[pyarrow]")  # or `engine=`

If it is like this than I'm fine with engine or backend as argument name. And than within this function we call the agnostic Loader using the dataset and backend choice. All in all, in my humble opinion, awesomeness.

The next commits benefit from having functionality decoupled from `_Reader.query`. Mainly, keeping things lazy and not raising a user-facing error

Simplifies logic that relies on enum/categoricals that may not be recognised as ordered

Docs to follow

#3687 (comment)

narwhals-dev/narwhals#1426, #3693 (comment)

dangotbanned · 2024-11-24T10:16:33Z

@jonmmease I just tried updating this branch, seems to be some vegafusion issues?

9d97096 (#3631)

Update

Resolved in #3702

f21b52b

Feature has been adopted upstream in narwhals-dev/narwhals#1417

Not using doctest style here, none of these return anything but I want them hinted at

Mutability is not needed. Also see #3573

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

@joelostblom

Provides a generalized solution to `pd.read_(csv|json)` requiring the names of date columns to attempt parsing. cc @joelostblom The solution is possible in large part to vega/vega-datasets#631 #3631 (comment)

Related #3706

Related 909e7d0

…arrow` Provides better dtype inference

Switching to one with a timestamp that `frictionless` recognises https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L2674-L2689 https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L45-L57

dangotbanned added 6 commits October 2, 2024 22:13

wip

7933771

feat(DRAFT): Minimal reimplementation

b30081e

refactor: Make version accessible via data.source_tag

279586b

- Allow quickly switching between version tags #3150 (comment)

refactor: ext_fn -> Dataset.read_fn

32150ad

docs: Add trailing docs to long literals

f1d18a2

docs: Add module-level doc

4d3c550

dangotbanned added the maintenance label Oct 4, 2024

dangotbanned added 23 commits October 4, 2024 20:15

Merge branch 'main' into vega-datasets

7e65841

Merge branch 'main' into vega-datasets

05773af

Merge branch 'main' into vega-datasets

4fff80a

feat: Adds .arrow support

3a284a5

To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)

feat: Add support for caching metadata

22a5039

feat: Support env var VEGA_GITHUB_TOKEN

a618ffc

Not required for these requests, but may be helpful to avoid limits

feat: Add support for multi-version metadata

1792340

As an example, for comparing against the most recent I've added the 5 most recent

refactor: Renaming, docs, reorganize

fa2c9e7

feat: Support collecting release tags

24cd7d7

See https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags

feat: Adds refresh_tags

7dd461f

- Basic mechanism for discovering new versions - Tries to minimise number of and total size of requests

feat(DRAFT): Adds url_from

9768495

Experimenting with querying the url cache w/ expressions

fix: Wrap all requests with auth

c38c235

chore: Remove DATASET_NAMES_USED

a22cc8a

feat: Major GitHub rewrite, handle rate limiting

1181860

- `metadata_full.parquet` stores **all known** file metadata - `GitHub.refresh()` to maintain integrity in a safe manner - Roughly 3000 rows - Single release: **9kb** vs 46 releases: **21kb**

feat(DRAFT): Partial implement data("name")

31eeb20

fix(typing): Resolve some mypy errors

511a845

Merge branch 'main' into vega-datasets

c76cfd4

Merge branch 'main' into vega-datasets

d3f0497

Merge branch 'main' into vega-datasets

1b3390b

fix(ruff): Apply 3.8 fixes

a770ba9

https://github.com/vega/altair/actions/runs/11495437283/job/31994955413

docs(typing): Add WorkInProgress marker to data(...)

686a485

- Still undecided exactly how this functionality should work - Need to resolve `npm` tags != `gh` tags issue as well

Merge branch 'main' into vega-datasets

ba4491d

Merge branch 'main' into vega-datasets

1a4e107

test: Add a complex xfail for test_load_call

11da9c8

Doesn't happen in CI, still unclear why the import within `pandas` breaks under these conditions. Have tried multiple combinations of `pytest.MonkeyPatch`, hard imports, but had no luck in fixing the bug

dangotbanned added 9 commits November 22, 2024 16:46

refactor: Renaming/recomposing _readers.py

694ada0

The next commits benefit from having functionality decoupled from `_Reader.query`. Mainly, keeping things lazy and not raising a user-facing error

build: Generate VERSION_LATEST

6f41c7e

Simplifies logic that relies on enum/categoricals that may not be recognised as ordered

feat: Adds _cache.py for UrlCache, DatasetCache

88d06a6

Docs to follow

Merge remote-tracking branch 'upstream/main' into vega-datasets

a0d2df4

ci(ruff): Ignore 0.8.0 violations

f21b52b

#3687 (comment)

Merge remote-tracking branch 'upstream/main' into vega-datasets

de03046

fix: Use stable narwhals imports

e7974d9

narwhals-dev/narwhals#1426, #3693 (comment)

Merge branch 'main' into vega-datasets

8ba48a9

Merge branch 'main' into vega-datasets

9d97096

mattijn mentioned this pull request Nov 24, 2024

Test suite is failing with Vegafusion 2 #3701

Open

dangotbanned added 7 commits November 24, 2024 13:50

Merge remote-tracking branch 'upstream/main' into vega-datasets

a698de9

revert(ruff): Ignore 0.8.0 violations

c907dc5

f21b52b

revert: Remove _readers._filter

a3b38c4

Feature has been adopted upstream in narwhals-dev/narwhals#1417

feat: Adds example and tests for disabling caching

a6c5096

refactor: Tidy up DatasetCache

71423ea

docs: Finish Loader.cache

7dd9c18

Not using doctest style here, none of these return anything but I want them hinted at

refactor(typing): Use Mapping instead of dict

a982759

Mutability is not needed. Also see #3573

This was referenced Nov 24, 2024

Additional metadata for datasets vega/vega-datasets#629

Closed

Can we rename weather.csv? vega/vega-datasets#633

Open

dangotbanned added 8 commits November 30, 2024 14:44

perf: Use to_list() for all backends

d20e9c1

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

feat(DRAFT): Utilize datapackage schemas in pandas backends

909e7d0

Provides a generalized solution to `pd.read_(csv|json)` requiring the names of date columns to attempt parsing. cc @joelostblom The solution is possible in large part to vega/vega-datasets#631 #3631 (comment)

Merge remote-tracking branch 'upstream/main' into vega-datasets

d93fda1

refactor(ruff): Apply TC006 fixes in new code

9274284

Related #3706

docs(DRAFT): Add notes on datapackage.features_typing

8e232b8

docs: Update Loader.from_backend example w/ dtypes

9330895

Related 909e7d0

feat: Use _pl_read_json_roundtrip instead of pl.read_json for `py…

caf534d

…arrow` Provides better dtype inference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(RFC): Adds `altair.datasets` #3631

feat(RFC): Adds `altair.datasets` #3631

dangotbanned commented Oct 4, 2024 •

edited

Loading

mattijn commented Nov 22, 2024

dangotbanned commented Nov 24, 2024 •

edited

Loading

feat(RFC): Adds altair.datasets #3631

Are you sure you want to change the base?

feat(RFC): Adds altair.datasets #3631

Conversation

dangotbanned commented Oct 4, 2024 • edited Loading

Related

Description

Examples

Tasks

Resolved

Deferred

Reducing cache footprint

Investigate providing a decorator to add a backend

Investigate ways to utilize (https://github.com/vega/vega-datasets/blob/main/SOURCES.md)

Provide more meaningful info on the state of ALTAIR_DATASETS_DIR

mattijn commented Nov 22, 2024

dangotbanned commented Nov 24, 2024 • edited Loading

Update

feat(RFC): Adds `altair.datasets` #3631

feat(RFC): Adds `altair.datasets` #3631

dangotbanned commented Oct 4, 2024 •

edited

Loading

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`

dangotbanned commented Nov 24, 2024 •

edited

Loading