feat(RFC): Adds altair.datasets #3631

Draft · wants to merge 159 commits into main
Conversation

@dangotbanned (Member) commented Oct 4, 2024

Related

Description

Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach from that of https://github.com/altair-viz/vega_datasets, notably:

Examples

These all come from the docstrings of:

  • Loader
  • Loader.from_backend
  • Loader.__call__
import altair as alt
from altair.datasets import Loader

data = Loader.from_backend("polars")
>>> data
Loader[polars]

cars = data("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame

data = Loader.from_backend("pandas")
cars = data("cars")

>>> type(cars)
pandas.core.frame.DataFrame

data = Loader.from_backend("pandas[pyarrow]")
cars = data("cars", tag="v1.29.0")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                string[pyarrow]
Miles_per_Gallon    double[pyarrow]
Cylinders            int64[pyarrow]
Displacement        double[pyarrow]
Horsepower           int64[pyarrow]
Weight_in_lbs        int64[pyarrow]
Acceleration        double[pyarrow]
Year                string[pyarrow]
Origin              string[pyarrow]
dtype: object

data = Loader.from_backend("pandas")
source = data("stocks", tag="v2.10.0")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')

data = Loader.from_backend("pyarrow")
source = data("stocks", tag="v2.10.0")

>>> source.column_names
['symbol', 'date', 'price']

Tasks

Resolved

Investigate bundling metadata

  • Investigating bundling metadata (22a5039), (1792340)
    • Depending on how well the compression scales, it might be reasonable to include this for some number of versions
    • Deliberately including redundant info early on - can always chip away at it later

npm does not have every version available (GitHub does)

Plan strategy for user-configurable dataset cache

  • Everything so far has been building the tools for a compact bundled index
    • 1, 2, 3, 4, 5
    • Refreshing the index would not be included in altair; each release would simply ship with the changes baked in
  • Trying to avoid bloating altair package size with datasets
  • User-facing
    • Goal of requesting each unique dataset version once
      • The user cache would not need to be updated between altair versions
    • Some kind of opt-in config to say "store the datasets in this directory, please" (see the sketch after this list)
      • A basic solution would be defining an env variable like ALTAIR_DATASETS_DIR
      • When not provided, always perform remote requests
        • The users' motivation for enabling caching would be that it is faster
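
A minimal sketch of how the opt-in directory might be resolved; only the env variable name comes from the plan above, and the helper itself is hypothetical:

from __future__ import annotations

import os
from pathlib import Path


def _resolve_cache_dir() -> Path | None:
    """Hypothetical helper: return the opt-in cache directory, or None to always request remotely."""
    if env := os.environ.get("ALTAIR_DATASETS_DIR"):
        cache = Path(env)
        cache.mkdir(parents=True, exist_ok=True)
        return cache
    # No cache configured; every dataset request goes to the remote source.
    return None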

Deferred

Reducing cache footprint

  • e.g. storing the .(csv|tsv|json) files as .parquet (see the sketch after this list)
  • Need to do more testing on this though to ensure
    • the shape of each dataset is preserved
    • where relevant - intentional errors remain intact
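
For illustration only, a rough sketch (assuming polars; the function name is hypothetical) of re-saving a cached text dataset as Parquet while checking that its shape survives the round trip:

from __future__ import annotations

from pathlib import Path

import polars as pl


def _compact(fp: Path) -> Path:
    # Hypothetical sketch: read a cached .csv/.tsv/.json file and re-save it as .parquet.
    readers = {
        ".csv": pl.read_csv,
        ".tsv": lambda p: pl.read_csv(p, separator="\t"),
        ".json": pl.read_json,
    }
    df = readers[fp.suffix](fp)
    out = fp.with_suffix(".parquet")
    df.write_parquet(out)
    # The shape of each dataset must be preserved by the conversion.
    assert pl.read_parquet(out).shape == df.shape
    return out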

Investigate providing a decorator to add a backend

  • Will be trivial on the user side, since they don't need to be concerned about imports
  • Just need to provide these attributes (see the hypothetical sketch after this list):
    • _name: LiteralString
    • _read_fn: dict[Extension, Callable[..., IntoDataFrameT]]
    • _scan_fn: dict[_ExtensionScan, Callable[..., IntoFrameT]]
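
A hypothetical sketch of what such a user-defined backend might carry; only the three attribute names come from the list above, everything else (the class, the pandas readers) is illustrative:

import pandas as pd


class MyPandasBackend:
    # Hypothetical backend sketch; the eventual decorator/registration mechanism is still to be decided.
    _name = "my-pandas"
    _read_fn = {
        ".csv": pd.read_csv,
        ".tsv": lambda fp: pd.read_csv(fp, sep="\t"),
        ".json": pd.read_json,
        ".parquet": pd.read_parquet,
    }
    _scan_fn = {
        # pandas has no lazy scan, so the eager reader is reused here.
        ".parquet": pd.read_parquet,
    }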

Investigate ways to utilize (https://github.com/vega/vega-datasets/blob/main/SOURCES.md)

Provide more meaningful info on the state of ALTAIR_DATASETS_DIR

polars-native solution

from __future__ import annotations

from pathlib import Path

import polars as pl
from altair.datasets import Loader, _readers

data = Loader.from_backend("polars")

# NOTE: Enable caching, populate with some responses
data.cache_dir = Path.home() / ".altair_cache"
data("cars")
data("cars", tag="v1.5.0")
data("movies")
data("movies", tag="v1.24.0")
data("jobs")


if cache_dir := data.cache_dir:
    cached_stems: tuple[str, ...] = tuple(fp.stem for fp in cache_dir.iterdir())
else:
    msg = "Datasets cache unset"
    raise TypeError(msg)

# NOTE: Lots of redundancies, many urls point to the same data (sha)
>>> pl.read_parquet(_readers._METADATA).shape
# (2879, 9)

# NOTE: Version range per sha
tag_sort: pl.Expr = pl.col("tag").sort()
tag_range: pl.Expr = pl.concat_str(tag_sort.first(), tag_sort.last(), separator=" - ")

# NOTE: Producing a name only when the file is already in the cache
file_name: pl.Expr = pl.when(pl.col("sha").is_in(cached_stems)).then(
    pl.concat_str("sha", "suffix")
)

cache_summary: pl.DataFrame = (
    pl.scan_parquet(_readers._METADATA)
    .group_by("dataset_name", "suffix", "size", "sha")
    .agg(tag_range=tag_range)
    .select(pl.exclude("sha"), file_name=file_name)
    .sort("dataset_name", "size")
    .collect()
)

>>> cache_summary.shape
# (116, 5)

>>> cache_summary.head(10)
shape: (10, 5)
┌───────────────┬────────┬─────────┬───────────────────┬─────────────────────────────────┐
│ dataset_name  ┆ suffix ┆ size    ┆ tag_range         ┆ file_name                       │
│ ---           ┆ ---    ┆ ---     ┆ ---               ┆ ---                             │
│ str           ┆ str    ┆ i64     ┆ str               ┆ str                             │
╞═══════════════╪════════╪═════════╪═══════════════════╪═════════════════════════════════╡
│ 7zip          ┆ .png   ┆ 3969    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ airports      ┆ .csv   ┆ 210365  ┆ v1.5.0 - v2.10.0  ┆ 608ba6d51fa70584c3fa1d31eb9453… │
│ annual-precip ┆ .json  ┆ 266265  ┆ v1.29.0 - v2.10.0 ┆ null                            │
│ anscombe      ┆ .json  ┆ 1703    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ barley        ┆ .json  ┆ 8487    ┆ v1.5.0 - v2.10.0  ┆ 8dc50de2509b6e197ce95c24c98f90… │
│ birdstrikes   ┆ .csv   ┆ 1223329 ┆ v2.0.0 - v2.10.0  ┆ null                            │
│ birdstrikes   ┆ .json  ┆ 4183924 ┆ v1.5.0 - v1.31.1  ┆ null                            │
│ budget        ┆ .json  ┆ 374289  ┆ v1.5.0 - v2.8.1   ┆ null                            │
│ budget        ┆ .json  ┆ 391353  ┆ v2.9.0 - v2.10.0  ┆ null                            │
│ budgets       ┆ .json  ┆ 18079   ┆ v1.5.0 - v2.10.0  ┆ 8a909e24f698a3b0f6c637c30ec95e… │
└───────────────┴────────┴─────────┴───────────────────┴─────────────────────────────────┘

Notes

  • Not required for these requests, but may be helpful to avoid rate limits
  • As an example, for comparing against the most recent, I've added the 5 most recent
  • Basic mechanism for discovering new versions
  • Tries to minimise the number and total size of requests
  • Experimenting with querying the url cache with expressions
  • `metadata_full.parquet` stores **all known** file metadata
  • `GitHub.refresh()` to maintain integrity in a safe manner
  • Roughly 3000 rows
  • Single release: **9kb** vs 46 releases: **21kb**
  • Still undecided exactly how this functionality should work
  • Need to resolve the `npm` tags != `gh` tags issue as well
  • Doesn't happen in CI; still unclear why the import within `pandas` breaks under these conditions. Tried multiple combinations of `pytest.MonkeyPatch` and hard imports, but had no luck in fixing the bug so far.
@mattijn (Contributor) commented Nov 22, 2024

I'm reviewing as an average user of Altair, and for this use case that is probably an associate professor who needs to update all of their lecture materials the evening before the semester starts.

Wouldn't it be great if we could say:

# old way (this is deprecated)
from vega_datasets import data
# new way (this will be awesome)
from altair.datasets import data

And everything else still functions. So this still works:

source_url = data.cars.url
source_pandas = data.cars()

But the awesome thing that we provide with this PR is:

source_polars = data.cars(backend="polars")  # or `engine=`

Or polars with pyarrow dtypes:

source_pl_pa = data.cars(backend="polars[pyarrow]")  # or `engine=`

If it is like this, then I'm fine with either engine or backend as the argument name. And then, within this function, we call the agnostic Loader using the dataset and backend choice. All in all, in my humble opinion, awesomeness.
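
(For illustration, a minimal sketch of my reading of this proposal; the class names are hypothetical, and only Loader.from_backend and its call signature come from the PR.)

from altair.datasets import Loader


class _Dataset:
    """Hypothetical per-dataset accessor so that data.cars(backend="polars") works."""

    def __init__(self, name: str) -> None:
        self._name = name

    def __call__(self, backend: str = "pandas", **kwargs):
        # Delegate to the agnostic Loader using the dataset and backend choice.
        return Loader.from_backend(backend)(self._name, **kwargs)


class _DataAccessor:
    """Hypothetical data namespace: attribute access yields a dataset accessor."""

    def __getattr__(self, name: str) -> _Dataset:
        return _Dataset(name)


data = _DataAccessor()

source_polars = data.cars(backend="polars")
source_pd_pa = data.stocks(backend="pandas[pyarrow]", tag="v2.10.0")

The data.cars.url case from the old vega_datasets API is not covered by this sketch; the accessor would still need some way to expose the underlying URL.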

@dangotbanned (Member, Author) commented Nov 24, 2024

@jonmmease I just tried updating this branch; there seem to be some vegafusion issues?

9d97096 (#3631)

Update

Resolved in #3702
