feat(RFC): Adds altair.datasets #3631

Draft
wants to merge 159 commits into
base: main
7933771
wip
dangotbanned Oct 2, 2024
b30081e
feat(DRAFT): Minimal reimplementation
dangotbanned Oct 4, 2024
279586b
refactor: Make version accessible via `data.source_tag`
dangotbanned Oct 4, 2024
32150ad
refactor: `ext_fn` -> `Dataset.read_fn`
dangotbanned Oct 4, 2024
f1d18a2
docs: Add trailing docs to long literals
dangotbanned Oct 4, 2024
4d3c550
docs: Add module-level doc
dangotbanned Oct 4, 2024
7e65841
Merge branch 'main' into vega-datasets
dangotbanned Oct 4, 2024
05773af
Merge branch 'main' into vega-datasets
dangotbanned Oct 5, 2024
4fff80a
Merge branch 'main' into vega-datasets
dangotbanned Oct 6, 2024
3a284a5
feat: Adds `.arrow` support
dangotbanned Oct 7, 2024
22a5039
feat: Add support for caching metadata
dangotbanned Oct 7, 2024
a618ffc
feat: Support env var `VEGA_GITHUB_TOKEN`
dangotbanned Oct 7, 2024
1792340
feat: Add support for multi-version metadata
dangotbanned Oct 7, 2024
fa2c9e7
refactor: Renaming, docs, reorganize
dangotbanned Oct 8, 2024
24cd7d7
feat: Support collecting release tags
dangotbanned Oct 8, 2024
7dd461f
feat: Adds `refresh_tags`
dangotbanned Oct 8, 2024
9768495
feat(DRAFT): Adds `url_from`
dangotbanned Oct 8, 2024
c38c235
fix: Wrap all requests with auth
dangotbanned Oct 8, 2024
a22cc8a
chore: Remove `DATASET_NAMES_USED`
dangotbanned Oct 9, 2024
1181860
feat: Major `GitHub` rewrite, handle rate limiting
dangotbanned Oct 11, 2024
31eeb20
feat(DRAFT): Partial implement `data("name")`
dangotbanned Oct 11, 2024
511a845
fix(typing): Resolve some `mypy` errors
dangotbanned Oct 11, 2024
c76cfd4
Merge branch 'main' into vega-datasets
dangotbanned Oct 12, 2024
d3f0497
Merge branch 'main' into vega-datasets
dangotbanned Oct 13, 2024
1b3390b
Merge branch 'main' into vega-datasets
dangotbanned Oct 24, 2024
a770ba9
fix(ruff): Apply `3.8` fixes
dangotbanned Oct 24, 2024
686a485
docs(typing): Add `WorkInProgress` marker to `data(...)`
dangotbanned Oct 24, 2024
ba4491d
Merge branch 'main' into vega-datasets
dangotbanned Oct 25, 2024
1a4e107
Merge branch 'main' into vega-datasets
dangotbanned Oct 29, 2024
989b9b7
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 5, 2024
0bbf2e9
feat(DRAFT): Add a source for available `npm` versions
dangotbanned Nov 5, 2024
9c386e2
refactor: Bake `"v"` prefix into `tags_npm`
dangotbanned Nov 6, 2024
1937f2b
refactor: Move `_npm_metadata` into a class
dangotbanned Nov 6, 2024
66fa6d1
chore: Remove unused, add todo
dangotbanned Nov 6, 2024
937aa01
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 6, 2024
21b2edd
feat: Adds `app` context for github<->npm
dangotbanned Nov 6, 2024
6527305
fix: Invalidate old trees
dangotbanned Nov 6, 2024
336eeca
chore: Remove early test files
dangotbanned Nov 6, 2024
225be0a
refactor: Rename `metadata_full` -> `metadata`
dangotbanned Nov 6, 2024
e91baab
refactor: `tools.vendor_datasets` -> `tools.datasets` package
dangotbanned Nov 6, 2024
7782925
refactor: Move `TypedDict`, `NamedTuple`(s) -> `datasets.models`
dangotbanned Nov 6, 2024
bc86ca1
refactor: Move, rename `semver`-related tools
dangotbanned Nov 6, 2024
a6f5645
refactor: Remove `write_schema` from `_Npm`, `_GitHub`
dangotbanned Nov 6, 2024
07a8342
refactor: Rename, split `_Npm`, `_GitHub` into own modules
dangotbanned Nov 6, 2024
b89e6dc
refactor: Move `DataLoader.__call__` -> `DataLoader.url()`
dangotbanned Nov 6, 2024
7b0fe29
feat(typing): Generate annotations based on known datasets
dangotbanned Nov 6, 2024
572d069
refactor(typing): Utilize `datasets._typing`
dangotbanned Nov 6, 2024
07dcc0b
feat: Adds `Npm.dataset` for remote reading
dangotbanned Nov 6, 2024
d8f3791
refactor: Remove dead code
dangotbanned Nov 6, 2024
4642a23
refactor: Replace `name_js`, `name_py` with `dataset_name`
dangotbanned Nov 7, 2024
65f87fc
fix: Remove invalid `semver.sort` op
dangotbanned Nov 7, 2024
6349b0f
fix: Add missing init path for `refresh_trees`
dangotbanned Nov 7, 2024
f1d610c
refactor: Move public interface to `_io`
dangotbanned Nov 7, 2024
c4ef112
refactor(perf): Don't recreate path mapping on every attribute access
dangotbanned Nov 7, 2024
eb876eb
refactor: Split `Reader._url_from` into `url`, `_query`
dangotbanned Nov 7, 2024
661a385
feat(DRAFT): Adds `GitHubUrl.BLOBS`
dangotbanned Nov 7, 2024
22dcb17
feat: Store `sha` instead of `github_url`
dangotbanned Nov 7, 2024
669df02
feat(perf): Adds caching to `ALTAIR_DATASETS_DIR`
dangotbanned Nov 7, 2024
2051410
feat(DRAFT): Adds initial generic backends
dangotbanned Nov 7, 2024
0ea4e21
feat: Generate and move `Metadata` (`TypedDict`) to `datasets._typing`
dangotbanned Nov 8, 2024
a2e9baa
feat: Adds optional backends, `polars[pyarrow]`, `with_backend`
dangotbanned Nov 8, 2024
c8a1429
feat: Adds `pyarrow` backend
dangotbanned Nov 8, 2024
279fea9
docs: Update `.with_backend()`
dangotbanned Nov 8, 2024
7d6c7ca
chore: Remove `duckdb` comment
dangotbanned Nov 8, 2024
0bb4210
ci(typing): Add `pyarrow-stubs` to `dev` dependencies
dangotbanned Nov 8, 2024
8984425
refactor: `generate_datasets_typing` -> `Application.generate_typing`
dangotbanned Nov 8, 2024
9d062c8
refactor: Split `datasets` into public/private packages
dangotbanned Nov 8, 2024
a17d674
refactor: Provide `npm` url to `GitHub(...)`
dangotbanned Nov 8, 2024
69a619c
refactor: Rename `ext` -> `suffix`
dangotbanned Nov 8, 2024
a259b10
refactor: Remove unimplemented `tag="latest"`
dangotbanned Nov 8, 2024
88968c8
feat: Rename `_datasets_dir`, make configurable, add docs
dangotbanned Nov 9, 2024
b987308
docs: Adds examples to `Loader.with_backend`
dangotbanned Nov 9, 2024
4a2a2e0
refactor: Clean up requirements -> imports
dangotbanned Nov 9, 2024
e6dd27e
docs: Add basic example to `Loader` class
dangotbanned Nov 9, 2024
2a7bc4f
refactor: Reorder `alt.datasets` module
dangotbanned Nov 9, 2024
c572180
docs: Fill out `Loader.url`
dangotbanned Nov 9, 2024
9ab9463
feat: Adds `_Reader._read_metadata`
dangotbanned Nov 9, 2024
dd3edd6
refactor: Rename `(reader|scanner)_from()` -> `(read|scan)_fn()`
dangotbanned Nov 9, 2024
146cb50
refactor(typing): Replace some explicit casts
dangotbanned Nov 9, 2024
94ad0d1
refactor: Shorten and document request delays
dangotbanned Nov 10, 2024
4093383
feat(DRAFT): Make `[tag]` a `pl.Enum`
dangotbanned Nov 10, 2024
76cdd45
fix: Handle `pyarrow` scalars conversion
dangotbanned Nov 10, 2024
bb7bc17
test: Adds `test_datasets`
dangotbanned Nov 10, 2024
ebc1bfa
fix(DRAFT): hotfix `pyarrow` read
dangotbanned Nov 10, 2024
fe0ae88
fix(DRAFT): Treat `polars` as exception, invalidate cache
dangotbanned Nov 10, 2024
7089f2a
test: Skip `pyarrow` tests on `3.9`
dangotbanned Nov 10, 2024
e1290d4
refactor: Tidy up changes from last 4 commits
dangotbanned Nov 11, 2024
9d88e1b
refactor: Rework `_readers.py`
dangotbanned Nov 11, 2024
60d39f5
test: Adds tests for missing dependencies
dangotbanned Nov 11, 2024
d6f0e45
test: Adds `test_dataset_not_found`
dangotbanned Nov 11, 2024
b7d57a0
test: Adds `test_reader_cache`
dangotbanned Nov 11, 2024
5c2e581
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 11, 2024
b70aef8
docs: Finish `_Reader`, fill parameters of `Loader.__call__`
dangotbanned Nov 11, 2024
403b787
refactor: Rename `backend` -> `backend_name`, `get_backend` -> `backend`
dangotbanned Nov 11, 2024
3fbc759
fix(DRAFT): Add multiple fallbacks for `pyarrow` JSON
dangotbanned Nov 12, 2024
4f5b4de
test: Remove `pandas` fallback for `pyarrow`
dangotbanned Nov 12, 2024
69a72b6
test: Adds `test_all_datasets`
dangotbanned Nov 12, 2024
08101cc
refactor: Remove `_Reader._response`
dangotbanned Nov 12, 2024
90428a6
fix: Correctly handle no remote connection
dangotbanned Nov 12, 2024
8ad78c1
docs: Align `_typing.Metadata` and `Loader.(url|__call__)` descriptions
dangotbanned Nov 12, 2024
e650454
feat: Update to `v2.10.0`, fix tag inconsistency
dangotbanned Nov 12, 2024
72296b0
refactor: Tidying up `tools.datasets`
dangotbanned Nov 12, 2024
ca1b500
revert: Remove tags schema files
dangotbanned Nov 12, 2024
5bd70d1
ci: Introduce `datasets` refresh to `generate_schema_wrapper`
dangotbanned Nov 12, 2024
012f98b
docs: Add `tools.datasets.Application` doc
dangotbanned Nov 12, 2024
bc0f42c
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 12, 2024
5e677c0
revert: Remove comment
dangotbanned Nov 12, 2024
a99d2c9
docs: Add a table preview to `Metadata`
dangotbanned Nov 13, 2024
7e6da39
docs: Add examples for `Loader.__call__`
dangotbanned Nov 13, 2024
b49e679
refactor: Rename `DatasetName` -> `Dataset`, `VersionTag` -> `Version`
dangotbanned Nov 13, 2024
7a14394
fix: Ensure latest `[tag]` appears first
dangotbanned Nov 13, 2024
99f823e
refactor: Misc `models.py` updates
dangotbanned Nov 13, 2024
dcef1d9
docs: Update `tools.datasets.__init__.py`
dangotbanned Nov 13, 2024
173f3d6
test: Fix `@datasets_debug` selection
dangotbanned Nov 13, 2024
3f5a805
test: Add support for overrides in `test_all_datasets`
dangotbanned Nov 13, 2024
4fc8446
test: Adds `test_metadata_columns`
dangotbanned Nov 13, 2024
882af33
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 13, 2024
9e9deeb
fix: Warn instead of raise for hit rate limit
dangotbanned Nov 13, 2024
88d4491
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 13, 2024
ebc8dec
Merge branch 'main' into vega-datasets
dangotbanned Nov 15, 2024
f2823b4
Merge branch 'main' into vega-datasets
dangotbanned Nov 15, 2024
fa5bea8
feat: Update for `v2.11.0`
dangotbanned Nov 16, 2024
95582df
feat: Always use `pl.read_csv(try_parse_dates=True)`
dangotbanned Nov 16, 2024
dc4a230
feat: Adds `_pl_read_json_roundtrip`
dangotbanned Nov 16, 2024
7ddb2a8
feat(DRAFT): Adds infer-based `altair.datasets.load`
dangotbanned Nov 17, 2024
9544d9b
refactor: Rename `Loader.with_backend` -> `Loader.from_backend`
dangotbanned Nov 18, 2024
7b3a89e
feat(DRAFT): Add optional `backend` parameter for `load(...)`
dangotbanned Nov 18, 2024
c835c13
feat(DRAFT): Adds `altair.datasets.url`
dangotbanned Nov 20, 2024
0817ff8
feat: Support `url(...)` without dependencies
dangotbanned Nov 20, 2024
e01fdd7
fix(DRAFT): Don't generate csv on refresh
dangotbanned Nov 20, 2024
0c5195e
test: Replace rogue `NotImplementedError`
dangotbanned Nov 20, 2024
5595d90
fix: Omit `.gz` last modification time header
dangotbanned Nov 21, 2024
9f62151
docs: Add doc for `Application.write_csv_gzip`
dangotbanned Nov 21, 2024
1bd4552
revert: Remove `"polars[pyarrow]"` backend
dangotbanned Nov 21, 2024
11da9c8
test: Add a complex `xfail` for `test_load_call`
dangotbanned Nov 21, 2024
694ada0
refactor: Renaming/recomposing `_readers.py`
dangotbanned Nov 22, 2024
6f41c7e
build: Generate `VERSION_LATEST`
dangotbanned Nov 22, 2024
88d06a6
feat: Adds `_cache.py` for `UrlCache`, `DatasetCache`
dangotbanned Nov 22, 2024
a0d2df4
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 22, 2024
f21b52b
ci(ruff): Ignore `0.8.0` violations
dangotbanned Nov 22, 2024
de03046
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 22, 2024
e7974d9
fix: Use stable `narwhals` imports
dangotbanned Nov 22, 2024
8ba48a9
Merge branch 'main' into vega-datasets
dangotbanned Nov 23, 2024
9d97096
Merge branch 'main' into vega-datasets
dangotbanned Nov 24, 2024
a698de9
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Nov 24, 2024
c907dc5
revert(ruff): Ignore `0.8.0` violations
dangotbanned Nov 24, 2024
a3b38c4
revert: Remove `_readers._filter`
dangotbanned Nov 24, 2024
a6c5096
feat: Adds example and tests for disabling caching
dangotbanned Nov 24, 2024
71423ea
refactor: Tidy up `DatasetCache`
dangotbanned Nov 24, 2024
7dd9c18
docs: Finish `Loader.cache`
dangotbanned Nov 24, 2024
a982759
refactor(typing): Use `Mapping` instead of `dict`
dangotbanned Nov 24, 2024
d20e9c1
perf: Use `to_list()` for all backends
dangotbanned Nov 30, 2024
909e7d0
feat(DRAFT): Utilize `datapackage` schemas in `pandas` backends
dangotbanned Dec 2, 2024
d93fda1
Merge remote-tracking branch 'upstream/main' into vega-datasets
dangotbanned Dec 2, 2024
9274284
refactor(ruff): Apply `TC006` fixes in new code
dangotbanned Dec 2, 2024
8e232b8
docs(DRAFT): Add notes on `datapackage.features_typing`
dangotbanned Dec 2, 2024
9330895
docs: Update `Loader.from_backend` example w/ dtypes
dangotbanned Dec 2, 2024
caf534d
feat: Use `_pl_read_json_roundtrip` instead of `pl.read_json` for `py…
dangotbanned Dec 2, 2024
75bf2ba
docs: Replace example dataset
dangotbanned Dec 2, 2024
3 changes: 2 additions & 1 deletion altair/__init__.py
@@ -603,6 +603,7 @@
     "core",
     "data",
     "data_transformers",
+    "datasets",
     "datum",
     "default_data_transformer",
     "display",
@@ -651,7 +652,7 @@ def __dir__():
 from altair.jupyter import JupyterChart
 from altair.expr import expr
 from altair.utils import AltairDeprecationWarning, parse_shorthand, Undefined
-from altair import typing, theme
+from altair import datasets, theme, typing


 def load_ipython_extension(ipython):
102 changes: 102 additions & 0 deletions altair/datasets/__init__.py
@@ -0,0 +1,102 @@
from __future__ import annotations

from typing import TYPE_CHECKING

from altair.datasets._loader import Loader

if TYPE_CHECKING:
    import sys
    from typing import Any

    if sys.version_info >= (3, 11):
        from typing import LiteralString
    else:
        from typing_extensions import LiteralString

    from altair.datasets._loader import _Load
    from altair.datasets._typing import Dataset, Extension, Version


__all__ = ["Loader", "load", "url"]


load: _Load[Any, Any]
"""
For full IDE completions, instead use:

    from altair.datasets import Loader
    load = Loader.from_backend("polars")
    cars = load("cars")
    movies = load("movies")

Alternatively, specify ``backend`` during a call:

    from altair.datasets import load
    cars = load("cars", backend="polars")
    movies = load("movies", backend="polars")

Related
-------
- https://github.com/vega/altair/pull/3631#issuecomment-2480832609
- https://github.com/vega/altair/pull/3631#discussion_r1847111064
- https://github.com/vega/altair/pull/3631#discussion_r1847176465
"""


def url(
    name: Dataset | LiteralString,
    suffix: Extension | None = None,
    /,
    tag: Version | None = None,
) -> str:
    """
    Return the address of a remote dataset.

    Parameters
    ----------
    name
        Name of the dataset/`Path.stem`_.
    suffix
        File extension/`Path.suffix`_.

        .. note::
            Only needed if ``name`` is available in multiple formats.
    tag
        Version identifier for a `vega-datasets release`_.

    .. _Path.stem:
        https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.stem
    .. _Path.suffix:
        https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix
    .. _vega-datasets release:
        https://github.com/vega/vega-datasets/releases

    Related
    -------
    - https://github.com/vega/altair/pull/3631#issuecomment-2484826592
    - https://github.com/vega/altair/pull/3631#issuecomment-2480832711
    - https://github.com/vega/altair/discussions/3150#discussioncomment-11280516
    - https://github.com/vega/altair/pull/3631#discussion_r1846662053
    """
    from altair.datasets._readers import AltairDatasetsError

    try:
        from altair.datasets._loader import load

        url = load.url(name, suffix, tag=tag)
    except AltairDatasetsError:
        from altair.datasets._cache import url_cache

        url = url_cache[name]

    return url


def __getattr__(name):
    if name == "load":
        from altair.datasets._loader import load

        return load
    else:
        msg = f"module {__name__!r} has no attribute {name!r}"
        raise AttributeError(msg)
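
Note on the module-level `__getattr__` in `altair/datasets/__init__.py`: it appears to rely on PEP 562, so `load` is only imported from `altair.datasets._loader` the first time it is accessed. Below is a minimal, self-contained sketch of that mechanism. The module name `lazy_demo`, the `calls` list, and the stand-in loader are hypothetical illustrations and are not part of altair.

```python
import sys
import types

# Hypothetical stand-in module; `lazy_demo` is NOT an altair module.
demo = types.ModuleType("lazy_demo")

# Records which names were resolved lazily (for illustration only).
calls = []


def _module_getattr(name):
    # Python calls this only when normal lookup in the module's
    # namespace fails (PEP 562).
    if name == "load":
        calls.append(name)
        # Stand-in for the deferred `from ..._loader import load`.
        return lambda dataset: f"loaded {dataset}"
    msg = f"module {demo.__name__!r} has no attribute {name!r}"
    raise AttributeError(msg)


# Placing `__getattr__` in the module namespace enables the hook.
demo.__getattr__ = _module_getattr
sys.modules["lazy_demo"] = demo

import lazy_demo

# Attribute access triggers the hook; nothing was resolved before this line.
result = lazy_demo.load("cars")

# Unknown names still raise AttributeError, as with a regular module.
try:
    lazy_demo.missing
except AttributeError as err:
    error_message = str(err)
```

Because the hook runs only after normal lookup fails, any name defined eagerly in the module (such as `url` in the diff above) never reaches `__getattr__`.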