Can we rename weather.csv? #633
Comments
Just for context on the spec itself, below is the language from datapackage.org, which advises against, but doesn't appear to preclude, a resource name differing from its filename (excluding the extension). It sounds like good practice to keep the filename roots distinct, which would have to be weighed against the cost of a breaking change to an existing filename. Probably a call for @domoritz.
"A resource MUST contain a name property. The name is a simple name or identifier to be used for this resource. It MUST be unique amongst all resources in this data package. It SHOULD be human-readable and consist only of lowercase English alphanumeric characters plus ., - and _. It would be usual for the name to correspond to the file name (minus the extension) of the data file the resource describes." |
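To make the two requirements in the quoted spec text concrete, here is a minimal sketch that checks a list of resource names for the MUST (uniqueness) and the SHOULD (character set). The function name `check_resource_names`, the regular expression, and the sample names are illustrative assumptions; this is not part of any Frictionless tooling.

```python
import re
from collections import Counter

# Assumed reading of the SHOULD clause: lowercase English alphanumerics plus ".", "-" and "_".
NAME_PATTERN = re.compile(r"[a-z0-9._-]+")


def check_resource_names(names):
    """Return (duplicates, non_conforming) for a list of resource names."""
    counts = Counter(names)
    # MUST: every name is unique within the data package.
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    # SHOULD: names use only the recommended character set.
    non_conforming = sorted({n for n in names if not NAME_PATTERN.fullmatch(n)})
    return duplicates, non_conforming


# "someCamelCase" is a hypothetical name, included only to trigger the SHOULD check.
names = ["weather", "weather", "flights-200k", "flights-200k.arrow", "someCamelCase"]
duplicates, non_conforming = check_resource_names(names)
print(duplicates)      # ['weather']
print(non_conforming)  # ['someCamelCase']
```

Note that a name such as "flights-200k.arrow" passes the character check, since "." is explicitly allowed.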
I think we can rename but should do a major version bump if it's not backwards compatible. I'm okay with doing that alongside moving to ESM, for example. |
It does seem like a limitation of Frictionless Data to not accommodate multiple file formats of the same underlying dataset under a single logical For now, though, since we're already going to introduce a breaking change to resolve We could follow a simple priority rule where the JSON version keeps the base name
Would it make sense to handle both naming conflicts in a single breaking change? |
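As a rough illustration of the priority rule floated above (the JSON version keeps the base name), the sketch below only adds the format to a name when several files share a stem. The helper `assign_names`, the constant `PREFERRED_SUFFIX`, and the hyphenated form given to the other formats are assumptions made for the example, not a settled design.

```python
from pathlib import Path

PREFERRED_SUFFIX = ".json"  # assumption: the JSON version keeps the base name


def assign_names(filenames):
    """Map each file name to a resource name, adding the format only on conflicts."""
    by_stem = {}
    for filename in filenames:
        by_stem.setdefault(Path(filename).stem, []).append(filename)

    names = {}
    for stem, group in by_stem.items():
        for filename in group:
            suffix = Path(filename).suffix
            if len(group) == 1 or suffix == PREFERRED_SUFFIX:
                # No conflict, or this is the preferred format: keep the base name.
                names[filename] = stem
            else:
                # Conflict: append the format, e.g. "flights-200k-arrow".
                names[filename] = f"{stem}-{suffix[1:]}"
    return names


print(assign_names(["flights-200k.json", "flights-200k.arrow", "movies.json"]))
```

One consequence of this design is that a resource name depends on which sibling files exist, so adding a second format later would change how names are assigned within that group.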
Yeah, single breaking change. We should raise an issue with the frictionless data group that we can follow to update here when it's out. |
@dsmedia I realise I didn't explain why I suggested Could you explain the reasoning for We can currently use:
vega-datasets/datapackage.json Lines 628 to 634 in 8745f5c
As an aside, I've got some local changes to (vega/altair#3631) that are utilizing
I don't think that is an issue for the URL, but there are some camelCase names I'd preserved in their |
My concern is only with the non-unique vega-datasets/datapackage.json Lines 597 to 599 in 8745f5c
vega-datasets/datapackage.json Lines 627 to 629 in 8745f5c
Here is the relevant json spec:
My suggestion above was just for how the script would be designed to avoid duplicate resource names. |
@dsmedia I'm going to try and illustrate the issues I have a little better. This is how each form is interpreted by
from pathlib import Path
CURRENT = "flights-200k"
DOT = "flights-200k.arrow" # (https://github.com/vega/vega-datasets/issues/633#issuecomment-2511517256)
HYPHEN = "flights-200k-arrow" # (https://github.com/vega/vega-datasets/issues/633#issuecomment-2511334248)
>>> Path(CURRENT).name, Path(DOT).name, Path(HYPHEN).name
('flights-200k', 'flights-200k.arrow', 'flights-200k-arrow')
>>> Path(CURRENT).stem, Path(DOT).stem, Path(HYPHEN).stem
('flights-200k', 'flights-200k', 'flights-200k-arrow')
>>> Path(CURRENT).suffix, Path(DOT).suffix, Path(HYPHEN).suffix
('', '.arrow', '')
The first problem is that using a
>>> Path("flights-200k.json").stem == Path("flights-200k.arrow").stem
True
The second issue relates to:
Important I'd consider this a leaky abstraction If one wants to use the
The additional complexity wouldn't be something I'd want to model in (altair#3631/commits/909e7d0).
For this specific case, we'd only need to make the following change:
diff --git a/datapackage.json b/datapackage.json
index 2105719..4158b86 100644
--- a/datapackage.json
+++ b/datapackage.json
@@ -625,7 +625,7 @@
}
},
{
- "name": "flights-200k",
+ "name": "flights-200k.arrow",
"type": "table",
"path": "flights-200k.arrow",
"scheme": "file",
diff --git a/scripts/build_datapackage.py b/scripts/build_datapackage.py
index 30834a6..ffb7661 100755
--- a/scripts/build_datapackage.py
+++ b/scripts/build_datapackage.py
@@ -188,7 +188,7 @@ class ResourceAdapter:
def _extract_file_parts(cls, source: Path, /) -> dict[PathMeta, str]:
"""Metadata that can be inferred from the file path *alone*."""
parts = {
- "name": source.stem,
+ "name": source.name,
"path": source.name,
"format": source.suffix[1:],
"scheme": "file",
A more generalized solution (doing this for formats other than
Although, I think it would be simpler to just use Path.name in all cases - regardless of duplicates.
We already have the guarantee of uniqueness provided by the filesystem. |
The issue for doing major version bumps: vega/vega#3990. I'm planning to get to this over the winter break. We can include breaking changes from file renaming etc as well so let me know what you decide makes the most sense. |
@dangotbanned This sounds quite reasonable: "Path.name in all cases - regardless of duplicates." The filesystem-based uniqueness guarantee is ideal. Thanks! |
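A minimal sketch of the agreed approach, mirroring the `_extract_file_parts` change in the diff earlier in the thread. The standalone function `extract_file_parts` and the file list are illustrative stand-ins, not the actual build script.

```python
from pathlib import Path


def extract_file_parts(source: Path) -> dict[str, str]:
    """Metadata inferred from the file path alone, using the full file name as the name."""
    return {
        "name": source.name,  # previously source.stem, which collides across formats
        "path": source.name,
        "format": source.suffix[1:],
        "scheme": "file",
    }


files = [
    Path("flights-200k.json"),
    Path("flights-200k.arrow"),
    Path("weather.csv"),
    Path("weather.json"),
]
resources = [extract_file_parts(p) for p in files]
names = [r["name"] for r in resources]

# File names within one directory are unique, so the resource names are too.
assert len(names) == len(set(names))
print(names)
```

Because the name and the path are now identical, no special-casing is needed when a new format of an existing dataset is added.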
Following #631, I was surprised to find that we have two very different datasets named "weather".
I'd expected datasets sharing the same base name/stem to represent the same source.
"flights-200k" is the only other duplicated name - but both the .arrow and .json files represent the same data.
I'm thinking ahead towards (vega/altair#3631 (comment)), where there may be datasets with .json and .parquet versions.
In that world, a guarantee on a shared stem representing the same data would provide options to resolve incompatibilities in (vega/altair#3631 (comment))
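For reference, shared stems like these can be surfaced mechanically. The sketch below groups a list of file names by stem and keeps only the groups with more than one member; the helper `shared_stems` and the sample list are hypothetical. It cannot tell whether the grouped files hold the same data, only that they would need a manual check.

```python
from collections import defaultdict
from pathlib import Path


def shared_stems(filenames):
    """Group file names by stem and keep only stems used by more than one file."""
    groups = defaultdict(list)
    for filename in filenames:
        groups[Path(filename).stem].append(filename)
    return {stem: sorted(group) for stem, group in groups.items() if len(group) > 1}


files = [
    "weather.csv",
    "weather.json",
    "flights-200k.arrow",
    "flights-200k.json",
    "movies.json",
    "population.json",
]
for stem, group in shared_stems(files).items():
    print(stem, group)
```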
I do understand renaming would have to be a breaking change, and should not be taken lightly.
Usage
The following uses jsdelivr-stats looking at the range of the past year.
Currently, altair-viz/vega_datasets is by far the greatest source of traffic - which is pinned on v1.29.0.
Versions
[Figures: v1.29.0, v1.31.1, v2.7.0]
v1.29.0
There is a weather.csv for this version - but it is not accessible using the python api.
"weather" is defined as a reference to weather.json only.
The effect this has is quite apparent when comparing the traffic per-file:
[Figures: per-file traffic (v1.29.0) for movies.json, population.json, weather.json, weather.csv]
Given that v1.29.0 is baked into the package - none of this usage would be impacted by the change.
jsdelivr-npm seems to be able to handle version-fallback in the event that anyone is using @latest - so maybe the impact of this change is fairly limited?
Descriptions
weather.json
vega-datasets/SOURCES.md Lines 403 to 405 in 719c388
vega-datasets/datapackage.json Lines 1207 to 1215 in 719c388
weather.csv
vega-datasets/SOURCES.md Lines 407 to 409 in 719c388
vega-datasets/datapackage.json Lines 1679 to 1718 in 719c388
Side note
Having a unique name per-resource is part of the spec we haven't met yet.
I'm less interested in that part as we can just include the suffix if needed.