Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we rename weather.csv? #633

Open
dangotbanned opened this issue Nov 28, 2024 · 9 comments
Open

Can we rename weather.csv? #633

dangotbanned opened this issue Nov 28, 2024 · 9 comments
Labels

Comments

@dangotbanned
Copy link
Member

Following #631, I was surprised to find that we have two very different datasets named "weather".
I'd expected datasets sharing the same base name/stem to represent the same source.
"flights-200k" is the only other duplicated name - but both the .arrow and .json files represent the same data.

I'm thinking ahead towards (vega/altair#3631 (comment)), where there may be datasets with .json and .parquet versions.
In that world, a guarantee on a shared stem representing the same data would provide options to resolve incompatibilities in (vega/altair#3631 (comment))

I do understand renaming would have to be a breaking change, and should not be taken lightly.

Usage

The following uses jsdelivr-stats looking at the range of the past year.

Currently, altair-viz/vega_datasets is by far the greatest source of traffic - which is pinned on v1.29.0.

Versions

Version Requests
v1.29.0 6,802,781
v1.31.1 215,574
v2.7.0 190,102

v1.29.0

There is a weather.csv for this version - but is not accessible using the python api.
"weather" is defined as a reference to weather.json only.

The effect this has is quite apparent when comparing the traffic per-file:

Dataset (v1.29.0) Requests
movies.json 1,804,118
population.json 654,426
... ...
40 most popular 5,000+
weather.json 2,154
weather.csv 31

Given that v1.29.0 is baked into the package - none of this usage would be impacted by the change.

jsdelivr-npm seems to be able to handle version-fallback in the event that anyone is using @latest - so maybe the impact of this change is fairly limited?

Descriptions

weather.json

vega-datasets/SOURCES.md

Lines 403 to 405 in 719c388

## `weather.json`
Instructional dataset showing actual and predicted temperature data.

vega-datasets/datapackage.json

Lines 1207 to 1215 in 719c388

{
"name": "weather",
"type": "json",
"path": "weather.json",
"scheme": "file",
"format": "json",
"mediatype": "text/json",
"encoding": "utf-8"
},

weather.csv

vega-datasets/SOURCES.md

Lines 407 to 409 in 719c388

## `weather.csv`
Data from [NOAA](http://www.ncdc.noaa.gov/cdo-web/datatools/findstation). Transformed using `/scripts/weather.py`. We synthesized the categorical "weather" field from multiple fields in the original dataset. This data is intended for instructional purposes.

vega-datasets/datapackage.json

Lines 1679 to 1718 in 719c388

"name": "weather",
"type": "table",
"path": "weather.csv",
"scheme": "file",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "location",
"type": "string"
},
{
"name": "date",
"type": "date"
},
{
"name": "precipitation",
"type": "number"
},
{
"name": "temp_max",
"type": "number"
},
{
"name": "temp_min",
"type": "number"
},
{
"name": "wind",
"type": "number"
},
{
"name": "weather",
"type": "string"
}
]
}
},

Side note

Having a unique name per-resource is part of the spec we haven't met yet.
I'm less interested in that part as we can just include the suffix if needed.

@dsmedia
Copy link
Collaborator

dsmedia commented Nov 29, 2024

Just for context on the spec itself, below is the language from datapackage.org, which advises against, but doesn't appear to preclude, a resource name differing from its filename (ex-extension). It sounds like a good practice to keep the filename roots distinct, which would have to be weighed against the cost of a breaking change to an existing filename. Probably a call for @domoritz .

A resource MUST contain a name property. The name is a simple name or identifier to be used for this resource.

It MUST be unique amongst all resources in this data package.

It SHOULD be human-readable and consist only of lowercase English alphanumeric characters plus ., - and _.

It would be usual for the name to correspond to the file name (minus the extension) of the data file the resource describes.

@domoritz
Copy link
Member

I think we can rename but should do a major version bump of it's not backwards compatible. I'm okay with doing that alongside moving to esm for example.

@dsmedia
Copy link
Collaborator

dsmedia commented Dec 2, 2024

@dangotbanned wrote:

"flights-200k" is the only other duplicated name - but both the .arrow and .json files represent the same data.

Side note

Having a unique name per-resource is part of the spec we haven't met yet. I'm less interested in that part as we can just include the suffix if needed.

It does seem like a limitation of Frictionless Data to not accommodate multiple file formats of the same underlying dataset under a single logical name. Perhaps the Frictionless Data specification maintainers will address this use case in a future version.

For now, though, since we're already going to introduce a breaking change to resolve weather, should we bite the bullet and resolve the name conflict for flights-200k, the only remaining spec-breaking duplicate name, in order to be compliant with the Frictionless spec?

We could follow a simple priority rule where the JSON version keeps the base name flights-200k while the Arrow version becomes flights-200k-arrow, following a general precedence of, perhaps, CSV > JSON > Parquet > Arrow for base names. This would:

  • Keep the most human-readable format as the default name
  • Only add format suffixes when needed to resolve conflicts
  • Achieve full spec compliance with minimal renaming

Would it make sense to handle both naming conflicts in a single breaking change?

@domoritz
Copy link
Member

domoritz commented Dec 2, 2024

Yeah, single breaking change.

We should raise an issue with the frictionless data group that we can follow to update here when it's out.

@dangotbanned
Copy link
Member Author

We could follow a simple priority rule where the JSON version keeps the base name flights-200k while the Arrow version becomes flights-200k-arrow, following a general precedence of, perhaps, CSV > JSON > Parquet > Arrow for base names.

@dsmedia I realise I didn't explain why I suggested weather.csv should be renamed but weather.json should not.
weather.json was added 2 years prior to weather.csv

Could you explain the reasoning for flights-200k-arrow vs flights-200k.arrow?
Both seems to allowed per the definition of name, but the former introduces a fourth way to identify a arrow resource.

We can currently use:

  1. path
  2. format
  3. mediatype

"name": "flights-200k",
"type": "table",
"path": "flights-200k.arrow",
"scheme": "file",
"format": "arrow",
"mediatype": "application/vnd.apache.arrow.file",
"schema": {


As an aside, I've got some local changes to (vega/altair#3631) that are utilizing datapackage.json already.
For that, I'm exclusively using path to extract the name as there is another minor issue that name must be lowercase:

It MUST consist only of lowercase alphanumeric characters plus “.”, “-” and “_”.

I don't think that is an issue for the URL, but there are some camelCase names I'd preserved in their python representation:

https://github.com/vega/altair/blob/a982759715061c436ea93aea8234cd04dfca4657/altair/datasets/_typing.py#L72-L74

@dsmedia
Copy link
Collaborator

dsmedia commented Dec 3, 2024

My concern is only with the non-unique "name" appearing twice in the resources portion of the json. (The filenames themselves are fine in this case.)

{
"name": "flights-200k",
"type": "table",

{
"name": "flights-200k",
"type": "table",

Here is the relevant json spec:
name [required]

A resource MUST contain a name property. The name is a simple name or identifier to be used for this resource.
It MUST be unique amongst all resources in this data package.

My suggestion above was just for how the script would be designed to avoid duplicate Resources names.

@dangotbanned
Copy link
Member Author

dangotbanned commented Dec 3, 2024

(#633 (comment))

@dsmedia I'm going to try and illustrate the issues I have a little better.

This is how each form is interpreted by pathlib.Path

from pathlib import Path

CURRENT = "flights-200k"
DOT = "flights-200k.arrow" # (https://github.com/vega/vega-datasets/issues/633#issuecomment-2511517256)
HYPHEN = "flights-200k-arrow" # (https://github.com/vega/vega-datasets/issues/633#issuecomment-2511334248)

>>> Path(CURRENT).name, Path(DOT).name, Path(HYPHEN).name
('flights-200k', 'flights-200k.arrow', 'flights-200k-arrow')

>>> Path(CURRENT).stem, Path(DOT).stem, Path(HYPHEN).stem
('flights-200k', 'flights-200k', 'flights-200k-arrow')

>>> Path(CURRENT).suffix, Path(DOT).suffix, Path(HYPHEN).suffix
('', '.arrow', '')

The first problem is that using a - breaks these basic splitting operations.
By using a . we can very easily detect representations of the same dataset:

>>> Path("flights-200k.json").stem == Path("flights-200k.arrow").stem
True

The second issue relates to:

We could follow a simple priority rule where the JSON version keeps the base name flights-200k while the Arrow version becomes flights-200k-arrow
following a general precedence of, perhaps, CSV > JSON > Parquet > Arrow for base names.

Important

I'd consider this a leaky abstraction

If one wants to use the name for anything, they need to check:

  1. Does the name contain any -?
    • Remember that - could be part of the actual file name
  2. Does the final segment ("seg1-seg2-seg3") match one of these established suffixes?
    • Are there any new suffixes since last checking?
    • Has the precedence changed?
  3. Are there any other names with a similar prefix?

The additional complexity wouldn't be something I'd want to model in (altair#3631/commits/909e7d0).
I would continue using path as I could more reliably derive the name and detect collisions.
There would be no need to concern myself with how vega-datasets prioritizes naming conventions.


My suggestion above was just for how the script would be designed to avoid duplicate Resources names.

For this specific case, we'd only need to make the following change:

diff --git a/datapackage.json b/datapackage.json
index 2105719..4158b86 100644
--- a/datapackage.json
+++ b/datapackage.json
@@ -625,7 +625,7 @@
       }
     },
     {
-      "name": "flights-200k",
+      "name": "flights-200k.arrow",
       "type": "table",
       "path": "flights-200k.arrow",
       "scheme": "file",
diff --git a/scripts/build_datapackage.py b/scripts/build_datapackage.py
index 30834a6..ffb7661 100755
--- a/scripts/build_datapackage.py
+++ b/scripts/build_datapackage.py
@@ -188,7 +188,7 @@ class ResourceAdapter:
     def _extract_file_parts(cls, source: Path, /) -> dict[PathMeta, str]:
         """Metadata that can be inferred from the file path *alone*."""
         parts = {
-            "name": source.stem,
+            "name": source.name,
             "path": source.name,
             "format": source.suffix[1:],
             "scheme": "file",

A more generalized solution (doing this for formats other than arrow) would look a little different.
Something like:

  1. Check for duplicated Path.stem upfront
  2. In those cases, use Path.name (include the suffix)

Although, I think it would be simpler to just use Path.name in all cases - regardless of duplicates.

We already have the guarantee of uniqueness provided by the filesystem.
Adding or removing alternate formats will not impact the name of an existing file

@domoritz
Copy link
Member

domoritz commented Dec 3, 2024

The issue for doing major version bumps: vega/vega#3990. I'm planning to get to this over the winter break. We can include breaking changes from file renaming etc as well so let me know what you decide makes the most sense.

@dsmedia
Copy link
Collaborator

dsmedia commented Dec 4, 2024

@dangotbanned This sounds quite reasonable: "Path.name in all cases - regardless of duplicates." The filesystem-based uniqueness guarantee is ideal. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants