Can we rename `weather.csv`? #633

dangotbanned · 2024-11-28T20:08:27Z

Following #631, I was surprised to find that we have two very different datasets named "weather".
I'd expected datasets sharing the same base name/stem to represent the same source.
"flights-200k" is the only other duplicated name - but both the .arrow and .json files represent the same data.

I'm thinking ahead towards (vega/altair#3631 (comment)), where there may be datasets with .json and .parquet versions.
In that world, a guarantee on a shared stem representing the same data would provide options to resolve incompatibilities in (vega/altair#3631 (comment))

I do understand renaming would have to be a breaking change, and should not be taken lightly.

Usage

The following uses jsdelivr-stats looking at the range of the past year.

Currently, altair-viz/vega_datasets is by far the greatest source of traffic - which is pinned on v1.29.0.

Versions

Version	Requests
`v1.29.0`	6,802,781
`v1.31.1`	215,574
`v2.7.0`	190,102

`v1.29.0`

There is a weather.csv for this version - but is not accessible using the python api.
"weather" is defined as a reference to weather.json only.

The effect this has is quite apparent when comparing the traffic per-file:

Dataset (`v1.29.0`)	Requests
`movies.json`	1,804,118
`population.json`	654,426
...	...
40 most popular	5,000+
`weather.json`	2,154
`weather.csv`	31

Given that v1.29.0 is baked into the package - none of this usage would be impacted by the change.

jsdelivr-npm seems to be able to handle version-fallback in the event that anyone is using @latest - so maybe the impact of this change is fairly limited?

Descriptions

weather.json

vega-datasets/SOURCES.md

Lines 403 to 405 in 719c388

    
           ## `weather.json` 
        
           Instructional dataset showing actual and predicted temperature data.

vega-datasets/datapackage.json

Lines 1207 to 1215 in 719c388

    
           { 
        
             "name": "weather", 
        
             "type": "json", 
        
             "path": "weather.json", 
        
             "scheme": "file", 
        
             "format": "json", 
        
             "mediatype": "text/json", 
        
             "encoding": "utf-8" 
        
           },

weather.csv

vega-datasets/SOURCES.md

Lines 407 to 409 in 719c388

    
           ## `weather.csv` 
        
           Data from [NOAA](http://www.ncdc.noaa.gov/cdo-web/datatools/findstation). Transformed using `/scripts/weather.py`. We synthesized the categorical "weather" field from multiple fields in the original dataset. This data is intended for instructional purposes.

vega-datasets/datapackage.json

Lines 1679 to 1718 in 719c388

    
             "name": "weather", 
        
             "type": "table", 
        
             "path": "weather.csv", 
        
             "scheme": "file", 
        
             "format": "csv", 
        
             "mediatype": "text/csv", 
        
             "encoding": "utf-8", 
        
             "schema": { 
        
               "fields": [ 
        
                 { 
        
                   "name": "location", 
        
                   "type": "string" 
        
                 }, 
        
                 { 
        
                   "name": "date", 
        
                   "type": "date" 
        
                 }, 
        
                 { 
        
                   "name": "precipitation", 
        
                   "type": "number" 
        
                 }, 
        
                 { 
        
                   "name": "temp_max", 
        
                   "type": "number" 
        
                 }, 
        
                 { 
        
                   "name": "temp_min", 
        
                   "type": "number" 
        
                 }, 
        
                 { 
        
                   "name": "wind", 
        
                   "type": "number" 
        
                 }, 
        
                 { 
        
                   "name": "weather", 
        
                   "type": "string" 
        
                 } 
        
               ] 
        
             } 
        
           },

Side note

Having a unique name per-resource is part of the spec we haven't met yet.
I'm less interested in that part as we can just include the suffix if needed.

The text was updated successfully, but these errors were encountered:

dsmedia · 2024-11-29T03:19:55Z

Just for context on the spec itself, below is the language from datapackage.org, which advises against, but doesn't appear to preclude, a resource name differing from its filename (ex-extension). It sounds like a good practice to keep the filename roots distinct, which would have to be weighed against the cost of a breaking change to an existing filename. Probably a call for @domoritz .

A resource MUST contain a name property. The name is a simple name or identifier to be used for this resource.

It MUST be unique amongst all resources in this data package.

It SHOULD be human-readable and consist only of lowercase English alphanumeric characters plus ., - and _.

It would be usual for the name to correspond to the file name (minus the extension) of the data file the resource describes.

domoritz · 2024-11-29T23:54:39Z

I think we can rename but should do a major version bump of it's not backwards compatible. I'm okay with doing that alongside moving to esm for example.

dsmedia · 2024-12-02T11:55:16Z

@dangotbanned wrote:

"flights-200k" is the only other duplicated name - but both the .arrow and .json files represent the same data.

Side note

Having a unique name per-resource is part of the spec we haven't met yet. I'm less interested in that part as we can just include the suffix if needed.

It does seem like a limitation of Frictionless Data to not accommodate multiple file formats of the same underlying dataset under a single logical name. Perhaps the Frictionless Data specification maintainers will address this use case in a future version.

For now, though, since we're already going to introduce a breaking change to resolve weather, should we bite the bullet and resolve the name conflict for flights-200k, the only remaining spec-breaking duplicate name, in order to be compliant with the Frictionless spec?

We could follow a simple priority rule where the JSON version keeps the base name flights-200k while the Arrow version becomes flights-200k-arrow, following a general precedence of, perhaps, CSV > JSON > Parquet > Arrow for base names. This would:

Keep the most human-readable format as the default name
Only add format suffixes when needed to resolve conflicts
Achieve full spec compliance with minimal renaming

Would it make sense to handle both naming conflicts in a single breaking change?

domoritz · 2024-12-02T12:35:08Z

Yeah, single breaking change.

We should raise an issue with the frictionless data group that we can follow to update here when it's out.

dangotbanned · 2024-12-02T13:15:48Z

We could follow a simple priority rule where the JSON version keeps the base name flights-200k while the Arrow version becomes flights-200k-arrow, following a general precedence of, perhaps, CSV > JSON > Parquet > Arrow for base names.

@dsmedia I realise I didn't explain why I suggested weather.csv should be renamed but weather.json should not.
weather.json was added 2 years prior to weather.csv

Could you explain the reasoning for flights-200k-arrow vs flights-200k.arrow?
Both seems to allowed per the definition of name, but the former introduces a fourth way to identify a arrow resource.

We can currently use:

path
format
mediatype

vega-datasets/datapackage.json

Lines 628 to 634 in 8745f5c

    
           "name": "flights-200k", 
        
           "type": "table", 
        
           "path": "flights-200k.arrow", 
        
           "scheme": "file", 
        
           "format": "arrow", 
        
           "mediatype": "application/vnd.apache.arrow.file", 
        
           "schema": {

As an aside, I've got some local changes to (vega/altair#3631) that are utilizing datapackage.json already.
For that, I'm exclusively using path to extract the name as there is another minor issue that name must be lowercase:

It MUST consist only of lowercase alphanumeric characters plus “.”, “-” and “_”.

I don't think that is an issue for the URL, but there are some camelCase names I'd preserved in their python representation:

https://github.com/vega/altair/blob/a982759715061c436ea93aea8234cd04dfca4657/altair/datasets/_typing.py#L72-L74

dsmedia · 2024-12-03T00:52:48Z

My concern is only with the non-unique "name" appearing twice in the resources portion of the json. (The filenames themselves are fine in this case.)

vega-datasets/datapackage.json

Lines 597 to 599 in 8745f5c

    
           { 
        
             "name": "flights-200k", 
        
             "type": "table",

vega-datasets/datapackage.json

Lines 627 to 629 in 8745f5c

    
           { 
        
             "name": "flights-200k", 
        
             "type": "table",

Here is the relevant json spec:
name [required]

A resource MUST contain a name property. The name is a simple name or identifier to be used for this resource.
It MUST be unique amongst all resources in this data package.

My suggestion above was just for how the script would be designed to avoid duplicate Resources names.

dangotbanned · 2024-12-03T13:24:57Z

(#633 (comment))

@dsmedia I'm going to try and illustrate the issues I have a little better.

This is how each form is interpreted by pathlib.Path

from pathlib import Path

CURRENT = "flights-200k"
DOT = "flights-200k.arrow" # (https://github.com/vega/vega-datasets/issues/633#issuecomment-2511517256)
HYPHEN = "flights-200k-arrow" # (https://github.com/vega/vega-datasets/issues/633#issuecomment-2511334248)

>>> Path(CURRENT).name, Path(DOT).name, Path(HYPHEN).name
('flights-200k', 'flights-200k.arrow', 'flights-200k-arrow')

>>> Path(CURRENT).stem, Path(DOT).stem, Path(HYPHEN).stem
('flights-200k', 'flights-200k', 'flights-200k-arrow')

>>> Path(CURRENT).suffix, Path(DOT).suffix, Path(HYPHEN).suffix
('', '.arrow', '')

The first problem is that using a - breaks these basic splitting operations.
By using a . we can very easily detect representations of the same dataset:

>>> Path("flights-200k.json").stem == Path("flights-200k.arrow").stem
True

The second issue relates to:

We could follow a simple priority rule where the JSON version keeps the base name flights-200k while the Arrow version becomes flights-200k-arrow
following a general precedence of, perhaps, CSV > JSON > Parquet > Arrow for base names.

Important

I'd consider this a leaky abstraction

If one wants to use the name for anything, they need to check:

Does the name contain any -?
- Remember that - could be part of the actual file name
Does the final segment ("seg1-seg2-seg3") match one of these established suffixes?
- Are there any new suffixes since last checking?
- Has the precedence changed?
Are there any other names with a similar prefix?

The additional complexity wouldn't be something I'd want to model in (altair#3631/commits/909e7d0).
I would continue using path as I could more reliably derive the name and detect collisions.
There would be no need to concern myself with how vega-datasets prioritizes naming conventions.

My suggestion above was just for how the script would be designed to avoid duplicate Resources names.

For this specific case, we'd only need to make the following change:

diff --git a/datapackage.json b/datapackage.json
index 2105719..4158b86 100644
--- a/datapackage.json
+++ b/datapackage.json
@@ -625,7 +625,7 @@
       }
     },
     {
-      "name": "flights-200k",
+      "name": "flights-200k.arrow",
       "type": "table",
       "path": "flights-200k.arrow",
       "scheme": "file",
diff --git a/scripts/build_datapackage.py b/scripts/build_datapackage.py
index 30834a6..ffb7661 100755
--- a/scripts/build_datapackage.py
+++ b/scripts/build_datapackage.py
@@ -188,7 +188,7 @@ class ResourceAdapter:
     def _extract_file_parts(cls, source: Path, /) -> dict[PathMeta, str]:
         """Metadata that can be inferred from the file path *alone*."""
         parts = {
-            "name": source.stem,
+            "name": source.name,
             "path": source.name,
             "format": source.suffix[1:],
             "scheme": "file",

A more generalized solution (doing this for formats other than arrow) would look a little different.
Something like:

Check for duplicated Path.stem upfront
In those cases, use Path.name (include the suffix)

Although, I think it would be simpler to just use Path.name in all cases - regardless of duplicates.

We already have the guarantee of uniqueness provided by the filesystem.
Adding or removing alternate formats will not impact the name of an existing file

domoritz · 2024-12-03T21:46:51Z

The issue for doing major version bumps: vega/vega#3990. I'm planning to get to this over the winter break. We can include breaking changes from file renaming etc as well so let me know what you decide makes the most sense.

dsmedia · 2024-12-04T02:39:27Z

@dangotbanned This sounds quite reasonable: "Path.name in all cases - regardless of duplicates." The filesystem-based uniqueness guarantee is ideal. Thanks!

dangotbanned added the question label Nov 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can we rename `weather.csv`? #633

Can we rename `weather.csv`? #633

dangotbanned commented Nov 28, 2024

dsmedia commented Nov 29, 2024

domoritz commented Nov 29, 2024

dsmedia commented Dec 2, 2024

Side note

domoritz commented Dec 2, 2024

dangotbanned commented Dec 2, 2024

dsmedia commented Dec 3, 2024

dangotbanned commented Dec 3, 2024 •

edited

Loading

domoritz commented Dec 3, 2024

dsmedia commented Dec 4, 2024

Can we rename weather.csv? #633

Can we rename weather.csv? #633

Comments

dangotbanned commented Nov 28, 2024

Usage

Versions

v1.29.0

Descriptions

Side note

dsmedia commented Nov 29, 2024

domoritz commented Nov 29, 2024

dsmedia commented Dec 2, 2024

Side note

domoritz commented Dec 2, 2024

dangotbanned commented Dec 2, 2024

dsmedia commented Dec 3, 2024

dangotbanned commented Dec 3, 2024 • edited Loading

domoritz commented Dec 3, 2024

dsmedia commented Dec 4, 2024

Can we rename `weather.csv`? #633

Can we rename `weather.csv`? #633

`v1.29.0`

dangotbanned commented Dec 3, 2024 •

edited

Loading