Conversion of SOURCES.md to SOURCES.yaml #634
Comments
@dsmedia just to echo #631 (comment)
If we wanted a non- vega-datasets/scripts/build_datapackage.py Lines 231 to 234 in 719c388
I'm not sure how familiar you are with |
Sounds good to me. I don't mind either format and having automated checks sounds great. |
Might something like this work for a TOML format, containing resource-level (i.e. dataset level) description, source and license information? This is just a proof-of-concept that includes three of the datasets:
SOURCES.toml
# Package-level license information
[package.license]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"
# Resource metadata
budget.description = "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"
[budget.sources]
title = "Office of Management and Budget (U.S.)"
path = "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"
countries.description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""
[[countries.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"
[[countries.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"
[countries.licenses]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"
gapminder.description = """
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""
[[gapminder.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"
[[gapminder.sources]]
title = "Gapminder Foundation - Population"
path = "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
version = "v7"
[[gapminder.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"
[[gapminder.sources]]
title = "Gapminder Foundation - Data Geographies"
path = "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
version = "v2"
[gapminder.licenses]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"
Here are the relevant excerpts from the data package definition and data resource definition:
sources
licenses
|
Looks good. Why are some entries in single [ and some in double [[? |
@dsmedia I like the look of the proposed I wanted to provide a comparison to what The main difference is that the bulk of the content is within
|
@domoritz interesting timing on this - hopefully the examples in (#634 (comment)) can explain the nesting. I'm not sure @dsmedia's sample would translate directly into the target schema. This is what
|
Got it. So something like the following?
sources.toml
"$schema" = "https://datapackage.org/profiles/2.0/datapackage.json"
# Package-level metadata using inline tables
name = "vega-datasets"
description = "Common repository for example datasets used by Vega related projects."
homepage = "http://github.com/vega/vega-datasets.git"
sources = [
{ path = "https://github.com/vega/vega-datasets/blob/next/SOURCES.md" },
]
contributors = [
{ title = "UW Interactive Data Lab", path = "http://idl.cs.washington.edu" },
]
version = "2.11.0"
created = "2024-12-01T15:50:47.863271+00:00"
[[licenses]]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"
# Resources array
[[resources]]
name = "budget"
path = "budget.json"
description = "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"
[[resources.sources]]
title = "Office of Management and Budget (U.S.)"
path = "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"
[[resources]]
name = "countries"
path = "countries.json"
description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""
[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"
[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"
[[resources.licenses]]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"
[[resources]]
name = "gapminder"
path = "gapminder.json"
description = """
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""
[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"
[[resources.sources]]
title = "Gapminder Foundation - Population"
path = "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
version = "v7"
[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"
[[resources.sources]]
title = "Gapminder Foundation - Data Geographies"
path = "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
version = "v2"
[[resources.licenses]]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"
|
@dsmedia #634 (comment) looks good
Happy for this to be moved out of
version = "2.11.0"
created = "2024-12-01T15:50:47.863271+00:00"
Whatever is in the But I imagine it could be helpful to manually define the parts we discover are detected incorrectly like:
vega-datasets/datapackage.json Lines 542 to 547 in 719c388
vega-datasets/datapackage.json Lines 1229 to 1234 in 719c388
vega-datasets/datapackage.json Lines 2065 to 2070 in 719c388
vega-datasets/datapackage.json Lines 2091 to 2096 in 719c388
vega-datasets/datapackage.json Lines 2425 to 2430 in 719c388
vega-datasets/datapackage.json Lines 2551 to 2556 in 719c388
vega-datasets/datapackage.json Lines 2681 to 2690 in 719c388
|
For the package authors, we might want to change all of our packages to be "the Vega organization" or something like that. Can we do that separately from this? |
Before migrating the extrinsic metadata of each dataset from Markdown to a machine-readable format, should we agree on a YAML template that would work well with the new Frictionless tooling? Should we make any fields required (such as sources) to ensure future datasets are properly documented before release? Not all datasets have sources now, but we can get those added. What should the YAML file be named?
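If sourcing were made mandatory, an automated check could flag resources that lack it. A minimal sketch, assuming the metadata has already been loaded into a dict shaped like a data package descriptor; the `find_missing_sources` name and the sample data are made up for illustration:

```python
def find_missing_sources(package: dict) -> list[str]:
    """Return names of resources with no non-empty `sources` list."""
    return [
        resource.get("name", "<unnamed>")
        for resource in package.get("resources", [])
        if not resource.get("sources")
    ]


# Made-up example: one documented resource, one without sources.
example = {
    "resources": [
        {"name": "budget", "sources": [{"title": "Office of Management and Budget (U.S.)"}]},
        {"name": "undocumented"},
    ]
}

print(find_missing_sources(example))  # ['undocumented']
```

A check along these lines could run in CI so that a new dataset without sources fails before release.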