Conversion of SOURCES.md to SOURCES.yaml #634

Open
dsmedia opened this issue Nov 29, 2024 · 9 comments

Comments

@dsmedia
Collaborator

dsmedia commented Nov 29, 2024

Before migrating each dataset's extrinsic metadata from Markdown to a machine-readable format, should we agree on a YAML template that works well with the new Frictionless tooling? Should we make any fields required (such as sourcing) to ensure future datasets are properly documented before release? Not all datasets have sources now, but we can add them. What should the YAML file be named?

@dangotbanned
Member

@dsmedia just to echo #631 (comment)

Is there a big benefit to including YAML in addition to JSON? JSON is much more common (and the only one of the two natively supported in Python/JS), and the readability difference is small enough that I would say let's only have JSON.

@domoritz having yaml doesn't benefit me personally, just thought I'd provide the options @dsmedia mentioned in #629 (comment):

Just thinking out loud, but instead of directly maintaining the sources.md file, we could keep the dataset metadata in a json or yaml file, and generate the sources.md file from this machine-readable format.
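A minimal sketch of that idea, assuming the metadata is kept as JSON keyed by dataset name. The field names (`description`, `sources`, `title`, `path`) follow the Data Package vocabulary; the function name `render_sources_md` and the Markdown layout are hypothetical, not the project's actual script:

```python
import json


def render_sources_md(metadata: dict) -> str:
    """Render per-dataset metadata as Markdown (hypothetical layout)."""
    lines = ["# Sources", ""]
    for name in sorted(metadata):
        entry = metadata[name]
        lines.append(f"## {name}")
        lines.append("")
        if "description" in entry:
            lines.append(entry["description"].strip())
            lines.append("")
        for source in entry.get("sources", []):
            # Fall back to the path when a source has no title.
            title = source.get("title", source.get("path", ""))
            path = source.get("path")
            lines.append(f"- [{title}]({path})" if path else f"- {title}")
        if entry.get("sources"):
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Illustrative input only; a real file would be read with json.load().
example = json.loads("""
{
  "budget": {
    "description": "Budget FY 2016 - Receipts data",
    "sources": [{"title": "Office of Management and Budget (U.S.)",
                 "path": "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"}]
  }
}
""")
markdown = render_sources_md(example)
```

Regenerating the Markdown from one machine-readable file keeps the two from drifting apart.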

I'm happy with just json

If we wanted a non-JSON format, I'd suggest TOML, since Python can read it natively (tomllib, Python 3.11+).
For the extrinsic fields you mentioned in (#631 (comment)), I imagine the TOML array-of-tables syntax would be handy.

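from collections.abc import Sequence
from typing import TypedDict

# Source and License are assumed to be defined alongside (e.g. in build_datapackage.py)
class ResourceMeta(TypedDict, total=False):
    description: str
    sources: Sequence[Source]
    licenses: Sequence[License]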

I'm not sure how familiar you are with TypedDicts, but you can enforce whatever Required/NotRequired constraints you like on the hierarchy I started in build_datapackage.py.

@domoritz
Member

Sounds good to me. I don't mind either format and having automated checks sounds great.

@dsmedia
Collaborator Author

dsmedia commented Dec 1, 2024

Might something like this work for a TOML format containing resource-level (i.e. dataset-level) description, source, and license information? This is just a proof of concept that includes three of the datasets: budget.json, countries.json, and gapminder.json. (I've also pulled into this file the package-level license information that is now hard-coded in the generation script, to separate configuration from code.) I assume the generation script will be able to match these entries to the resources by resource name (i.e. the filename without the extension). At a later stage, resource-level column descriptions (for tabular data) could be added to the TOML file to supplement the column names and types identified by the script.

SOURCES.toml
# Package-level license information
[package.license]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"

# Resource metadata

[budget]
description = "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"

[budget.sources]
title = "Office of Management and Budget (U.S.)" 
path = "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"

[countries]
description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[[countries.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[countries.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[countries.licenses]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

[gapminder]
description = """
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[[gapminder.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[gapminder.sources]]
title = "Gapminder Foundation - Population"
path = "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
version = "v7"

[[gapminder.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[gapminder.sources]]
title = "Gapminder Foundation - Data Geographies"
path = "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
version = "v2"

[gapminder.licenses]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

Here are the relevant excerpts from the data package definition and the data resource definition:

sources

sources (resource-level)
List of data sources as for Data Package. If not specified the resource inherits from the data package.

sources (package-level)
The raw sources for this data package. It MUST be an array of Source objects. A Source object MUST have at least one property. A Source object is RECOMMENDED to have title property and MAY have path, email, and version properties:

  • title: A string containing a title of the source (e.g. document or organization name).
  • path: A URL or Path, that is a fully qualified HTTP address, or a relative POSIX path.
  • email: A string containing an email address.
  • version: A string containing a version of the source.

An example of the object structure is as follows:

"sources": [{
  "title": "World Bank and OECD",
  "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}]
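The MUST and RECOMMENDED rules in this excerpt could be checked automatically. The following is a sketch only: `check_sources` is a hypothetical helper, not part of the Frictionless tooling, and it reports the RECOMMENDED `title` as a warning rather than an error:

```python
def check_sources(sources: object) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a `sources` value.

    Errors cover the MUST rules; warnings cover the RECOMMENDED title.
    """
    if not isinstance(sources, list):
        return ["sources MUST be an array"], []
    errors: list[str] = []
    warnings: list[str] = []
    for index, source in enumerate(sources):
        if not isinstance(source, dict) or not source:
            errors.append(
                f"sources[{index}] MUST be an object with at least one property"
            )
        elif "title" not in source:
            warnings.append(f"sources[{index}] has no title (RECOMMENDED)")
    return errors, warnings


ok = check_sources([
    {"title": "World Bank and OECD",
     "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"},
])
not_array = check_sources({"title": "World Bank and OECD"})
mixed = check_sources([{}, {"path": "data/raw.csv"}])
```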

licenses

licenses (resource-level)

List of licenses as for Data Package. If not specified the resource inherits from the data package.
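The inheritance rule (for both `sources` and `licenses`) can be sketched as a simple fallback. The helper name `effective` and the sample package below are illustrative assumptions, not code from the repository:

```python
def effective(package: dict, resource: dict, key: str) -> list:
    """Return the resource-level list if specified, else the package-level one."""
    if key in resource:
        return resource[key]
    return package.get(key, [])


# Illustrative descriptor: one resource overrides the package license.
package = {
    "licenses": [{"name": "BSD-3-Clause"}],
    "resources": [
        {"name": "budget"},
        {"name": "gapminder", "licenses": [{"name": "CC-BY-4.0"}]},
    ],
}

resolved = {
    res["name"]: effective(package, res, "licenses")
    for res in package["resources"]
}
```

Here `budget` falls back to the package-level license, while `gapminder` keeps its own.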

licenses (package-level)

The license(s) under which the package is provided.

Caution

This property is not legally binding and does not guarantee the package is licensed under the terms defined in this property.

licenses MUST be an array. Each item in the array is a License. Each MUST be an object. The object MUST contain a name property and/or a path property, and it MAY contain a title property:

  • name: A string containing an Open Definition license ID
  • path: A URL or Path, that is a fully qualified HTTP address, or a relative POSIX path.
  • title: A string containing human-readable title.

An example of using the licenses property:

"licenses": [{
  "name": "ODC-PDDL-1.0",
  "path": "http://opendatacommons.org/licenses/pddl/",
  "title": "Open Data Commons Public Domain Dedication and License v1.0"
}]
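A comparable sketch for the license rules quoted above (array, object items, and `name` and/or `path` present). `check_licenses` is a hypothetical helper, and it does not verify that `name` is a real Open Definition license ID:

```python
def check_licenses(licenses: object) -> list[str]:
    """Return violations of the MUST rules for a `licenses` value."""
    if not isinstance(licenses, list):
        return ["licenses MUST be an array"]
    errors: list[str] = []
    for index, item in enumerate(licenses):
        if not isinstance(item, dict):
            errors.append(f"licenses[{index}] MUST be an object")
        elif "name" not in item and "path" not in item:
            errors.append(f"licenses[{index}] MUST contain name and/or path")
    return errors


valid = check_licenses([
    {"name": "ODC-PDDL-1.0",
     "path": "http://opendatacommons.org/licenses/pddl/",
     "title": "Open Data Commons Public Domain Dedication and License v1.0"},
])
title_only = check_licenses([{"title": "Some license"}])
single_table = check_licenses({"name": "CC-BY-4.0"})
```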

@domoritz
Member

domoritz commented Dec 1, 2024

Looks good. Why are some entries in single [ and some in double [[?

@dangotbanned
Member

dangotbanned commented Dec 1, 2024

@dsmedia I like the look of the proposed SOURCES.toml.

I wanted to provide a comparison to what datapackage.json would look like in toml.

The main difference is that the bulk of the content is within [[resources]] tables (array of tables)

datapackage.toml

name = "vega-datasets"
description = "Common repository for example datasets used by Vega related projects."
homepage = "http://github.com/vega/vega-datasets.git"
sources = [
    { path = "https://github.com/vega/vega-datasets/blob/next/SOURCES.md" },
]
contributors = [
    { title = "UW Interactive Data Lab", path = "http://idl.cs.washington.edu" },
]
version = "2.11.0"
created = "2024-12-01T15:50:47.863271+00:00"

[[licenses]]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"

[[resources]]
name = "7zip"
type = "file"
path = "7zip.png"
scheme = "file"
format = "png"
mediatype = "image/png"
encoding = "utf-8"

[[resources]]
name = "airports"
type = "table"
path = "airports.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "iata", type = "string" },
    { name = "name", type = "string" },
    { name = "city", type = "string" },
    { name = "state", type = "string" },
    { name = "country", type = "string" },
    { name = "latitude", type = "number" },
    { name = "longitude", type = "number" },
]

[[resources]]
name = "annual-precip"
type = "json"
path = "annual-precip.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[[resources]]
name = "anscombe"
type = "table"
path = "anscombe.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Series", type = "string" },
    { name = "X", type = "integer" },
    { name = "Y", type = "number" },
]

[[resources]]
name = "barley"
type = "table"
path = "barley.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "yield", type = "number" },
    { name = "variety", type = "string" },
    { name = "year", type = "integer" },
    { name = "site", type = "string" },
]

[[resources]]
name = "birdstrikes"
type = "table"
path = "birdstrikes.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "Airport Name", type = "string" },
    { name = "Aircraft Make Model", type = "string" },
    { name = "Effect Amount of damage", type = "string" },
    { name = "Flight Date", type = "date" },
    { name = "Aircraft Airline Operator", type = "string" },
    { name = "Origin State", type = "string" },
    { name = "Phase of flight", type = "string" },
    { name = "Wildlife Size", type = "string" },
    { name = "Wildlife Species", type = "string" },
    { name = "Time of day", type = "string" },
    { name = "Cost Other", type = "integer" },
    { name = "Cost Repair", type = "integer" },
    { name = "Cost Total $", type = "integer" },
    { name = "Speed IAS in knots", type = "integer" },
]

[[resources]]
name = "budget"
type = "table"
path = "budget.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Source Category Code", type = "integer" },
    { name = "Source category name", type = "string" },
    { name = "Source subcategory", type = "integer" },
    { name = "Source subcategory name", type = "string" },
    { name = "Agency code", type = "integer" },
    { name = "Agency name", type = "string" },
    { name = "Bureau code", type = "integer" },
    { name = "Bureau name", type = "string" },
    { name = "Account code", type = "integer" },
    { name = "Account name", type = "string" },
    { name = "Treasury Agency code", type = "integer" },
    { name = "On- or off-budget", type = "string" },
    { name = "1962", type = "string" },
    { name = "1963", type = "string" },
    { name = "1964", type = "string" },
    { name = "1965", type = "string" },
    { name = "1966", type = "string" },
    { name = "1967", type = "string" },
    { name = "1968", type = "string" },
    { name = "1969", type = "string" },
    { name = "1970", type = "string" },
    { name = "1971", type = "string" },
    { name = "1972", type = "string" },
    { name = "1973", type = "string" },
    { name = "1974", type = "string" },
    { name = "1975", type = "string" },
    { name = "1976", type = "string" },
    { name = "TQ", type = "string" },
    { name = "1977", type = "string" },
    { name = "1978", type = "string" },
    { name = "1979", type = "string" },
    { name = "1980", type = "string" },
    { name = "1981", type = "string" },
    { name = "1982", type = "string" },
    { name = "1983", type = "string" },
    { name = "1984", type = "string" },
    { name = "1985", type = "string" },
    { name = "1986", type = "string" },
    { name = "1987", type = "string" },
    { name = "1988", type = "string" },
    { name = "1989", type = "string" },
    { name = "1990", type = "string" },
    { name = "1991", type = "string" },
    { name = "1992", type = "string" },
    { name = "1993", type = "string" },
    { name = "1994", type = "string" },
    { name = "1995", type = "string" },
    { name = "1996", type = "string" },
    { name = "1997", type = "string" },
    { name = "1998", type = "string" },
    { name = "1999", type = "string" },
    { name = "2000", type = "string" },
    { name = "2001", type = "string" },
    { name = "2002", type = "string" },
    { name = "2003", type = "string" },
    { name = "2004", type = "string" },
    { name = "2005", type = "string" },
    { name = "2006", type = "string" },
    { name = "2007", type = "string" },
    { name = "2008", type = "string" },
    { name = "2009", type = "string" },
    { name = "2010", type = "string" },
    { name = "2011", type = "string" },
    { name = "2012", type = "string" },
    { name = "2013", type = "string" },
    { name = "2014", type = "string" },
    { name = "2015", type = "string" },
    { name = "2016", type = "string" },
    { name = "2017", type = "string" },
    { name = "2018", type = "string" },
    { name = "2019", type = "string" },
    { name = "2020", type = "string" },
]

[[resources]]
name = "budgets"
type = "table"
path = "budgets.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "budgetYear", type = "integer" },
    { name = "forecastYear", type = "integer" },
    { name = "value", type = "number" },
]

[[resources]]
name = "burtin"
type = "table"
path = "burtin.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Bacteria", type = "string" },
    { name = "Penicillin", type = "number" },
    { name = "Streptomycin", type = "number" },
    { name = "Neomycin", type = "number" },
    { name = "Gram_Staining", type = "string" },
    { name = "Genus", type = "string" },
]

[[resources]]
name = "cars"
type = "table"
path = "cars.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Name", type = "string" },
    { name = "Miles_per_Gallon", type = "integer" },
    { name = "Cylinders", type = "integer" },
    { name = "Displacement", type = "number" },
    { name = "Horsepower", type = "integer" },
    { name = "Weight_in_lbs", type = "integer" },
    { name = "Acceleration", type = "number" },
    { name = "Year", type = "date" },
    { name = "Origin", type = "string" },
]

[[resources]]
name = "co2-concentration"
type = "table"
path = "co2-concentration.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "Date", type = "date" },
    { name = "CO2", type = "number" },
    { name = "adjusted CO2", type = "number" },
]

[[resources]]
name = "countries"
type = "table"
path = "countries.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "_comment", type = "string" },
    { name = "year", type = "integer" },
    { name = "fertility", type = "number" },
    { name = "life_expect", type = "number" },
    { name = "n_fertility", type = "number" },
    { name = "n_life_expect", type = "number" },
    { name = "country", type = "string" },
]

[[resources]]
name = "crimea"
type = "table"
path = "crimea.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "wounds", type = "integer" },
    { name = "other", type = "integer" },
    { name = "disease", type = "integer" },
]

[[resources]]
name = "disasters"
type = "table"
path = "disasters.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "Entity", type = "string" },
    { name = "Year", type = "integer" },
    { name = "Deaths", type = "integer" },
]

[[resources]]
name = "driving"
type = "table"
path = "driving.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "side", type = "string" },
    { name = "year", type = "integer" },
    { name = "miles", type = "integer" },
    { name = "gas", type = "number" },
]

[[resources]]
name = "earthquakes"
type = "json"
path = "earthquakes.json"
scheme = "file"
format = "geojson"
mediatype = "text/geojson"
encoding = "utf-8"

[[resources]]
name = "ffox"
type = "file"
path = "ffox.png"
scheme = "file"
format = "png"
mediatype = "image/png"
encoding = "utf-8"

[[resources]]
name = "flare-dependencies"
type = "table"
path = "flare-dependencies.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "source", type = "integer" },
    { name = "target", type = "integer" },
]

[[resources]]
name = "flare"
type = "table"
path = "flare.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "id", type = "integer" },
    { name = "name", type = "string" },
]

[[resources]]
name = "flights-10k"
type = "table"
path = "flights-10k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-200k"
type = "table"
path = "flights-200k.arrow"
scheme = "file"
format = "arrow"
mediatype = "application/vnd.apache.arrow.file"

[resources.schema]
fields = [
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "time", type = "number" },
]

[[resources]]
name = "flights-200k"
type = "table"
path = "flights-200k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "time", type = "number" },
]

[[resources]]
name = "flights-20k"
type = "table"
path = "flights-20k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-2k"
type = "table"
path = "flights-2k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-3m"
type = "table"
path = "flights-3m.parquet"
scheme = "file"
format = "parquet"
mediatype = "application/parquet"

[resources.schema]
fields = [
    { name = "date", type = "integer" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-5k"
type = "table"
path = "flights-5k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-airport"
type = "table"
path = "flights-airport.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
    { name = "count", type = "integer" },
]

[[resources]]
name = "football"
type = "table"
path = "football.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "division", type = "string" },
    { name = "home_team", type = "string" },
    { name = "away_team", type = "string" },
    { name = "home_score", type = "integer" },
    { name = "away_score", type = "integer" },
]

[[resources]]
name = "gapminder-health-income"
type = "table"
path = "gapminder-health-income.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "country", type = "string" },
    { name = "income", type = "integer" },
    { name = "health", type = "number" },
    { name = "population", type = "integer" },
    { name = "region", type = "string" },
]

[[resources]]
name = "gapminder"
type = "table"
path = "gapminder.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "year", type = "integer" },
    { name = "country", type = "string" },
    { name = "cluster", type = "integer" },
    { name = "pop", type = "integer" },
    { name = "life_expect", type = "number" },
    { name = "fertility", type = "number" },
]

[[resources]]
name = "gimp"
type = "file"
path = "gimp.png"
scheme = "file"
format = "png"
mediatype = "image/png"
encoding = "utf-8"

[[resources]]
name = "github"
type = "table"
path = "github.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "time", type = "string" },
    { name = "count", type = "integer" },
]

[[resources]]
name = "global-temp"
type = "table"
path = "global-temp.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "year", type = "integer" },
    { name = "temp", type = "number" },
]

[[resources]]
name = "income"
type = "table"
path = "income.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "name", type = "string" },
    { name = "region", type = "string" },
    { name = "id", type = "integer" },
    { name = "pct", type = "number" },
    { name = "total", type = "integer" },
    { name = "group", type = "string" },
]

[[resources]]
name = "iowa-electricity"
type = "table"
path = "iowa-electricity.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "year", type = "date" },
    { name = "source", type = "string" },
    { name = "net_generation", type = "integer" },
]

[[resources]]
name = "jobs"
type = "table"
path = "jobs.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "job", type = "string" },
    { name = "sex", type = "string" },
    { name = "year", type = "integer" },
    { name = "count", type = "integer" },
    { name = "perc", type = "number" },
]

[[resources]]
name = "la-riots"
type = "table"
path = "la-riots.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "first_name", type = "string" },
    { name = "last_name", type = "string" },
    { name = "age", type = "integer" },
    { name = "gender", type = "string" },
    { name = "race", type = "string" },
    { name = "death_date", type = "date" },
    { name = "address", type = "string" },
    { name = "neighborhood", type = "string" },
    { name = "type", type = "string" },
    { name = "longitude", type = "number" },
    { name = "latitude", type = "number" },
]

[[resources]]
name = "londonboroughs"
type = "json"
path = "londonBoroughs.json"
scheme = "file"
format = "topojson"
mediatype = "text/topojson"
encoding = "utf-8"

[[resources]]
name = "londoncentroids"
type = "table"
path = "londonCentroids.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "name", type = "string" },
    { name = "cx", type = "number" },
    { name = "cy", type = "number" },
]

[[resources]]
name = "londontubelines"
type = "json"
path = "londonTubeLines.json"
scheme = "file"
format = "topojson"
mediatype = "text/topojson"
encoding = "utf-8"

[[resources]]
name = "lookup_groups"
type = "table"
path = "lookup_groups.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "group", type = "integer" },
    { name = "person", type = "string" },
]

[[resources]]
name = "lookup_people"
type = "table"
path = "lookup_people.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "name", type = "string" },
    { name = "age", type = "integer" },
    { name = "height", type = "integer" },
]

[[resources]]
name = "miserables"
type = "json"
path = "miserables.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[[resources]]
name = "monarchs"
type = "table"
path = "monarchs.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "name", type = "string" },
    { name = "start", type = "integer" },
    { name = "end", type = "integer" },
    { name = "index", type = "integer" },
]

[[resources]]
name = "movies"
type = "table"
path = "movies.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Title", type = "string" },
    { name = "US Gross", type = "integer" },
    { name = "Worldwide Gross", type = "integer" },
    { name = "US DVD Sales", type = "integer" },
    { name = "Production Budget", type = "integer" },
    { name = "Release Date", type = "string" },
    { name = "MPAA Rating", type = "string" },
    { name = "Running Time min", type = "integer" },
    { name = "Distributor", type = "string" },
    { name = "Source", type = "string" },
    { name = "Major Genre", type = "string" },
    { name = "Creative Type", type = "string" },
    { name = "Director", type = "string" },
    { name = "Rotten Tomatoes Rating", type = "integer" },
    { name = "IMDB Rating", type = "number" },
    { name = "IMDB Votes", type = "integer" },
]

[[resources]]
name = "normal-2d"
type = "table"
path = "normal-2d.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "u", type = "number" },
    { name = "v", type = "number" },
]

[[resources]]
name = "obesity"
type = "table"
path = "obesity.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "id", type = "integer" },
    { name = "rate", type = "number" },
    { name = "state", type = "string" },
]

[[resources]]
name = "ohlc"
type = "table"
path = "ohlc.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "open", type = "number" },
    { name = "high", type = "number" },
    { name = "low", type = "number" },
    { name = "close", type = "number" },
    { name = "signal", type = "string" },
    { name = "ret", type = "number" },
]

[[resources]]
name = "penguins"
type = "table"
path = "penguins.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Species", type = "string" },
    { name = "Island", type = "string" },
    { name = "Beak Length (mm)", type = "number" },
    { name = "Beak Depth (mm)", type = "number" },
    { name = "Flipper Length (mm)", type = "integer" },
    { name = "Body Mass (g)", type = "integer" },
    { name = "Sex", type = "string" },
]

[[resources]]
name = "platformer-terrain"
type = "table"
path = "platformer-terrain.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "x", type = "integer" },
    { name = "y", type = "integer" },
    { name = "lumosity", type = "number" },
    { name = "saturation", type = "integer" },
    { name = "name", type = "string" },
    { name = "id", type = "string" },
    { name = "color", type = "string" },
    { name = "key", type = "string" },
]

[[resources]]
name = "points"
type = "table"
path = "points.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "x", type = "number" },
    { name = "y", type = "number" },
]

[[resources]]
name = "political-contributions"
type = "table"
path = "political-contributions.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Candidate_Identification", type = "string" },
    { name = "Candidate_Name", type = "string" },
    { name = "Incumbent_Challenger_Status", type = "string" },
    { name = "Party_Code", type = "integer" },
    { name = "Party_Affiliation", type = "string" },
    { name = "Total_Receipts", type = "number" },
    { name = "Transfers_from_Authorized_Committees", type = "integer" },
    { name = "Total_Disbursements", type = "number" },
    { name = "Transfers_to_Authorized_Committees", type = "number" },
    { name = "Beginning_Cash", type = "number" },
    { name = "Ending_Cash", type = "number" },
    { name = "Contributions_from_Candidate", type = "number" },
    { name = "Loans_from_Candidate", type = "integer" },
    { name = "Other_Loans", type = "integer" },
    { name = "Candidate_Loan_Repayments", type = "number" },
    { name = "Other_Loan_Repayments", type = "integer" },
    { name = "Debts_Owed_By", type = "number" },
    { name = "Total_Individual_Contributions", type = "integer" },
    { name = "Candidate_State", type = "string" },
    { name = "Candidate_District", type = "integer" },
    { name = "Contributions_from_Other_Political_Committees", type = "integer" },
    { name = "Contributions_from_Party_Committees", type = "integer" },
    { name = "Coverage_End_Date", type = "string" },
    { name = "Refunds_to_Individuals", type = "integer" },
    { name = "Refunds_to_Committees", type = "integer" },
]

[[resources]]
name = "population"
type = "table"
path = "population.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "year", type = "integer" },
    { name = "age", type = "integer" },
    { name = "sex", type = "integer" },
    { name = "people", type = "integer" },
]

[[resources]]
name = "population_engineers_hurricanes"
type = "table"
path = "population_engineers_hurricanes.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "state", type = "string" },
    { name = "id", type = "integer" },
    { name = "population", type = "integer" },
    { name = "engineers", type = "number" },
    { name = "hurricanes", type = "integer" },
]

[[resources]]
name = "seattle-weather-hourly-normals"
type = "table"
path = "seattle-weather-hourly-normals.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "date", type = "datetime" },
    { name = "pressure", type = "number" },
    { name = "temperature", type = "number" },
    { name = "wind", type = "number" },
]

[[resources]]
name = "seattle-weather"
type = "table"
path = "seattle-weather.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "precipitation", type = "number" },
    { name = "temp_max", type = "number" },
    { name = "temp_min", type = "number" },
    { name = "wind", type = "number" },
    { name = "weather", type = "string" },
]

[[resources]]
name = "sp500-2000"
type = "table"
path = "sp500-2000.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "open", type = "number" },
    { name = "high", type = "number" },
    { name = "low", type = "number" },
    { name = "close", type = "number" },
    { name = "adjclose", type = "number" },
    { name = "volume", type = "integer" },
]

[[resources]]
name = "sp500"
type = "table"
path = "sp500.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "price", type = "number" },
]

[[resources]]
name = "stocks"
type = "table"
path = "stocks.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "symbol", type = "string" },
    { name = "date", type = "string" },
    { name = "price", type = "number" },
]

[[resources]]
name = "udistrict"
type = "table"
path = "udistrict.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "key", type = "string" },
    { name = "lat", type = "number" },
]

[[resources]]
name = "unemployment-across-industries"
type = "table"
path = "unemployment-across-industries.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "series", type = "string" },
    { name = "year", type = "integer" },
    { name = "month", type = "integer" },
    { name = "count", type = "integer" },
    { name = "rate", type = "number" },
    { name = "date", type = "datetime" },
]

[[resources]]
name = "unemployment"
type = "table"
path = "unemployment.tsv"
scheme = "file"
format = "tsv"
mediatype = "text/tsv"
encoding = "utf-8"

[resources.dialect.csv]
delimiter = "	"

[resources.schema]
fields = [
    { name = "id", type = "integer" },
    { name = "rate", type = "number" },
]

[[resources]]
name = "uniform-2d"
type = "table"
path = "uniform-2d.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "u", type = "number" },
    { name = "v", type = "number" },
]

[[resources]]
name = "us-10m"
type = "json"
path = "us-10m.json"
scheme = "file"
format = "topojson"
mediatype = "text/topojson"
encoding = "utf-8"

[[resources]]
name = "us-employment"
type = "table"
path = "us-employment.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "month", type = "date" },
    { name = "nonfarm", type = "integer" },
    { name = "private", type = "integer" },
    { name = "goods_producing", type = "integer" },
    { name = "service_providing", type = "integer" },
    { name = "private_service_providing", type = "integer" },
    { name = "mining_and_logging", type = "integer" },
    { name = "construction", type = "integer" },
    { name = "manufacturing", type = "integer" },
    { name = "durable_goods", type = "integer" },
    { name = "nondurable_goods", type = "integer" },
    { name = "trade_transportation_utilties", type = "integer" },
    { name = "wholesale_trade", type = "number" },
    { name = "retail_trade", type = "number" },
    { name = "transportation_and_warehousing", type = "number" },
    { name = "utilities", type = "number" },
    { name = "information", type = "integer" },
    { name = "financial_activities", type = "integer" },
    { name = "professional_and_business_services", type = "integer" },
    { name = "education_and_health_services", type = "integer" },
    { name = "leisure_and_hospitality", type = "integer" },
    { name = "other_services", type = "integer" },
    { name = "government", type = "integer" },
    { name = "nonfarm_change", type = "integer" },
]

[[resources]]
name = "us-state-capitals"
type = "table"
path = "us-state-capitals.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "lon", type = "number" },
    { name = "lat", type = "number" },
    { name = "state", type = "string" },
    { name = "city", type = "string" },
]

[[resources]]
name = "volcano"
type = "json"
path = "volcano.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[[resources]]
name = "weather"
type = "table"
path = "weather.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "location", type = "string" },
    { name = "date", type = "date" },
    { name = "precipitation", type = "number" },
    { name = "temp_max", type = "number" },
    { name = "temp_min", type = "number" },
    { name = "wind", type = "number" },
    { name = "weather", type = "string" },
]

[[resources]]
name = "weather"
type = "json"
path = "weather.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[[resources]]
name = "wheat"
type = "table"
path = "wheat.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "year", type = "integer" },
    { name = "wheat", type = "number" },
    { name = "wages", type = "number" },
]

[[resources]]
name = "windvectors"
type = "table"
path = "windvectors.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "longitude", type = "number" },
    { name = "latitude", type = "number" },
    { name = "dir", type = "integer" },
    { name = "dirCat", type = "integer" },
    { name = "speed", type = "number" },
]

[[resources]]
name = "world-110m"
type = "json"
path = "world-110m.json"
scheme = "file"
format = "topojson"
mediatype = "text/topojson"
encoding = "utf-8"

[[resources]]
name = "zipcodes"
type = "table"
path = "zipcodes.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "zip_code", type = "integer" },
    { name = "latitude", type = "number" },
    { name = "longitude", type = "number" },
    { name = "city", type = "string" },
    { name = "state", type = "string" },
    { name = "county", type = "string" },
]

I generated the above with the following diff:

build_datapackage.py changes

diff --git a/scripts/build_datapackage.py b/scripts/build_datapackage.py
index 30834a6..88566b2 100755
--- a/scripts/build_datapackage.py
+++ b/scripts/build_datapackage.py
@@ -5,6 +5,7 @@
 # dependencies = [
 #     "frictionless[json,parquet]",
 #     "polars",
+#     "tomli-w",
 # ]
 # ///
 """
@@ -306,6 +307,19 @@ def iter_resources(data_root: Path, /) -> Iterator[Resource]:
             continue
 
 
+def to_toml(pkg: Package, fp: Path | None = None):
+    import tomli_w
+
+    mapping = pkg.to_dict()
+
+    if fp:
+        fp.touch()
+        with fp.open("wb") as f:
+            tomli_w.dump(mapping, f)
+    else:
+        return tomli_w.dumps(mapping)
+
+
 def main(
     *,
     stem: str = "datapackage",
@@ -333,6 +347,7 @@ def main(
         p = (repo_dir / f"{stem}.yaml").as_posix()
         logger.info(f"Writing {p!r}")
         pkg.to_yaml(p)
+    to_toml(pkg, repo_dir / f"{stem}.toml")
 
 
 if __name__ == "__main__":

countries.json

Below is what adding the sources and license from your reply would look like in TOML:

toml

[[resources]]
name = "countries"
type = "table"
path = "countries.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"
description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "_comment", type = "string" },
    { name = "year", type = "integer" },
    { name = "fertility", type = "number" },
    { name = "life_expect", type = "number" },
    { name = "n_fertility", type = "number" },
    { name = "n_life_expect", type = "number" },
    { name = "country", type = "string" },
]

[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.licenses]]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

And then converted back to json

json

{
  "resources": [
    {
      "name": "countries",
      "type": "table",
      "path": "countries.json",
      "scheme": "file",
      "format": "json",
      "mediatype": "text/json",
      "encoding": "utf-8",
      "description": "This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to \"show people the big picture\" rather than support detailed numeric analysis.\r\n",
      "dialect": {
        "json": {
          "keyed": true
        }
      },
      "schema": {
        "fields": [
          {
            "name": "_comment",
            "type": "string"
          },
          {
            "name": "year",
            "type": "integer"
          },
          {
            "name": "fertility",
            "type": "number"
          },
          {
            "name": "life_expect",
            "type": "number"
          },
          {
            "name": "n_fertility",
            "type": "number"
          },
          {
            "name": "n_life_expect",
            "type": "number"
          },
          {
            "name": "country",
            "type": "string"
          }
        ]
      },
      "sources": [
        {
          "title": "Gapminder Foundation - Life Expectancy",
          "path": "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676",
          "version": "v14"
        },
        {
          "title": "Gapminder Foundation - Fertility",
          "path": "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676",
          "version": "v14"
        }
      ],
      "licenses": [
        {
          "name": "CC-BY-4.0",
          "path": "https://www.gapminder.org/free-material/",
          "title": "Creative Commons Attribution 4.0 International"
        }
      ]
    }
  ]
}

Suggestion

Match the Package schema exactly.
But require only "name", "path", and the extrinsic fields ("sources", "licenses", "description", ...) in a [[resources]] table.

Then we can just merge in the intrinsic metadata by matching on the path.
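A rough sketch of what that merge could look like, assuming both sides have already been loaded as lists of plain dicts. `merge_by_path` is a hypothetical helper, and letting the hand-written values win on overlap is an assumption made for this example.

```python
from typing import Any

Resource = dict[str, Any]


def merge_by_path(intrinsic: list[Resource], extrinsic: list[Resource]) -> list[Resource]:
    """Attach hand-written fields to detected resources that share a ``path``."""
    by_path = {res["path"]: res for res in extrinsic}
    merged = []
    for res in intrinsic:
        extra = by_path.get(res["path"], {})
        # Assumption: on overlapping keys the hand-written value wins.
        merged.append({**res, **extra})
    return merged


# Detected by the build script (intrinsic metadata).
intrinsic = [
    {"name": "countries", "path": "countries.json", "format": "json"},
    {"name": "points", "path": "points.json", "format": "json"},
]
# Hand-written in the TOML file (extrinsic metadata).
extrinsic = [
    {
        "name": "countries",
        "path": "countries.json",
        "licenses": [{"name": "CC-BY-4.0"}],
    },
]

merged = merge_by_path(intrinsic, extrinsic)
```

Resources without a hand-written entry (like `points` here) pass through unchanged, so datasets that are still missing sources would not break the build.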

@dangotbanned
Member

dangotbanned commented Dec 1, 2024

Looks good. Why are some entries in single [ and some in double [[?

@domoritz interesting timing on this - hopefully the examples in (#634 (comment)) can explain the nesting.

I'm not sure @dsmedia's sample would translate directly into the target schema.

This is what SOURCES.toml converts to

{
  "package": {
    "license": {
      "name": "BSD-3-Clause",
      "path": "https://opensource.org/license/bsd-3-clause",
      "title": "The 3-Clause BSD License",
      "budget": {
        "description": "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"
      }
    }
  },
  "budget": {
    "sources": {
      "title": "Office of Management and Budget (U.S.)",
      "path": "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3",
      "countries": {
        "description": "This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to \"show people the big picture\" rather than support detailed numeric analysis.\n"
      }
    }
  },
  "countries": {
    "sources": [
      {
        "title": "Gapminder Foundation - Life Expectancy",
        "path": "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676",
        "version": "v14"
      },
      {
        "title": "Gapminder Foundation - Fertility",
        "path": "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676",
        "version": "v14"
      }
    ],
    "licenses": {
      "name": "CC-BY-4.0",
      "path": "https://www.gapminder.org/free-material/",
      "title": "Creative Commons Attribution 4.0 International",
      "gapminder": {
        "description": "This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to \"show people the big picture\" rather than support detailed numeric analysis.\n"
      }
    }
  },
  "gapminder": {
    "sources": [
      {
        "title": "Gapminder Foundation - Life Expectancy",
        "path": "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676",
        "version": "v14"
      },
      {
        "title": "Gapminder Foundation - Population",
        "path": "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676",
        "version": "v7"
      },
      {
        "title": "Gapminder Foundation - Fertility",
        "path": "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676",
        "version": "v14"
      },
      {
        "title": "Gapminder Foundation - Data Geographies",
        "path": "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158",
        "version": "v2"
      }
    ],
    "licenses": {
      "name": "CC-BY-4.0",
      "path": "https://www.gapminder.org/free-material/",
      "title": "Creative Commons Attribution 4.0 International"
    }
  }
}

@dsmedia
Collaborator Author

dsmedia commented Dec 1, 2024

Got it. So something like the following?

sources.toml
"$schema" = "https://datapackage.org/profiles/2.0/datapackage.json"

# Package-level metadata using inline tables
name = "vega-datasets"
description = "Common repository for example datasets used by Vega related projects."
homepage = "http://github.com/vega/vega-datasets.git"
sources = [
    { path = "https://github.com/vega/vega-datasets/blob/next/SOURCES.md" },
]
contributors = [
    { title = "UW Interactive Data Lab", path = "http://idl.cs.washington.edu" },
]
version = "2.11.0"
created = "2024-12-01T15:50:47.863271+00:00"

[[licenses]]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"

# Resources array
[[resources]]
name = "budget"
path = "budget.json"
description = "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"

[[resources.sources]]
title = "Office of Management and Budget (U.S.)" 
path = "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"

[[resources]]
name = "countries"
path = "countries.json"
description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.licenses]]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

[[resources]]
name = "gapminder"
path = "gapminder.json"
description = """
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.sources]]
title = "Gapminder Foundation - Population"
path = "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
version = "v7"

[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.sources]]
title = "Gapminder Foundation - Data Geographies"
path = "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
version = "v2"

[[resources.licenses]]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"
  • I added a root-level package data descriptor per my understanding of the recommendations of the spec, and also hard-coded the package-level metadata in the TOML. (I assume it's better to have this here than in the generation script?)
  • @domoritz just wanted to confirm that the package-level contributor metadata in package.json is still current/complete:

"author": {
"name": "UW Interactive Data Lab",
"url": "http://idl.cs.washington.edu"

  • Quick question about the workflow - how should conflicts be handled if there's overlap between metadata in the TOML file and what's automatically detected by the build script? These might be inadvertent, or there might be cases where the intrinsic metadata doesn't generate properly from the script, and

@dangotbanned
Member

dangotbanned commented Dec 1, 2024

@dsmedia #634 (comment) looks good

and also hard-coded the package-level metadata in the TOML. (I assume it's better to have this here than in the generation script?)

Happy for this to be moved out of build_datapackage.py, except for the bits that are dynamic like:

version = "2.11.0"
created = "2024-12-01T15:50:47.863271+00:00"

Quick question about the workflow - how should conflicts be handled if there's overlap between metadata in the TOML file and what's automatically detected by the build script?

Whatever is in the .toml should have a higher precedence.
I don't think there would be any conflicts with what you have so far.

But I imagine it could be helpful to manually define the parts we discover are detected incorrectly like:

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 1229 to 1234 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 2065 to 2070 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 2091 to 2096 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 2425 to 2430 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 2551 to 2556 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "integer"
},

vega-datasets/datapackage.json

Lines 2681 to 2690 in 719c388

"schema": {
"fields": [
{
"name": "symbol",
"type": "string"
},
{
"name": "date",
"type": "string"
},
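To make the precedence idea concrete, here is a minimal sketch assuming the detected resource and the manual `.toml` entry are both plain dicts. `apply_manual_overrides` is a hypothetical helper, and the `"date"` type used for the override is only illustrative, not a claim about what the correct type for `stocks.csv` is.

```python
from typing import Any

Resource = dict[str, Any]


def apply_manual_overrides(detected: Resource, manual: Resource) -> Resource:
    """Return a new resource where values from ``manual`` take precedence.

    Top-level keys are replaced wholesale; schema fields are patched by name so
    a manual entry only needs to list the fields it wants to correct.
    """
    merged = {**detected, **{k: v for k, v in manual.items() if k != "schema"}}
    manual_fields = {f["name"]: f for f in manual.get("schema", {}).get("fields", [])}
    if manual_fields:
        schema = detected.get("schema", {})
        # Manual fields that were never detected are ignored in this sketch.
        fields = [{**f, **manual_fields.get(f["name"], {})} for f in schema.get("fields", [])]
        merged["schema"] = {**schema, "fields": fields}
    return merged


detected = {
    "name": "stocks",
    "path": "stocks.csv",
    "schema": {
        "fields": [
            {"name": "symbol", "type": "string"},
            {"name": "date", "type": "string"},
            {"name": "price", "type": "number"},
        ]
    },
}
# Hypothetical manual entry: only the mis-detected field is listed.
manual = {
    "name": "stocks",
    "path": "stocks.csv",
    "schema": {"fields": [{"name": "date", "type": "date"}]},
}

result = apply_manual_overrides(detected, manual)
```

Patching by field name keeps the manual file short, at the cost of silently dropping typos in field names; a stricter version could raise when a manual field does not match any detected one.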

@domoritz
Member

domoritz commented Dec 2, 2024

For the package authors, we might want to change all of our packages to be "the Vega organization" or something like that. Can we do that separately from this?
