Deprecate Nested RecordSets in favor of repeated subField #750

Open
benjelloun opened this issue Sep 27, 2024 · 0 comments

The Croissant spec allows nesting RecordSets inside RecordSets by using a field with dataType="cr:RecordSet":

https://docs.mlcommons.org/croissant/docs/croissant-spec.html#nested-records

This mechanism has not been used much, is not supported in the mlcroissant library, and adds unneeded complexity.

Instead, we propose using the existing subField mechanism, and specifying repeated=true to represent multiple records.

Here is an example based on the one in the above documentation:

{
  "@type": "cr:RecordSet",
  "@id": "movies_with_ratings",
  "key": { "@id": "movies_with_ratings/movie_id" },
  "field": [
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/movie_id",
      "source": { "@id": "movies/movie_id" },
      "references": { "@id": "ratings/movie_id" }
    },
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/movie_title",
      "source": { "@id": "movies/title" }
    },
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/ratings",
      "repeated": true,
      "subField": [
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/user_id",
          "source": { "@id": "ratings/user_id" }
        },
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/rating",
          "source": { "@id": "ratings/rating" }
        },
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/timestamp",
          "source": { "@id": "ratings/timestamp" }
        }
      ]
    }
  ]
}

Note that using a repeated field with subFields also lets us drop the cumbersome "parentField" property of the previous syntax. Instead, the join with the underlying ratings table is specified by the "references" property on the "movie_id" field.
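
To make the intended record shape concrete, here is a minimal sketch in plain Python of what the join could produce. It does not use mlcroissant; the sample rows, the `join_movies_with_ratings` helper, and the choice to keep movies that have no ratings (with an empty list) are all assumptions for illustration, not behavior defined by the spec.

```python
from collections import defaultdict

# Hypothetical rows standing in for the "movies" and "ratings" RecordSets.
movies = [
    {"movie_id": 1, "title": "Movie A"},
    {"movie_id": 2, "title": "Movie B"},
    {"movie_id": 3, "title": "Movie C"},
]
ratings = [
    {"movie_id": 1, "user_id": 10, "rating": 4.0, "timestamp": 1000},
    {"movie_id": 1, "user_id": 11, "rating": 5.0, "timestamp": 1001},
    {"movie_id": 2, "user_id": 10, "rating": 3.5, "timestamp": 1002},
]


def join_movies_with_ratings(movies, ratings):
    """Group ratings by movie_id and attach them as a repeated sub-record.

    Illustrative helper only: it mirrors the "references" join on movie_id,
    with each entry of "ratings" carrying the three subFields.
    """
    ratings_by_movie = defaultdict(list)
    for r in ratings:
        ratings_by_movie[r["movie_id"]].append(
            {
                "user_id": r["user_id"],
                "rating": r["rating"],
                "timestamp": r["timestamp"],
            }
        )
    return [
        {
            "movie_id": m["movie_id"],
            "movie_title": m["title"],
            # Assumption: a movie without ratings gets an empty list.
            "ratings": ratings_by_movie.get(m["movie_id"], []),
        }
        for m in movies
    ]


records = join_movies_with_ratings(movies, ratings)
for record in records:
    print(record["movie_title"], len(record["ratings"]))
```

Each output record corresponds to one movie, and the "ratings" value is a list with one entry per matching row of the ratings table, which is what repeated=true on a field with subFields is meant to express.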

@benjelloun benjelloun converted this from a draft issue Sep 27, 2024