Add model output file schema repair functionality #92

annakrystalli · 2024-07-01T08:04:01Z

Note: Unsure whether this should live here or in hubAdmin.

Background

For a hub to be successfully accessed as an arrow dataset, column data types should not change from round to round.
Task IDs that are covered by our schema generally shouldn't change data type in later rounds, as their types are largely fixed by the schema. Custom task IDs, however, are beyond our control, and both they and the output_type_id column have the potential to change data type, which could cause problems downstream. This is mainly a problem for parquet files (but has a small chance of causing problems in CSVs too).
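To make the failure mode concrete, here is a minimal sketch of how the output_type_id column can end up with different inferred types in two rounds. It uses only the Python standard library rather than arrow; the helpers `infer_type` and `infer_schema`, the type names, and the sample data are all hypothetical and only illustrate the idea, not how hubverse packages actually infer types.

```python
import csv
import io

def infer_type(values):
    """Return 'double' if every non-missing value parses as a number, else 'character'."""
    non_missing = [v for v in values if v not in ("", "NA")]
    if not non_missing:
        return "null"
    try:
        for v in non_missing:
            float(v)
    except ValueError:
        return "character"
    return "double"

def infer_schema(csv_text):
    """Infer a {column: type} mapping from the text of one CSV file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: infer_type([r[col] for r in rows]) for col in columns}

# Made-up submissions: round 1 has only numeric output_type_id values,
# round 2 introduces text categories in the same column.
round_1 = "output_type,output_type_id,value\nquantile,0.25,10\nquantile,0.75,20\n"
round_2 = "output_type,output_type_id,value\npmf,low,0.4\npmf,high,0.6\n"

schema_1 = infer_schema(round_1)
schema_2 = infer_schema(round_2)

# Columns whose inferred type differs between the two rounds.
mismatched = {
    col: (schema_1[col], schema_2[col])
    for col in schema_1
    if schema_1[col] != schema_2.get(col)
}
print(mismatched)
```

Only output_type_id is reported here, because it is numeric in the first file and text in the second. With parquet the type is stored in the file itself, so a mismatch of this kind cannot be smoothed over at read time the way it sometimes can with CSVs.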

To reduce the chances of this happening and to mitigate the effects, a number of actions have been proposed:

Improve the documentation on this: get admins to think about the issue early on and warn them to avoid changes in data types.
Propagate the ability to fix the output_type_id column data type to hubValidations (Introduce output_type_id_datatype argument across relevant validate_*() fns #91) and consider a property in the schema where hub admins can configure and communicate this setting (Introduce a property to fix the output_type_id column data type across the hub schemas#87).

Add model output file schema repair functionality

As a future feature, once we have created functionality to inspect a hub for integrity and the following have all been implemented:

we could also add functionality to repair any data type discrepancies and update files to conform to a changed schema. This could help admins in a situation where all of the above fail and a breaking schema change needs to be introduced.
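As a rough illustration of what such a repair step might look like, the sketch below casts the columns of already-loaded rows to a target schema. Everything in it is an assumption made for illustration: `repair_rows`, the `CASTERS` mapping and the type names are hypothetical, and a real implementation would presumably read and rewrite the model output files themselves rather than work on in-memory dictionaries.

```python
# Hypothetical mapping from schema type names to Python casting functions.
CASTERS = {"character": str, "double": float, "integer": int}

def repair_rows(rows, target_schema):
    """Return copies of rows with the columns in target_schema cast to the requested type.

    Missing values (None) are left untouched. A value that cannot be cast
    raises ValueError, so a failed repair is reported instead of silently
    writing bad data.
    """
    repaired = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows are not modified
        for col, type_name in target_schema.items():
            value = row.get(col)
            new_row[col] = None if value is None else CASTERS[type_name](value)
        repaired.append(new_row)
    return repaired

# Made-up rows from an earlier round where output_type_id was stored as a number.
old_rows = [
    {"output_type": "quantile", "output_type_id": 0.25, "value": 10.0},
    {"output_type": "quantile", "output_type_id": 0.75, "value": 20.0},
]

# Assumed breaking change: output_type_id must now be character across the hub.
target = {"output_type_id": "character"}

fixed = repair_rows(old_rows, target)
print(fixed)
```

Columns not named in the target schema pass through unchanged, which keeps the repair limited to the columns affected by the schema change.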

annakrystalli added this to the robust-hub-schema milestone Jul 1, 2024

annakrystalli added this to hubverse Development overview Jul 1, 2024

github-project-automation bot moved this to Todo in hubverse Development overview Jul 1, 2024

annakrystalli added the hub-integrity-check label Jul 2, 2024

annakrystalli mentioned this issue Aug 5, 2024

143 / Update docs to reflect new schema version v3.0.1 hubverse-org/hubDocs#163

Merged

2 tasks
