Add model output file schema repair functionality #92

Open
3 of 4 tasks
annakrystalli opened this issue Jul 1, 2024 · 0 comments


annakrystalli commented Jul 1, 2024

! Unsure whether this should live here or in hubAdmin

Background

For a hub to be successfully accessed as an arrow dataset, column data types should not change from round to round.
Task IDs that are covered by our schema generally shouldn't change data type in later rounds, since the schema largely fixes their types. Custom task IDs, however, are beyond our control, and both they and the output_type_id column can change data type between rounds, which could cause problems downstream. This is mainly a problem for parquet files, though there is a small chance of it causing problems in CSVs too.
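To make the problem concrete, here is a minimal sketch of how such a discrepancy could be detected. It assumes each model output file's column types have already been read into a plain mapping; the file paths, column names, and the `find_type_conflicts` helper are all hypothetical and not part of any hubverse package.

```python
from collections import defaultdict


def find_type_conflicts(file_schemas):
    """Return {column: {dtype: [files]}} for columns seen with more than one dtype.

    `file_schemas` maps a file path to a {column: dtype} mapping (hypothetical input).
    """
    seen = defaultdict(lambda: defaultdict(list))
    for path, schema in file_schemas.items():
        for column, dtype in schema.items():
            seen[column][dtype].append(path)
    # Keep only the columns whose type is not the same in every file.
    return {
        column: {dtype: sorted(paths) for dtype, paths in by_type.items()}
        for column, by_type in seen.items()
        if len(by_type) > 1
    }


# Hypothetical example: output_type_id is numeric in one round and character in the next.
file_schemas = {
    "team-a/2024-01-06-team-a.parquet": {
        "location": "string",
        "output_type_id": "double",
        "value": "double",
    },
    "team-a/2024-01-13-team-a.parquet": {
        "location": "string",
        "output_type_id": "string",
        "value": "double",
    },
}

conflicts = find_type_conflicts(file_schemas)
for column, by_type in conflicts.items():
    print(column, by_type)
```

In a real hub the per-file types would presumably come from the parquet metadata rather than a hand-written dictionary; the sketch only illustrates the comparison step.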

To reduce the chances of this happening and to mitigate its effects, a number of actions have been proposed:

Add model output file schema repair functionality

As a future feature, once we have created functionality to inspect a hub for integrity and the following have all been implemented:

we could also add functionality to repair any data type discrepancies and update files to conform to a changed schema. This could help admins in a situation where all of the above fail and a breaking schema change has to be introduced.
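As a rough illustration of what such a repair step might do, the sketch below casts the columns of already-loaded rows to a target schema. Everything here is an assumption for illustration: the `repair_rows` helper, the type names, and the in-memory row format are hypothetical, and an actual implementation would more likely read, cast, and rewrite the files with arrow.

```python
def repair_rows(rows, target_schema):
    """Return copies of `rows` with columns cast to the types in `target_schema`.

    `rows` is a list of dicts; `target_schema` maps column -> "string" | "double" | "integer".
    Missing values (None) and columns absent from a row are left untouched.
    """
    casters = {"string": str, "double": float, "integer": int}
    repaired = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for column, dtype in target_schema.items():
            value = new_row.get(column)
            if value is not None:
                new_row[column] = casters[dtype](value)
        repaired.append(new_row)
    return repaired


# Hypothetical example: output_type_id was written as numeric but the
# changed schema now expects character values.
rows = [
    {"output_type_id": 0.25, "value": 10},
    {"output_type_id": None, "value": 12.5},
]
target_schema = {"output_type_id": "string", "value": "double"}

repaired = repair_rows(rows, target_schema)
print(repaired)
```

A cast that cannot be performed (for example, a non-numeric string to `"double"`) raises an error here rather than being silently dropped, which seems the safer default for a repair tool.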

Status: Todo