Make file changes visible in the manifest of the parquet exports #1683

Open
2 tasks
manuelwedler opened this issue Oct 2, 2024 · 0 comments

Comments

@manuelwedler
Collaborator

We received more feedback on Matrix about our parquet export:

I also wonder: it looks to me like the respective last parquet file will keep increasing content-wise until it's "full", is that right? If that's true, it would be nice to be able to avoid having to redownload everything if it didn't change - something like hashes in the manifest or such (but there's probably a better way even)
Although I guess I could also include the last file downloaded and redownload that as well when updating... everything except the last one I guess shouldn't change over time (short of schema changes :-))?

It would be nice to be able to see in the manifest if a file changed after the last download.
Possible additions to the manifest to achieve this:

  • schema version
  • timestamp of the last change of the file
  • parquet file hash
  • row count
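As a rough sketch of what such manifest entries could look like and how a consumer might use them: the field names (`path`, `schema_version`, `last_modified`, `sha256`, `row_count`), the `SCHEMA_VERSION` value, and the helper functions below are all hypothetical and not taken from the current export code. The row count is passed in by the caller, on the assumption that the export job already knows it when it writes the file.

```python
import hashlib
import os
from datetime import datetime, timezone

# Hypothetical schema version; the real value would come from the export job.
SCHEMA_VERSION = "1"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large parquet files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entry(path, row_count):
    """Build one manifest entry carrying the four proposed fields (names are assumptions)."""
    stat = os.stat(path)
    last_modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "path": os.path.basename(path),
        "schema_version": SCHEMA_VERSION,
        "last_modified": last_modified.isoformat(),
        "sha256": sha256_of_file(path),
        "row_count": row_count,
    }


def files_to_download(remote_entries, local_hashes):
    """Return the paths whose hash is unknown locally or differs from the manifest."""
    return [
        entry["path"]
        for entry in remote_entries
        if local_hashes.get(entry["path"]) != entry["sha256"]
    ]
```

With entries like these, a consumer would keep a mapping of file name to the hash it last downloaded and fetch only what `files_to_download` returns, instead of redownloading the whole export.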

I am also not sure whether the parquet files are generated incrementally and in order. It would be good to look into this and to check whether rows are only ever added to the last file.

Tasks

  • Investigate if parquet files are only updated incrementally
  • Implement generation of above manifest fields
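One possible way to approach the first task, sketched under assumptions: take two manifest snapshots from consecutive export runs and check that only the last file of the older snapshot changed. The entry shape (a list of dicts with `path` and `sha256` keys, listed in generation order) and both function names are hypothetical, not the existing manifest format.

```python
def changed_files(old_entries, new_entries):
    """Paths in the new snapshot that are new or whose hash differs from the old one."""
    old_hashes = {entry["path"]: entry["sha256"] for entry in old_entries}
    return [
        entry["path"]
        for entry in new_entries
        if old_hashes.get(entry["path"]) != entry["sha256"]
    ]


def is_append_only(old_entries, new_entries):
    """Check that an export run only touched the last old file or added new files.

    Assumes entries are listed in generation order, so the last element of
    old_entries is the file that was still being filled.
    """
    new_hashes = {entry["path"]: entry["sha256"] for entry in new_entries}
    if not old_entries:
        return True
    # Every file except the last old one must be present and byte-identical.
    for entry in old_entries[:-1]:
        if new_hashes.get(entry["path"]) != entry["sha256"]:
            return False
    # The last old file may have grown, but it must still exist.
    return old_entries[-1]["path"] in new_hashes
```

Running `is_append_only` over a few consecutive exports would give a first indication of whether consumers can safely skip everything except the last file they downloaded.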

Related

#1668

Projects
Status: Backlog