Implement version tracking system for CanProCo data #86
Comments
Thanks Julien. In the last couple of days we have changed our BIDS data structure to be centralized and contain both the M0 and M12 timepoints, instead of keeping separate M0 and M12 storage locations (and having to send a separate zip file for each timepoint). This gives us a single data directory that can be tracked with git-annex, as you suggested. In the coming days, I will send the M0-M12 data in the new structure over UBC OneDrive and add you to its git-annex repository on GitHub, so any future changes can be tracked. Please let us know how this sounds.
Fantastic! Thank you @aman-s
If the repo is tracked with git-annex and you can add us to it, why send the data via UBC's OneDrive? It defeats the purpose of getting the data directly via git-annex, no?
@jcohenadad UBC doesn't allow us to put the actual data files on GitHub, so we were planning to keep only the file pointers and hashes of the zipped data files on GitHub, for tracking purposes. We are also open to any other suggestions that would help your team streamline the process!
I think there is a misunderstanding. I did not suggest that I would pull the binary files from GH, but that I would clone/checkout/pull the repository that contains the pointers to the actual data (which is the main difference between git and git-annex, i.e., the repo points to where the data are). The repo could be public (e.g., on GH), while the data require a token to be fetched. See e.g. https://docs.cneuromod.ca/en/latest/ACCESS.html#versioning Alternatively, the git-annex repo could be hosted on your servers, and SSH permissions could be given to external collaborators to fetch the data via git-annex commands (git-annex is compatible with the SSH protocol).
From my understanding, the original purpose of having a GH git-annex repository for the CanProCo spinal data was to track file changes and past errors that have been fixed retrospectively. Without version control, retrospective fixes made at UBC were prone to being lost in communication; we are hoping git-annex will help with this by allowing you to run checksums on the data you receive. A second benefit would be that your team could pull specific versions of the dataset. However, setting up a token to fetch data directly over SSH, or hosting a git server, requires more discussion with our lab PIs and IT. Currently, the approved method of data transfer is UBC OneDrive, with the UBC team periodically sending updated data, but we can think about direct git-annex pulls in the future. Perhaps I can share the new GH git-annex repository (for version tracking of the files) and the new combined M0 data structure, so you get a better sense of how it can offer you checksums for past files? We can then continue our discussion about adding remotes to git-annex in the future. Let me know how that sounds!
Another useful use case is for researchers to contribute to the dataset, eg, with manual segmentations. Being able to push those segmentations would benefit other researchers (like @leelisae). The modus operandi could be:
From experience, this will save a lot of trouble (i.e., it minimizes human error and systematizes procedures). One interesting avenue would be to host the repo on OneDrive. Some people have done that. A few relevant links:
Context
Note
Conversation started via email, but redirected here for transparency and easy cross-referencing. Everyone please feel free to contribute to the discussion!
There have been multiple conversations about issues with the dataset (eg: the exact same images being labeled as M0 and M12: #39) that are being fixed locally, without knowing whether the error is also being fixed at the source and at the institutions using the data for analysis.
In addition, some of the issues are reported to us from a site other than the source site (example: #13), so we end up fixing things on our internal server without knowing whether the exact same corrections are also being made at the source site.
Problem: Given that the dataset is not being synced across the multiple user sites, we end up with multiple versions of the dataset that are not being tracked, potentially leading to errors and lack of reproducibility.
Solutions
We should look at ways to version-track the data and its usage across all the user sites. The earlier the better (as time passes, errors accumulate, making it increasingly difficult to reconstruct the history).
Track source dataset with git-annex
git-annex is a popular git-based technology for version-tracking datasets. It is notably used by DataLad, a reference tool in the neuroimaging community for sharing data and performing reproducible science. An excellent solution would be to convert the source repo into a git-annex repo, and make modifications with regular git commit/push, which are trackable.
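As a rough illustration of what this could look like, here is a minimal sketch using standard git and git-annex commands. The function names, directory arguments, and tag names are hypothetical placeholders, not an agreed procedure, and the last function assumes the clone has access to a remote that actually holds the file content.

```shell
# Source site: turn an existing dataset directory into a git-annex repo
# and tag the first tracked version.
# $1 = dataset directory (hypothetical), $2 = version tag, e.g. v1.0
init_dataset() (
  cd "$1" || exit 1
  git init
  git annex init "source site"      # the description string is arbitrary
  git annex add .                   # file content goes to the annex, pointers to git
  git commit -m "Initial import"
  git tag "$2"
)

# Source site: record a retrospective fix as a new, trackable version.
# $1 = dataset directory, $2 = commit message, $3 = new version tag
record_fix() (
  cd "$1" || exit 1
  git annex add .                   # pick up new or replaced files
  git commit -a -m "$2"             # -a also records deleted files
  git tag "$3"
)

# User site: clone the pointers, check out one version, then fetch its content.
# $1 = repository URL or path, $2 = local directory, $3 = version tag
fetch_version() (
  git clone "$1" "$2" || exit 1
  cd "$2" || exit 1
  git checkout "$3"
  git annex get .                   # needs a reachable remote holding the content
)
```

With something like this, a fix such as the one discussed in #39 would become one commit and one tag at the source, and every user site could state exactly which tag it analyzed.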
Several levels of permissions are possible:
git-annex checkout/pull a specific version of the repos. Pros: less manual work to distribute the data; cons: possible security issues from the source IT management team.
I think that option 1 is the most realistic/reasonable given the IT context.
Create manual checksum
If a git-annex repo is not possible, or while it is being implemented, a "quick and dirty" solution is for the source site to create a checksum of all files in the dataset (recursively), which could be done with:
find * -type f -exec shasum -b -a 256 {} \; > CANPROCO_vX.Y
The sums can then be verified by collaborating sites with:
shasum -c -a 256 CANPROCO_vX.Y
Additional usage
We should also consider that other sites might contribute to the dataset, eg, with manual labels of segmentations. Using git-annex would be a means to push the segmentations to the source repository, so that it could also serve other sites for analysis.
Resource
Examples of multi-site data managed with git-annex