Check how best to make use of git branches for a palaeographic snapshot of the TEI corpus #65

Open
geoffroy-noel-ddh opened this issue Jun 3, 2024 · 4 comments
Labels: data, enhancement (New feature or request), MUST

Comments

@geoffroy-noel-ddh
Member

geoffroy-noel-ddh commented Jun 3, 2024

Requirements:

  1. We want to create a snapshot of the TEI corpus for palaeographic annotation.
  2. The original branch will keep being edited over time.
  3. The snapshot will remain the same, ensuring that existing annotations are not invalidated.
  4. Ideally we'd like to snapshot only a subset of the corpus at first, then gradually add more files to it from the main branch. This matches the batch upload currently done on a regular basis to the annotating environment.
  5. We may want to apply corrections from the main branch to the snapshot, selectively for some files.
  6. We may also want to update the entire snapshot at some point in the future to realign it with the main branch; this may introduce breaking changes to the annotations (e.g. a change in token ID or text structure).
  7. Important assumption: all changes to a snapshot always come from the main branch. There are no direct manual edits to the snapshot (e.g. applying a correction manually or adding more information to a file). The snapshot is read-only.

geoffroy-noel-ddh self-assigned this Jun 3, 2024
geoffroy-noel-ddh added this to the Top 3 priorities milestone Jun 3, 2024
geoffroy-noel-ddh added the enhancement, MUST and data labels Jun 3, 2024

@geoffroy-noel-ddh
Member Author

geoffroy-noel-ddh commented Jun 19, 2024

Available operations using git:

Those two operations are very simple and safe. The second may require passing a lot of file names in case of a bulk update. If helpful, this could possibly be scripted.

Important caveat to the second operation: this is not a git merge but a blunt overwrite. The series of commits from main, with their messages, dates and authors, is lost when the changes are flattened into a single overwrite.
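
A minimal sketch of how the bulk update could be scripted. It assumes the second operation is a per-file `git checkout <source> -- <paths>`, which overwrites the files on the current branch without carrying over any history. The branch names `palaeo` and `main`, the helper names and the example paths are all hypothetical. The sketch only builds the command lines and prints them by default, so they can be reviewed before anything runs.

```python
import subprocess
from typing import List, Optional, Sequence


def snapshot_update_commands(
    files: Sequence[str],
    snapshot_branch: str = "palaeo",  # hypothetical snapshot branch name
    source_branch: str = "main",      # hypothetical editorial branch name
    message: Optional[str] = None,
) -> List[List[str]]:
    """Build the git commands that overwrite `files` on the snapshot branch
    with their current versions from the source branch."""
    paths = list(files)
    if not paths:
        raise ValueError("at least one file path is required")
    if message is None:
        message = f"Update {len(paths)} file(s) from {source_branch}"
    return [
        # Switch to the snapshot branch.
        ["git", "checkout", snapshot_branch],
        # Overwrite the listed files with the versions on the source branch.
        # This is not a merge: no commit history is carried over.
        ["git", "checkout", source_branch, "--", *paths],
        # Record the overwrite as a single commit on the snapshot branch.
        # Note: git exits with an error if none of the files actually changed.
        ["git", "commit", "-m", message],
    ]


def run_commands(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(list(argv), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    run_commands(snapshot_update_commands(["corpus/file1.xml", "corpus/file2.xml"]))
```

Because the commit on the snapshot branch is a flat overwrite, a commit message naming the source branch (and, if wanted, the source commit) is the only trace of where the change came from.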

@geoffroy-noel-ddh
Member Author

geoffroy-noel-ddh commented Jun 20, 2024

@simonastoyanova & @JonPrag I've tested the git solution above and I think it should work for your needs. The main caveat is that when the snapshot is updated (in bulk or not), the history of changes is not ported to that branch. The changes coming from the editorial branch are flattened, without their metadata.

But that should be acceptable if the snapshot is never edited manually and the files in the main branch contain last modification dates or a version number.

@simonastoyanova
Collaborator

That works for me. The changes will be traceable through the main branch if someone wants to double-check a new reading. The snapshot won't be edited manually: if I correct a reading based on palaeographic analysis, I will change the original file in the main branch and port it back to the snapshot (I don't expect this to happen often, but it has happened once already, so it is useful as a scenario).

@simonastoyanova
Collaborator

Some further comments after our meeting today:

  1. main branch: tokenise all files asap and apply sequential 5+ ids in @n on each token
  2. palaeo branch: replace the old ids in @xml:id with the new ones in @n
  3. palaeo branch: replace the old ids in the annotation files
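
Step 2 could be sketched as follows, under several assumptions that are not confirmed in this thread: tokens are TEI `<w>` elements, the new id is copied verbatim from `@n`, and the function name is hypothetical. The returned old-to-new mapping is what step 3 would need to rewrite the ids in the annotation files, whatever their format.

```python
import xml.etree.ElementTree as ET
from typing import Dict

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def replace_ids_with_n(root: ET.Element, token_tag: str = "w") -> Dict[str, str]:
    """Copy @n into @xml:id on every token element, in place.

    Returns a mapping {old xml:id: new xml:id} for tokens that had both.
    Assumption: tokens are <w> elements in the TEI namespace.
    """
    mapping: Dict[str, str] = {}
    for token in root.iter(f"{{{TEI_NS}}}{token_tag}"):
        new_id = token.get("n")
        if new_id is None:
            # Token not yet numbered on the main branch: leave it untouched.
            continue
        old_id = token.get(XML_ID)
        if old_id is not None:
            mapping[old_id] = new_id
        # Caution: an xml:id must be a valid XML name and cannot start with
        # a digit, so purely numeric @n values would need a prefix here.
        token.set(XML_ID, new_id)
    return mapping


if __name__ == "__main__":
    # Hypothetical ids and content, for illustration only.
    sample = (
        f'<TEI xmlns="{TEI_NS}">'
        '<w xml:id="old-1" n="t00005">lorem</w>'
        '<w xml:id="old-2" n="t00010">ipsum</w>'
        "</TEI>"
    )
    tree_root = ET.fromstring(sample)
    print(replace_ids_with_n(tree_root))
```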

In terms of changes being ported from the main branch to the palaeo branch, the hashing system should work after some testing and refinement. What exactly it should recognise (a change of id, a change of token, etc.) is still to be decided and tried.
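
The hashing system is not described in this thread, so the following is only one possible shape for it, with every name hypothetical. It hashes the whitespace-normalised text of each token, keyed by token id (`@n` on the main branch, `@xml:id` on the palaeo branch, following the steps above), and reports which ids are unchanged, changed, removed or added. It again assumes tokens are TEI `<w>` elements.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict, List

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def token_hashes(root: ET.Element, id_attr: str, token_tag: str = "w") -> Dict[str, str]:
    """Map each token id to a SHA-256 digest of its normalised text."""
    hashes: Dict[str, str] = {}
    for token in root.iter(f"{{{TEI_NS}}}{token_tag}"):
        token_id = token.get(id_attr)
        if token_id is None:
            continue
        # Collapse whitespace so that re-indentation is not seen as a change.
        text = " ".join("".join(token.itertext()).split())
        hashes[token_id] = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return hashes


def compare_tokens(snapshot: Dict[str, str], main: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify token ids by comparing snapshot hashes with main hashes."""
    report: Dict[str, List[str]] = {
        "unchanged": [], "text_changed": [], "removed": [], "added": [],
    }
    for token_id, digest in snapshot.items():
        if token_id not in main:
            report["removed"].append(token_id)
        elif main[token_id] == digest:
            report["unchanged"].append(token_id)
        else:
            report["text_changed"].append(token_id)
    report["added"] = [token_id for token_id in main if token_id not in snapshot]
    return report


if __name__ == "__main__":
    # Hypothetical ids and content, for illustration only.
    palaeo = ET.fromstring(
        f'<TEI xmlns="{TEI_NS}"><w xml:id="t5">lorem</w><w xml:id="t10">ipsum</w></TEI>'
    )
    editorial = ET.fromstring(
        f'<TEI xmlns="{TEI_NS}"><w n="t5">lorem</w><w n="t10">ipsvm</w></TEI>'
    )
    print(compare_tokens(token_hashes(palaeo, XML_ID), token_hashes(editorial, "n")))
```

A token whose id changed but whose text did not shows up here as one "removed" and one "added" entry; pairing those by equal hash would be a further refinement, and is ambiguous when the same word occurs more than once.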

geoffroy-noel-ddh removed this from the Top 3 priorities milestone Sep 25, 2024