Localization change tracking for fluent #89
Okay, I have a proposal about how this change tracking could work.

**Requirements**

**Design**

I would propose that we have an associated JSON file per translated file, or perhaps a single JSON file per language with a subsection per file. This file contains an entry per message in the translated files, with at least the following data: a hash of the primary file (the file hash) and a git commit hash (the git hash); see below for how these are used.
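To make the shape concrete, the per-language variant might look something like the sketch below. The field names, paths, and hash values here are all hypothetical; the file hash and git hash entries are explained in the steps that follow.

```json
{
  "fi/main.ftl": {
    "welcome-message": {
      "file-hash": "119142f1bf27dcb9e059495206c64c404db90af4",
      "git-hash": "4344ad1"
    }
  }
}
```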
Terminology:
To detect whether a message is currently in sync, the following steps are taken:

```mermaid
flowchart TD
    IS["In Sync"]
    OSC["Out of Sync (Changed)"]
    OSR["Out of Sync (Renamed)"]
    S[Start] --> E{"Does the hash of primary\nmatch the file hash?"}
    E --> |Yes| IS
    E --> |No| F{"Does the message in primary match\nprimary checked out at git hash?"}
    F --> |Yes| IS
    F --> |No| B{"Does the git hash match LCP?"}
    B --> |Yes| IS
    B --> |No| C{"Does the message in the\nprimary file checked out\nat git hash match the\nsame message at LCP?"}
    C --> |Yes| IS
    C --> |No| D{"In the commit after git hash\nwhere the message was deleted from\nprimary file, are there any new messages\nmatching the message in git hash?"}
    D --> |Yes| OSR
    D --> |No| OSC
```
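As a rough sketch of how this decision procedure could look in code (all helper functions here are hypothetical stand-ins for the real file and VCS operations):

```rust
/// Sync status of one translated message, mirroring the flowchart above.
enum SyncStatus {
    InSync,
    OutOfSyncChanged,
    OutOfSyncRenamed,
}

/// The tracking data stored per message (names are illustrative).
struct MessageRecord {
    /// Hash of the primary file at the time the translation was confirmed.
    file_hash: String,
    /// Git commit hash the translation was confirmed against.
    git_hash: String,
}

fn sync_status(record: &MessageRecord, key: &str, lcp: &str) -> SyncStatus {
    // 1. Does the hash of primary match the file hash?
    if hash_of_primary_file() == record.file_hash {
        return SyncStatus::InSync;
    }
    // 2. Does the message in primary match primary checked out at git hash?
    if message_in_primary(key) == message_at(&record.git_hash, key) {
        return SyncStatus::InSync;
    }
    // 3. Does the git hash match LCP?
    if record.git_hash == lcp {
        return SyncStatus::InSync;
    }
    // 4. Does the message at git hash match the same message at LCP?
    if message_at(&record.git_hash, key) == message_at(lcp, key) {
        return SyncStatus::InSync;
    }
    // 5. In the commit after git hash where the message was deleted from the
    //    primary file, does any new message match it? Then it was renamed.
    if message_was_renamed_after(&record.git_hash, key) {
        SyncStatus::OutOfSyncRenamed
    } else {
        SyncStatus::OutOfSyncChanged
    }
}

// Stand-in helpers; a real implementation would read the working tree and
// query the VCS for historical file versions.
fn hash_of_primary_file() -> String { unimplemented!() }
fn message_in_primary(_key: &str) -> String { unimplemented!() }
fn message_at(_rev: &str, _key: &str) -> String { unimplemented!() }
fn message_was_renamed_after(_rev: &str, _key: &str) -> bool { unimplemented!() }
```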
To mark an item as in-sync, all that needs to happen is to set the file hash to the hash of the primary file, and the git hash to the LCP.
We will need a way for translators to interact with this system. The long term goal is to create a GUI for #83 which could be used to do this, however sending the source code and the git repository to translators may not be appropriate. There should also be a way to pre-build this information into a tree structure and export it to a JSON file that can be "edited", with a single change overlayed to be re-integrated. Without a GUI tool to perform this, perhaps a viable alternative could be to generate an Excel spreadsheet which can be sent to the translator, containing only the required changes and associated comments, along with a way to re-integrate it.
Perhaps the design could be simplified by not using git hashes, but rather by searching back through the history to find the file that matches the file hash.
Not duplicating data like checksums that could be deduced through other tooling would be a huge advantage. I do suggest looking into that. Otherwise almost the entire thing I read makes a lot of sense and sounds great. I can't wait for tooling (both CLI and GUI) to support this.
I'm not sure I understand, would you be able to elaborate on that? Do you mean, if available, try to re-use the checksums for the file history if they are already available in your version control system? Perhaps we could make the source of checksums pluggable somehow?
Sure, happy to try to explain and maybe even help work on this.
Not quite, I was actually thinking about preempting that so that it is always available ahead of time. Back to this in a second.
Yes, that would actually be ideal. Like most developers these days I'm pretty heavily invested in Git, so that was the use case I had in mind, but making the entire checksum system pluggable would make it possible to implement it within any VCS (or none). The default provider could be the Git one talked about here, while leaving room for some other way of drumming up checksums that serve a specific purpose.

So back to Git and file checksums. Your original outline included storing two checksums: one commit SHA and one file hash SHA. I understand both why the last commit is useful and why the commit hash is not available before committing (which is when you'd need to store an updated value, short of a two-commit system).

But why not have your cake and eat it too? Instead of storing either of those values, I suggest storing an object hash generated by Git. You can generate such a hash for any arbitrary file (tracked or untracked):

```console
$ date > myfile
$ git hash-object myfile
119142f1bf27dcb9e059495206c64c404db90af4
$ git add myfile
$ git commit -m "Track and commit"
[master 4344ad1] Track and commit
 1 file changed, 1 insertion(+)
 create mode 100644 myfile
$ git log --raw --all --find-object=119142f1bf27dcb9e059495206c64c404db90af4
commit 4344ad1 (HEAD -> master)
Author: Caleb Maclennan <[email protected]>
Date:   Wed Nov 22 09:47:14 2023 +0300

    Track and commit

:000000 100644 0000000 119142f A myfile
```

Hence my suggestion was to store only one checksum. In the case of Git, that one value can be the object hash, used both to identify (and retrieve) the exact file contents before or after committing, and to look up the commit history for it.
I would definitely love to collaborate with more people on this project if you have some time! We can create a diagram soon so we can agree on exactly how it should work, and it could be used in the documentation for the system later.
Ah yes, I said this in my other comment:

> Perhaps the design could be simplified by not using git hashes, but rather by searching back through the history to find the file that matches the file hash.

I was thinking perhaps to do away with any reliance on git hashing entirely and hash the file ourselves on demand, relying on git (or any other VCS) only to produce the previous versions of the file, which we can then hash with whatever hashing algorithm we want. This would be more computationally expensive, but probably not noticeably so in practice. But with your suggestion we could instead re-use git's hashing mechanism. I don't know enough about Git, but I presume the hashes of previous versions of these files are also stored somewhere in the repository.
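For what it's worth, Git does already store an object hash for every committed version of a file, and they can be listed without re-hashing anything. A rough sketch (the path is illustrative):

```console
$ # Print the commit and blob object hash of each historical version:
$ git log --format='%h' -- i18n/en/main.ftl | while read rev; do
    echo "$rev $(git rev-parse "$rev:i18n/en/main.ftl")"
  done
```

And `git show "$rev:i18n/en/main.ftl"` would produce the file contents at that revision, which could then be hashed with any algorithm we choose.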
That sounds very good to me!
Brilliant 🙂

Edit: wrong emoji!
As far as making this pluggable goes, we probably want to store what the hash scheme is along with the hash. That way tooling will know what VCS/hash system to use:

```json
{ "primary-version": { "scheme": "git-object", "hash": "119142f1bf27dcb9e059495206c64c404db90af4" } }
```

Or like:

```json
{ "primary-version": "git-object#119142f1bf27dcb9e059495206c64c404db90af4" }
```

I'd go for the former myself, but then I'd also avoid JSON like the plague if I had my druthers. Either way it's the same information with different parsing trade-offs. It's still better than storing two hashes, one of them potentially one step out of date and the other hard to retrieve for old versions.
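If it helps, the first shape also maps neatly onto an internally tagged enum in Rust via serde (a sketch; the variant and field names are assumptions):

```rust
use serde::{Deserialize, Serialize};

/// A version identifier tagged with the scheme that produced it, so tooling
/// knows which VCS/hash system to use.
#[derive(Serialize, Deserialize)]
#[serde(tag = "scheme", rename_all = "kebab-case")]
enum VersionHash {
    /// A Git object (blob) hash, as produced by `git hash-object`.
    GitObject { hash: String },
    /// A plain content hash computed by the tool itself, for use without a VCS.
    Sha256 { hash: String },
}
```

This serializes `VersionHash::GitObject { .. }` as `{ "scheme": "git-object", "hash": "..." }`, and an unknown scheme fails loudly at parse time rather than being silently misread.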
What about forgoing the external JSON data and attaching this meta information directly to translations, with some predefined format in a code comment in the translation file itself? This would be a trade-off in pain points of course, but it would mean you didn't have to store the message key separately at all, didn't have to worry about tracking it separately, could be more human readable than a separate file with just meta information, etc.

I suppose the major trade-off is that it would make it harder to mix and match with other tooling that was not aware of the meaning in the comments and might blow away the comments altogether. I don't know how common that is. For Fluent, the language specs a type of comment that stays attached to messages, so tooling should support such a use case already. For gettext things are a bit more ad hoc (although personally I don't care, Fluent being the only sane localization system in the space right now, with MF2 as a potential future peer).
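To illustrate with Fluent: a `#` comment placed directly above a message is attached to that message by the parser, so the tracking data could ride along in some agreed-upon convention. The `tracking:` prefix and the message here are purely hypothetical:

```ftl
# tracking: git-object#119142f1bf27dcb9e059495206c64c404db90af4
welcome-message = Tervetuloa!
```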
I agree, the first example you listed looks good; I prefer the denormalized format too. I also agree that JSON definitely has its problems, but at least the universality of its tooling means that people are less likely to worry about being locked into using our system, and they can easily manipulate it in automated workflows we haven't yet conceived of.
I did consider this, and it's a great idea, but I am assuming many parsing libraries for formats like these don't really support the editing workflow very well (I could be wrong, and it would be worth validating this assumption), especially not without reformatting the entire file, which may mess with version control and with this change tracking system itself. I suppose this is something we will have to grapple with later if we want a GUI for editing translations, but I wouldn't want to limit this change tracking to systems that have a suitable parsing library. Also, if we go with an external format, then hopefully our implementation across different formats will be more consistent.

For the embedded use case, where I plan to support formats like JSON/YAML key/value (that could get transformed into some slim binary format), we perhaps don't have the capability to attach this metadata to the messages and would have to rely on an external file anyway.

What do you think? I'm appreciating having someone to bounce these ideas around with!
To add to this proposal, I would like to try to make the change tracking generic over the source of messages, and also to have a pluggable storage backend. I don't currently have a personal use case to motivate this work, because https://github.com/kellpossible/avalanche-report has been granted an open source license for Crowdin to translate its fluent messages. However, we will soon need a way to translate user generated long form content, so ideally the tools to track translations of this content can be made into a library that can be used in a web server. If I can combine the two somehow, it may be easier for me to stay motivated to finish it.
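A sketch of what "generic over the source of messages" plus a pluggable storage backend could mean in practice (all names here are speculative):

```rust
/// Where translatable messages come from: Fluent files on disk, database
/// rows of user generated content, etc.
trait MessageSource {
    fn message_ids(&self) -> Vec<String>;
    fn message(&self, id: &str) -> Option<String>;
}

/// The tracking record stored per message (as sketched earlier).
struct MessageRecord {
    file_hash: String,
    git_hash: String,
}

/// Where the tracking records themselves live: a JSON file for a CLI tool,
/// or a web server's database for user generated content.
trait TrackingStore {
    fn record(&self, id: &str) -> Option<MessageRecord>;
    fn set_record(&mut self, id: &str, record: MessageRecord);
}

/// The change tracking logic would then be written once against these two
/// traits and reused across both use cases.
fn mark_in_sync<S: MessageSource, T: TrackingStore>(
    _source: &S,
    store: &mut T,
    id: &str,
    primary_file_hash: String,
    lcp: String,
) {
    // Per the proposal: set the file hash to the hash of the primary file,
    // and the git hash to the LCP.
    store.set_record(
        id,
        MessageRecord { file_hash: primary_file_hash, git_hash: lcp },
    );
}
```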
It would be nice to have some kind of localization change tracking tool/library for fluent that can indicate when messages are out of date.
Related to #31
Related to #83