long translation segments truncated #93
Hi,
Good to hear OPUS-CAT is working for you. The problems you are having are probably caused by an input truncation setting in the Marian NMT decoder that OPUS-CAT uses. This is normal in NMT: translating very long sentences is slow, and the model has not been trained to handle them, since very long sentences are usually removed from the training data.
This has come up before, and there is a version of OPUS-CAT with a workaround for really long input; you can get it here: https://github.com/Helsinki-NLP/OPUS-CAT/releases/tag/engine_v1.2.4 In this version, long input sentences are split into smaller chunks, and the translations are then merged (this can cause some grammatical problems at the merge points). Here are instructions for enabling the functionality:
The splitting feature is not on by default; you can enable it in the OpusCatMTEngine.exe.config file (in the same folder as the OPUS-CAT executable). The relevant configuration parameters are the following:
- MaxLength: Default value is 200 (this refers to subword units, so it is fewer than 200 real words).
- FixUnbalancedLongTranslations: Default value is False; changing the value to True enables the splitting feature.
- UnbalancedSplitPatterns: A list of patterns used to split the source sentence when the translation is significantly shorter than the source text. The splitting algorithm iterates the list from the top, looking for instances of each pattern in the source sentence. When it finds one or more matches for a pattern, it splits the source sentence in two at the centermost match. The two parts are then translated separately, and they can in turn be split recursively into smaller parts if their translations are still significantly shorter than the source text.
- UnbalancedSplitMinLength: Default value is 100. This is the minimum source sentence length, in subword units, for the splitting function to be applied. The motivation for this limit is that for relatively short source sentences the translation might legitimately be much shorter, while in longer sentences the lengths tend to even out.
- UnbalancedSplitLengthRatio: Default value is 1.5. This is the ratio of source text length to translation length above which the translation is considered too short. So if the source text length is 150, the translation is considered too short if its length is 100 or less.
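To make the splitting behaviour easier to follow, here is a rough Python sketch of the heuristic as described above. It is not the actual OPUS-CAT (C#) implementation: the function names, the translate callback, and the token-level pattern matching are placeholders, and only the default values mentioned above are reused.

```python
import re

# Defaults as described above; treat them as illustrative placeholders.
UNBALANCED_SPLIT_MIN_LENGTH = 100    # minimum source length (subword units) before splitting applies
UNBALANCED_SPLIT_LENGTH_RATIO = 1.5  # source/translation length ratio that counts as "too short"

def split_and_translate(source_tokens, translate, patterns):
    """Recursively split an over-long source segment when its translation
    comes back suspiciously short, then merge the partial translations.

    source_tokens: list of subword tokens for one segment
    translate:     callable taking a token list and returning translated tokens
    patterns:      ordered list of regex patterns to try as split points
    """
    translation = translate(source_tokens)

    # Keep the translation if the segment is short, or if its length is
    # balanced against the source (source length < ratio * translation length).
    if (len(source_tokens) < UNBALANCED_SPLIT_MIN_LENGTH
            or len(source_tokens) < UNBALANCED_SPLIT_LENGTH_RATIO * len(translation)):
        return translation

    # Try each pattern in order; split at the match closest to the middle.
    middle = len(source_tokens) // 2
    for pattern in patterns:
        matches = [i for i, token in enumerate(source_tokens)
                   if re.search(pattern, token) and i + 1 < len(source_tokens)]
        if matches:
            split_at = min(matches, key=lambda i: abs(i - middle))
            left = source_tokens[:split_at + 1]
            right = source_tokens[split_at + 1:]
            # Each half may be split again if its translation is still too short;
            # plain concatenation at the merge point can cause grammatical glitches.
            return (split_and_translate(left, translate, patterns)
                    + split_and_translate(right, translate, patterns))

    # No usable split point found: fall back to the (short) translation.
    return translation
```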
Hi Tommi,
Thanks for your feedback; it looks like the long-sentence handling should be improved, so I'll mark this as an enhancement for the future.
Hi,
I am a professional freelance translator (en>de) who has been translating mainly patents for many years. I am also an amateur Python coder and have built my own MT tools to help with my translation work. So far, no other MT system I have tried has come close enough to my own search-and-replace/rule-based system to consider switching. Until I recently came across OPUS-CAT, that is.
After fine-tuning one of the basic OPUS models with my massive patent TMX file, the results are stunning, even scary for someone who makes a living from translation. I decided to try OPUS-CAT more thoroughly: I tested it with the SDL Trados plugin and also found a way to convert a fine-tuned OPUS-CAT model for CTranslate2 so I could use it with my own Python code. There were, of course, a number of issues along both routes, all of which I could solve except one:
Long English input segments (>60 words, very common in patents) consistently seem to produce translations that appear truncated after a certain, though varying, length/number of words. Although I didn't really know what I was doing, I tried a few different configuration parameter settings, both for fine-tuning the model and for using the model with CTranslate2, with no success at all.
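For the CTranslate2 route specifically, the decoder has its own length limits that can produce this kind of truncation. The sketch below only shows where those limits live in the CTranslate2 Python API; the model directory and SentencePiece file names are placeholders, the exact defaults depend on the CTranslate2 version, and whether raising them is enough for a Marian-converted OPUS model is an assumption rather than a confirmed fix.

```python
import ctranslate2
import sentencepiece as spm

# Placeholder paths: adjust to wherever the converted model and the
# SentencePiece models from the OPUS model package were put.
translator = ctranslate2.Translator("ct2_model_dir")
sp_source = spm.SentencePieceProcessor(model_file="source.spm")
sp_target = spm.SentencePieceProcessor(model_file="target.spm")

segment = "A very long patent claim of several hundred words ..."
source_tokens = sp_source.encode(segment, out_type=str)

results = translator.translate_batch(
    [source_tokens],
    max_input_length=0,        # 0 disables input truncation (the default truncates long inputs)
    max_decoding_length=1024,  # raise the cap on generated tokens (the default is much lower)
)
print(sp_target.decode(results[0].hypotheses[0]))
```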
Hopefully you have some good ideas or pointers on how to make OPUS-CAT digest English segments of up to 400 words and produce a German result of the full corresponding length.