Missing minus signs and post-edit regex rule #67
Hi, the problem is that $1 copies the first capture group of the post-edit pattern, but your post-edit pattern defines no capture groups. The capture group you want to use is in the source pattern, and in the post-edit replacement that group is copied with $<1>. So changing the post-edit replacement value to -$<1> should resolve the issue. By the way, it may be the space after the minus that causes the minus to be dropped, so it might be easier to solve this with a pre-editing rule that removes the space after the minus sign (this might work better if there are multiple problematic minus signs in the same sentence).
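The pre-editing idea at the end of the comment above can be sketched outside the tool. This is a minimal Python illustration, not the tool's own rule syntax: the pattern and the `pre_edit` helper are assumptions made up for the example, and the regex flavour used by the tool may differ in details.

```python
import re

# Hypothetical pre-edit pattern: a minus sign, some whitespace, then a digit.
# The lookahead (?=\d) keeps the digit out of the match, so only the
# whitespace between the minus and the number is removed.
MINUS_GAP = re.compile(r"-\s+(?=\d)")

def pre_edit(source: str) -> str:
    """Close the gap between a minus sign and the number that follows it."""
    return MINUS_GAP.sub("-", source)

print(pre_edit("(- 2 500 TSEK)"))  # (-2 500 TSEK)
```

Because the lookahead requires a digit, a hyphen used as a dash between words (as in "a - b") is left alone.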
Hello Tommi, I decided to simplify my rule, as the brackets don't really matter. What matters is that when a minus is followed by a digit, the minus sign is not lost. So I tried a simplified rule, which works when there is one digit like "-5", but as you can see below it is invasive when there are several digits, like 500. While I understand what's happening, I still don't see why: in the source pattern I've clearly stated that this rule should only be applied if the digit is preceded by a minus sign. In "-500", only the first digit (5) is preceded by a minus sign, so why is the rule also applied to the other digits (the 0s) that follow? Sorry to be a pain, but regex is not easy and I only learned the basics a few years ago. I'm not a programmer, so it's a struggle.
In post-edit rules, the source pattern is only a condition for applying the rule: if the source text matches the pattern, the rule is applied to the MT output as many times as possible. Since the post-edit pattern you specified (\d) matches every single digit, the replacement is performed for every digit in the MT output, which is why a minus sign is added before each digit. You can solve this by using a post-edit pattern that matches all adjacent digits at once instead of matching each digit separately. The regex operator for that is the plus sign: to match each run of consecutive digits, use \d+. Since numbers may contain spaces, commas or periods in addition to digits, it may be necessary to use a pattern that matches all of those characters, like this: [\d,. ]+
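The difference between \d and \d+ described above can be reproduced with a small Python emulation. This is only a sketch of the behaviour as explained in this thread, not the tool's actual implementation: the `post_edit` function, the source pattern and the sample strings are all hypothetical.

```python
import re

def post_edit(source, mt_output, source_pattern, post_pattern, replacement):
    """Emulate a post-edit rule as described above (hypothetical helper).

    The source pattern only decides whether the rule fires; the post-edit
    pattern is then replaced everywhere it matches in the MT output.
    """
    if re.search(source_pattern, source) is None:
        return mt_output
    return re.sub(post_pattern, replacement, mt_output)

source = "-500 TSEK"   # made-up source segment
mt = "TSEK 500"        # made-up MT output with the minus sign lost

# \g<0> is Python's way of inserting the whole match into the replacement.
per_digit = post_edit(source, mt, r"-\d", r"\d", r"-\g<0>")
whole_run = post_edit(source, mt, r"-\d", r"\d+", r"-\g<0>")

print(per_digit)  # TSEK -5-0-0
print(whole_run)  # TSEK -500
```

With \d, each of the three digits is a separate match and gets its own minus sign; with \d+, the run "500" is one match and gets one.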
Hello Tommi, I don't think that works either, because if I have
or am I missing something?
Yes, that's a limitation of regular expressions: they are difficult to target exactly. That's why in your case it might be better to use a pre-edit rule that formats the source number in a way that nudges the MT to use the minus sign correctly. The post-edit minus sign correction is more useful in cases where you expect just one number in the segment.
Hello Tommi, I did manage to write a regex with an atomic group in it that works for numbers like It may be inelegant, but it's: (\d{1,3})((?>\s\d{3}))?((?>\s\d{3}))? I'm not sure what causes this problem in the source text, so I'd have to look at that more before attempting a source rule. Thanks for all your help. I think I'll have another question soon, if you don't mind.
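For anyone wanting to try the pattern above outside the tool, here is a rough Python sketch. It makes two assumptions: the sample strings are invented for illustration, and the atomic groups (?>...) are swapped for plain non-capturing groups (?:...), because Python only supports atomic groups from version 3.11. For this pattern the two forms match the same strings, since \s\d{3} has only one way to match and nothing to backtrack into.

```python
import re

# The pattern from the comment above, with (?>...) written as (?:...).
# It matches 1-3 digits followed by up to two space-separated groups of
# three digits, e.g. "500", "2 500" or "1 234 567".
GROUPED_NUMBER = re.compile(r"(\d{1,3})((?:\s\d{3}))?((?:\s\d{3}))?")

def add_minus(mt_output: str) -> str:
    """Put a minus sign in front of each grouped number (hypothetical helper)."""
    return GROUPED_NUMBER.sub(lambda m: "-" + m.group(0), mt_output)

print(add_minus("(TSEK 2 500)"))  # (TSEK -2 500)
```

Like any post-edit pattern, this still adds a minus to every number in the segment, so it is best suited to segments that contain a single number.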
Hello
I had a job today where minus signs (hyphens) before numbers were missing if the string occurred inside brackets, e.g.
(- 2 500 TSEK)
translated as
(TSEK 2 500)
instead of
(TSEK -2 500)
So I wanted to write my first regex rule (post-editing) for OPUS, but it does not work.
Can anyone tell me why?
Thanks