Missing minus signs and post-edit regex rule #67

SafeTex · 2023-02-22T19:43:07Z

Hello

I had a job today where minus signs (hyphens) before numbers were missing if the string occurred inside brackets, e.g.
(- 2 500 TSEK)
translated as
(TSEK 2 500)
instead of
(TSEK -2 500)

So I want to write my first regex rule (post editing) for OPUS but it does not work.

Can anyone tell me why?
Thanks

TommiNieminen · 2023-02-22T20:15:33Z

Hi,

The problem is that $1 copies the first capture group in the post-edit pattern, but there are no capture groups defined in the post-edit pattern. The capturing group that you want to use is in the source pattern, and in post-edit replacement that group can be copied by using $<1>. So changing the Post-edit replacement value to -$<1> should resolve the issue.
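To illustrate the distinction, here is a minimal Python sketch of how such a rule could behave. It is only a model, not OPUS-CAT's actual code: the function `apply_post_edit_rule`, the two example patterns, and the way `$<n>` is expanded are all assumptions made for illustration.

```python
import re

# Hypothetical model: "$<n>" in the replacement refers to group n of the
# SOURCE pattern match, not to a group in the post-edit pattern.
SOURCE_GROUP_REF = re.compile(r"\$<(\d+)>")

def apply_post_edit_rule(source, mt_output, source_pattern,
                         post_edit_pattern, replacement):
    source_match = re.search(source_pattern, source)
    if source_match is None:
        # The source pattern acts as a condition: no match, no change.
        return mt_output
    # Fill in $<n> with the text captured from the source segment.
    expanded = SOURCE_GROUP_REF.sub(
        lambda ref: source_match.group(int(ref.group(1))), replacement)
    # Replace every match of the post-edit pattern in the MT output.
    return re.sub(post_edit_pattern, lambda _m: expanded, mt_output)

result = apply_post_edit_rule(
    source="(- 2 500 TSEK)",
    mt_output="(TSEK 2 500)",
    source_pattern=r"-\s?(\d+(?: \d{3})*)",  # group 1 captures "2 500"
    post_edit_pattern=r"\d+(?: \d{3})*",     # no capture groups here
    replacement="-$<1>",
)
print(result)  # (TSEK -2 500)
```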

Btw., it might be the space after the minus that is causing the minus to be dropped, so it might be easier to solve this using a pre-editing rule that removes the space after the minus sign (this might work better if there are multiple problematic minus signs in the same sentence).
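A pre-editing rule of that kind could look roughly like the following Python sketch. The function name and the pattern are assumptions for illustration, and whether the cleaned-up source actually makes the MT keep the minus sign would still need to be tested.

```python
import re

def remove_space_after_minus(source):
    # Drop whitespace between a minus sign and the digit that follows it.
    # The lookahead (?=\d) leaves the digit itself untouched.
    return re.sub(r"-\s+(?=\d)", "-", source)

cleaned = remove_space_after_minus("(- 2 500 TSEK)")
print(cleaned)  # (-2 500 TSEK)
```

Note that this pattern would also turn a range written as "10 - 20" into "10 -20", so it may need narrowing for texts that contain ranges.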

SafeTex · 2023-02-23T19:37:26Z

Hello Tommi

I decided to simplify my rule as the brackets don't really matter. What matters is that if I have a minus followed by a digit, then the minus sign is not lost

So I tried a simplified rule which works when there is one digit like "-5", but as you can see below it is invasive when there are several digits like 500

While I understand what's happening, I still don't see why, as in the source pattern I've clearly stated that this rule should only be applied if the digit is preceded by a minus sign.

And in "-500", only the first digit (5) is preceded by a minus sign, so why is the rule also applied to the other digits (0s) that follow?

Sorry to be a pain but regex is not easy and I only learned the basics a few years ago. I'm not a programmer so it's a struggle

TommiNieminen · 2023-02-23T20:13:29Z

In post-edit rules, the source pattern is only considered a condition for applying the rule, so if the source text matches the pattern, the rule is applied to the MT output as many times as possible. Since the post-edit pattern specified (\d) applies to every single digit, the replacement is also performed for every digit in the MT output, which is why a minus sign is added before each digit.
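The effect can be reproduced outside OPUS-CAT with any regex engine. Here is a small Python sketch (the sample strings are made up for illustration) in which a single-digit pattern is replaced at every match:

```python
import re

# (\d) matches one digit at a time, so the replacement runs once per digit.
one_digit = re.sub(r"(\d)", r"-\1", "(TSEK 5)")
three_digits = re.sub(r"(\d)", r"-\1", "(TSEK 500)")

print(one_digit)     # (TSEK -5)
print(three_digits)  # (TSEK -5-0-0)
```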

You can solve this problem by using a post-edit pattern that matches all adjacent digits at once instead of matching each digit separately. The regex operator for achieving that is the plus sign. To match all strings of consecutive digits, use \d+. Since numbers may have spaces, commas or periods in addition to digits, it may be necessary to use a pattern that matches all of those characters, like this: [\d,. ]+
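As a rough Python sketch of the difference (the helper `add_minus` and the sample strings are made up for illustration): the character class as written can also start its match on a space, so the last line tries a variant, `\d[\d,. ]*`, that is my own assumption and forces the match to begin on a digit.

```python
import re

def add_minus(pattern, mt_output):
    """Prefix every match of pattern in mt_output with a minus sign."""
    return re.sub(pattern, lambda m: "-" + m.group(0), mt_output)

print(add_minus(r"\d+", "(TSEK 500)"))           # (TSEK -500)
print(add_minus(r"\d+", "(TSEK 2 500)"))         # (TSEK -2 -500)
print(add_minus(r"[\d,. ]+", "(TSEK 2 500)"))    # (TSEK- 2 500)
print(add_minus(r"\d[\d,. ]*", "(TSEK 2 500)"))  # (TSEK -2 500)
```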

SafeTex · 2023-02-23T22:59:49Z

Hello Tommi

I don't think that works either cos if I have
-500 000
mistranslated as
500 000
and I change it with a post editing rule to
-500 000
the same rule will also change
600 000
to
-600 000
where it should not change it.
I've tested this to confirm my doubt and the result was:

or am I missing something?
Thanks in advance

TommiNieminen · 2023-02-24T09:21:10Z

Yes, that's a limitation of regular expressions: they are difficult to target exactly. That's why in your case it might be better to use a pre-edit rule that formats the source number in such a way as to nudge the MT to use the minus sign correctly. The post-edit minus sign correction is more useful in cases where you expect there to be just one number in the segment.
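A small Python sketch of both points, under stated assumptions: the `NUMBER` pattern, the function name, and the sample segments are invented for illustration, and the "exactly one number" check is just one possible way to express the single-number restriction, not something OPUS-CAT provides.

```python
import re

# Assumed number format: digits with optional space-separated groups of three.
NUMBER = re.compile(r"\d+(?: \d{3})*")

def add_minus_if_single_number(source, mt_output):
    source_has_negative = re.search(r"-\s?\d", source) is not None
    output_has_minus = re.search(r"-\s?\d", mt_output) is not None
    numbers = NUMBER.findall(mt_output)
    # Only correct when it is unambiguous which number lost its sign.
    if source_has_negative and not output_has_minus and len(numbers) == 1:
        return NUMBER.sub(lambda m: "-" + m.group(0), mt_output)
    return mt_output

# Applying the replacement to every number changes too much:
unguarded = NUMBER.sub(lambda m: "-" + m.group(0), "500 000 and 600 000")
print(unguarded)  # -500 000 and -600 000

# The guarded version leaves ambiguous segments alone:
print(add_minus_if_single_number("-500 000 and 600 000", "500 000 and 600 000"))
# 500 000 and 600 000
print(add_minus_if_single_number("-500 000", "500 000"))  # -500 000
```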

SafeTex · 2023-02-24T12:29:56Z

Hello Tommi

I did manage to write a regex with an atomic group in it that works for numbers like
5
50
500
5 000
5 000 000
with the minus sign being inserted multiple times in the last two examples and I can easily make that apply to segments with just one number.

It may be inelegant but it's:

(\d{1,3})((?>\s\d{3}))?((?>\s\d{3}))?
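For anyone who wants to try the idea outside OPUS-CAT, here is a Python sketch of an equivalent pattern. Python's re module only accepts atomic groups from version 3.11 on, so this version swaps them for plain non-capturing groups, which match the same strings for the examples listed. The name `GROUPED_NUMBER` and the helper function are assumptions for illustration.

```python
import re

# One to three digits, then up to two space-separated groups of three digits.
GROUPED_NUMBER = re.compile(r"(\d{1,3})(?:\s\d{3})?(?:\s\d{3})?")

def add_single_minus(mt_output):
    # Each number is matched as a whole, so it gets exactly one minus sign.
    return GROUPED_NUMBER.sub(lambda m: "-" + m.group(0), mt_output)

for sample in ["5", "50", "500", "5 000", "5 000 000"]:
    print(add_single_minus(sample))
# -5
# -50
# -500
# -5 000
# -5 000 000
```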

I'm not sure what causes this problem in the source text so I'd have to look at that more before attempting a source rule

Thanks for all your help and I think I'll have another question soon if you don't mind
