Word alignment try 2 #267

johnml1135 · 2024-11-05T21:58:46Z

Add word alignment engine to IInteractiveTranslationEngine.

codecov-commenter · 2024-11-05T22:00:50Z

Codecov Report

Attention: Patch coverage is 38.12155% with 112 lines in your changes missing coverage. Please review.

Project coverage is 69.98%. Comparing base (8319868) to head (f905c2d).

Files with missing lines	Patch %	Lines
src/SIL.Machine/Translation/WordAlignmentResult.cs	0.00%	34 Missing ⚠️
...hine/Translation/SymmetrizedWordAlignmentEngine.cs	70.00%	27 Missing ⚠️
...SIL.Machine/Translation/SymmetrizationHeuristic.cs	0.00%	26 Missing ⚠️
...ine.Translation.Thot/ThotWordAlignmentModelType.cs	0.00%	11 Missing ⚠️
src/SIL.Machine/Corpora/AlignedWordPair.cs	0.00%	9 Missing ⚠️
...Machine.Translation.Thot/ThotWordAlignmentModel.cs	50.00%	4 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #267      +/-   ##
==========================================
- Coverage   70.16%   69.98%   -0.19%     
==========================================
  Files         385      389       +4     
  Lines       31957    32055      +98     
  Branches     4488     4497       +9     
==========================================
+ Hits        22424    22433       +9     
- Misses       8493     8581      +88     
- Partials     1040     1041       +1

ddaspit

What is this change for?

Reviewable status: 0 of 3 files reviewed, all discussions resolved

johnml1135 · 2024-11-06T16:09:45Z

This is needed for adding the word alignment engine to Serval. It just exposes the alignment endpoints to the interactive engine.

johnml1135 · 2024-11-06T16:11:41Z

This needs to be merged and released before the Serval changes will be able to compile.

ddaspit

I'm still not sure I understand what this is for. There are already interfaces for word alignment models. Also, phrase alignment isn't word alignment. That is specific to the Thot SMT engine.

Reviewable status: 0 of 3 files reviewed, all discussions resolved

johnml1135 · 2024-11-07T15:37:22Z

The ThotSmtModel appears to be the best place to add the alignment routines, since the "phrase alignment" just means that the tokenizer can be configured. If I don't use ThotSmtModel, what specifically would I use instead? IWordAligner assumes that the source and target are already tokenized. Also, how would it interact with loading models built by machine.py?

ddaspit

For word alignment, you should use one of the classes that inherits from ThotWordAlignmentModel. For SMT and word alignment models, you will need to tokenize the text. We should just use the LatinWordTokenizer like we do for the SMT engine.

Reviewable status: 0 of 3 files reviewed, all discussions resolved

johnml1135 · 2024-11-08T21:23:15Z

Hmm. It would be quite a bit of reworking. I would have to use a different name than ThotWordAlignmentModel, because that refers only to the asymmetrical alignment, not the symmetrical alignment with a tokenizer. In Python, the word aligner has the tokenizer connected to it. I could rework the Machine word aligner to have the tokenizer in it, but that would be a fair amount of work. The solution I have appears to be a good minimal solution: treat the ThotSmtModel as a SymmetrizedWordAlignmentModel with tokenizers. It already supports having a null truecaser.

Otherwise, I think I would have to create a base class of ThotSmtModel, called something like ThotSymmetrizedWordAlignmentModelWithTokenizer, in which half of the functionality of ThotSmtModel is implemented. And even then, all the configurations, trainers, and everything else would need to be torn apart and rewritten.

I think this minimal change is the best solution - it looks like a word aligner on Serval but is just an SMT model underneath.
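
To make "symmetrized alignment with a tokenizer attached" concrete, here is a minimal sketch in plain Python. It is not code from this PR or from Machine/machine.py: `tokenize`, `identity_aligner`, and `SymmetrizedAlignerWithTokenizer` are hypothetical names, the tokenizer is a toy regex rather than LatinWordTokenizer, and intersection is just one possible symmetrization heuristic.

```python
import re
from typing import Callable, List, Set, Tuple

# An alignment is a set of (source_index, target_index) pairs.
Alignment = Set[Tuple[int, int]]
# A directional (asymmetric) aligner works on already-tokenized input.
DirectionalAligner = Callable[[List[str], List[str]], Alignment]


def tokenize(text: str) -> List[str]:
    """Toy stand-in for a real word tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def identity_aligner(src: List[str], trg: List[str]) -> Alignment:
    """Toy directional aligner (assumption): link each source token to the
    first target token with the same lowercase form, if there is one."""
    links: Alignment = set()
    for i, s in enumerate(src):
        for j, t in enumerate(trg):
            if s.lower() == t.lower():
                links.add((i, j))
                break
    return links


class SymmetrizedAlignerWithTokenizer:
    """Hypothetical wrapper: tokenizes raw text, runs a direct and an inverse
    directional aligner, and keeps only the links both directions agree on."""

    def __init__(self, direct: DirectionalAligner, inverse: DirectionalAligner):
        self.direct = direct
        self.inverse = inverse

    def align(self, source_text: str, target_text: str) -> Alignment:
        src = tokenize(source_text)
        trg = tokenize(target_text)
        forward = self.direct(src, trg)
        # The inverse aligner returns (target_index, source_index); flip the pairs.
        backward = {(i, j) for (j, i) in self.inverse(trg, src)}
        return forward & backward  # intersection heuristic


aligner = SymmetrizedAlignerWithTokenizer(identity_aligner, identity_aligner)
print(sorted(aligner.align("a b a", "a a b")))  # [(0, 0), (1, 2)]
```

The point of the sketch is only the shape of the interface: the caller passes untokenized text, and the asymmetric models stay behind the wrapper.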

ddaspit

The ThotSmtModel is a full phrase-based SMT system and takes a lot more computation and time to train. The phrase alignment from the SMT model uses a different algorithm than the word alignment models and is much more expensive. Unfortunately, it is not a replacement for the word alignment models. We should meet to discuss how best to proceed. I'm sure that if I had a better understanding of what you are trying to achieve, we could come up with a good solution.

Reviewable status: 0 of 3 files reviewed, all discussions resolved

johnml1135 requested a review from ddaspit November 5, 2024 21:58

ddaspit reviewed Nov 5, 2024

ddaspit reviewed Nov 6, 2024

ddaspit reviewed Nov 7, 2024

ddaspit reviewed Nov 9, 2024

johnml1135 force-pushed the word_alignment_try_2 branch 3 times, most recently from 3c8ddd6 to 1f29ecd Compare November 27, 2024 17:34

a start

f905c2d

Add tokenizer to trainer

johnml1135 force-pushed the word_alignment_try_2 branch from 1f29ecd to f905c2d Compare December 9, 2024 16:31
