Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move preprocess logic to toolkit #512

Merged
merged 31 commits into from
Nov 26, 2024
Merged

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Oct 16, 2024

Fixes #396

How should we go about testing the new Echo engine functionality? Just add integration tests?


This change is Reviewable

@johnml1135
Copy link
Collaborator

src/Serval/src/Serval.Shared/Serval.Shared.csproj line 22 at r1 (raw file):

		<PackageReference Include="Grpc.HealthCheck" Version="2.65.0" />
		<PackageReference Include="Grpc.Net.ClientFactory" Version="2.65.0" />
		<PackageReference Include="SIL.Machine" Version="3.2.8" Condition="!Exists('..\..\..\..\..\machine\src\SIL.Machine\SIL.Machine.csproj')" />

Pull the most recent from main - we are up to Machine 3.4.0 - and there are 4 locations that all need to be updated together.

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 157 at r1 (raw file):

            }

            foreach (Row row in AlignPretranslateCorpus(sourcePretranslateCorpora, targetCorpora[0].TextCorpus))

Pretranslations are not universal. Could this be broken up into a few major pieces - and then one Machine specific function inherit from this class to tie the main pieces together? The main processing of "monocorpus" and "parallelCorpus" are universal - and we should be able to have a few routines to slice and dice these things so that the 10 line linq expression only happens once...

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 179 at r1 (raw file):

    )
    {
        srcCorpora = srcCorpora.Select(sc => sc.Transform(CleanSegment)).ToArray();

So does this follow the logic - Mix the sources, chose the first target? I can't see it clearly - should it be documented here?

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 295 at r1 (raw file):

    }

    private static IEnumerable<Row> AlignPretranslateCorpus(ITextCorpus[] srcCorpora, ITextCorpus trgCorpus)

Align Train and Align Pretranslate. If these will be in the ServiceToolkit, could we change the name to something more generic? AlignTrainCorpus may be good enough, but I would prefer something like "AlignResultsGenerationCorpus" or even more generically, "AlignMixSource" and "AlignChooseFirstSource". Would that work?

@johnml1135
Copy link
Collaborator

Add a new set of unit tests in service toolkit?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 22 of 22 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Fix build error

Update Echo engine to use toolkit

Fix async stream issue

Fix test: Add ability to specify CorpusService mock
@Enkidu93 Enkidu93 force-pushed the move_preprocess_logic_to_toolkit branch from 5a61291 to 3ec2bce Compare October 16, 2024 18:46
Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a unit test in the toolkit. I think ideally we'll port almost all the current PreprocessBuildTests to the toolkit. I'm wondering though if we should try to get this in first and then follow up with that in a separate PR since this PR will be difficult to merge once other changes have been put through.

Reviewable status: 17 of 29 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Echo/src/EchoTranslationEngine/TranslationEngineServiceV1.cs line 98 at r3 (raw file):

                                    },
                                    cancellationToken
                                );

@ddaspit, it might just be because it's late, but it's not clear to me how to go about doing this: I seem to be doing something wrong with how I'm passing in this async delegate and it's causing the Echo engine to hang and not respond. Any thoughts? I've not exhausted my debugging (although the debugger keeps crashing on me), but I thought maybe you'd just see it right off.


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 94 at r3 (raw file):

    [Test]
    public async Task RunAsync_PretranslateTextIdsOverlapWithTrainOnTextIds()

I also fixed a bug regarding this situation and added a test.


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 157 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Pretranslations are not universal. Could this be broken up into a few major pieces - and then one Machine specific function inherit from this class to tie the main pieces together? The main processing of "monocorpus" and "parallelCorpus" are universal - and we should be able to have a few routines to slice and dice these things so that the 10 line linq expression only happens once...

Discussed in meeting


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 179 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So does this follow the logic - Mix the sources, chose the first target? I can't see it clearly - should it be documented here?

That happens elsewhere, but yep, that's the logic. I can add more comments if something seems unclear.


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 295 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Align Train and Align Pretranslate. If these will be in the ServiceToolkit, could we change the name to something more generic? AlignTrainCorpus may be good enough, but I would prefer something like "AlignResultsGenerationCorpus" or even more generically, "AlignMixSource" and "AlignChooseFirstSource". Would that work?

Discussed in meeting.


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 313 at r3 (raw file):

                .SelectMany(sc => trgCorpora.Select(tc => sc.AlignRows(tc, allSourceRows: true)))
                .ZipMany(rows =>
                    rows.Where(r => r.SourceSegment.Count > 0 && r.TargetSegment.Count == 0).FirstOrDefault()

This addresses #514


src/Serval/src/Serval.Shared/Serval.Shared.csproj line 22 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Pull the most recent from main - we are up to Machine 3.4.0 - and there are 4 locations that all need to be updated together.

Done.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 17 of 29 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Echo/src/EchoTranslationEngine/TranslationEngineServiceV1.cs line 98 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

@ddaspit, it might just be because it's late, but it's not clear to me how to go about doing this: I seem to be doing something wrong with how I'm passing in this async delegate and it's causing the Echo engine to hang and not respond. Any thoughts? I've not exhausted my debugging (although the debugger keeps crashing on me), but I thought maybe you'd just see it right off.

Nevermind. It didn't have to do with that at all. I just had an outdated version of Machine locally and was hitting the bug John had fixed in the ZipMany function where it hits and infinite loop.

@johnml1135
Copy link
Collaborator

src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 209 at r4 (raw file):

            Assert.That(termCount, Is.EqualTo(0));
        });
        Assert.That(

Were these tests wrong originally? Why did the values change?

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/test/SIL.ServiceToolkit/Utils/data/source2.txt line 1 at r4 (raw file):

Source two, Line 1

These should be removed from Machine/test, they are no longer needed.

@johnml1135
Copy link
Collaborator

Agreed. Let's just do this small thing and get it released to solve the immediate issue.

@johnml1135
Copy link
Collaborator

src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 94 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I also fixed a bug regarding this situation and added a test.

Great!

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 313 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This addresses #514

Looks good to me.

@codecov-commenter
Copy link

codecov-commenter commented Oct 17, 2024

Codecov Report

Attention: Patch coverage is 87.39130% with 29 lines in your changes missing coverage. Please review.

Project coverage is 57.03%. Comparing base (3fedf25) to head (8db0bad).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...kit/Services/ParallelCorpusPreprocessingService.cs 92.12% 7 Missing and 6 partials ⚠️
...hared/Services/ServalTranslationEngineServiceV1.cs 0.00% 5 Missing ⚠️
...kit/Configuration/IServiceCollectionsExtensions.cs 0.00% 5 Missing ⚠️
....Shared/Configuration/IMachineBuilderExtensions.cs 0.00% 4 Missing ⚠️
...ared/Configuration/IServiceCollectionExtensions.cs 0.00% 1 Missing ⚠️
...ed.Tests/Services/ScriptureDataFileServiceTests.cs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #512      +/-   ##
==========================================
- Coverage   57.25%   57.03%   -0.22%     
==========================================
  Files         299      302       +3     
  Lines       15664    15605      -59     
  Branches     2163     2151      -12     
==========================================
- Hits         8968     8900      -68     
- Misses       6054     6063       +9     
  Partials      642      642              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge, everything is working properly now on this branch. The echo engine is only covered by an E2E test, so we should consider either giving up on testing the Echo engine too extensively OR adding an echo-specific suite of tests. I'm not too bothered by this since the logic is now shared via the toolkit (i.e., the tricky part is tested elsewhere). Also, like I mentioned above, the preprocess build job tests should be ported eventually.

Reviewable status: 16 of 31 files reviewed, 10 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 209 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Were these tests wrong originally? Why did the values change?

Most of these are reversions to the values they had pre-mixed-source-support. I was confused about the intended functionality regarding pretranslation, so I changed the values. But with the changes in this PR, they can be changed back. Let me hand reckon the numbers one more time to make sure.


src/Serval/test/Serval.E2ETests/ServalClientHelper.cs line 194 at r6 (raw file):

                FileFormat.Text,
                targetLanguage,
                isTarget: true

@johnml1135, do you mind if I just do it like this? I found the echo-specific logic in the helper to be confusing.


src/ServiceToolkit/test/SIL.ServiceToolkit/Utils/data/source2.txt line 1 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

These should be removed from Machine/test, they are no longer needed.

Once the other tests are ported, they won't be needed, right. But for now, they're duplicated.

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Nov 8, 2024

I'll go ahead and rebase this once the machine side is through, but I've made the updates here so you can see how the NParallelTextCorpus will be used.

@Enkidu93
Copy link
Collaborator Author

Also fixes #536

@Enkidu93
Copy link
Collaborator Author

Fixes #533 as well

@Enkidu93
Copy link
Collaborator Author

This also follows up on sillsdev/machine#263

@Enkidu93
Copy link
Collaborator Author

I believe that this is ready for review, @ddaspit & @johnml1135

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 43 at r10 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The filtering logic for source and target are the same. This should be pulled out into a separate function.

I agree - the readability would be improved.

@Enkidu93
Copy link
Collaborator Author

I believe that this is ready for review, @ddaspit & @johnml1135

Oh, I'm sorry, @ddaspit. I don't know how I missed your earlier comments. I'll get right on those.

@johnml1135 johnml1135 requested a review from ddaspit November 26, 2024 15:57
Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 22 files at r8, 3 of 10 files at r10, 2 of 6 files at r11, 2 of 5 files at r12, 1 of 1 files at r13, 1 of 2 files at r14, 6 of 6 files at r15, 1 of 1 files at r16, 4 of 4 files at r17, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 38 of 39 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Serval/src/Serval.Translation/Services/EngineService.cs line 687 at r6 (raw file):

Previously, johnml1135 (John Lambert) wrote…

What is the significance of the logic change?

It's been a while, so I might not be answering your question correctly: If a deprecated corpus has no source files or target files, that ought to map to a parallel corpus that has no source corpora or target corpora respectively.


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 43 at r10 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I agree - the readability would be improved.

Done.


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 46 at r10 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

There is no need to perform this Where operation if TrainOnChapters is null. Also, we should call FilterTexts on the books specified in TrainOnChapters, so we don't unnecessarily parse books.

Done.


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 56 at r10 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would go ahead and pull this out to a separate function as well to improve readability.

Done.


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 63 at r10 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We exclude verses from the training corpus when PretranslateTextIds or PretranslateChapters is specified. We should also exclude verses from the training corpus when neither is specified.

At the moment, we're doing it the other way around: excluding training content from pretranslation, not pretranslation from training content. Regardless, is this the behavior we want? At the moment, I think the issue you are describing is 'covered' in the PreprocessBuildJob:

if (row.SourceSegment.Length > 0 && row.TargetSegment.Length == 0)
{
    pretranslateWriter.WriteStartObject();
    ...
    pretranslateCount++;
}

If trainOn is null but so is pretranslate or if pretranslateTextIds has a value, what should our behavior be? Do we only want to exclude things at the textId/book level or in these circumstances, wouldn't it be better to look at the row-level? Otherwise, how could you get the train-on-all-parallel-data-pretranslate-all-non-parallel data functionality (which seems like the most straightforward use)?


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 89 at r10 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This check can be moved up higher.

Done.


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 3 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be implemented as a service that we inject using DI. You should create an extension method that adds the required services to the DI in an IServiceCollectionExtensions class.

This is something you fixed while I was gone, right?


src/ServiceToolkit/test/SIL.ServiceToolkit/SIL.ServiceToolkit.Tests.csproj line at r7 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The directory name should match the assembly name.

This is something you fixed while I was gone, right?

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 119 at r18 (raw file):

        if (corpus.PretranslateTextIds is not null)
        {
            return textCorpus.FilterTexts(corpus.PretranslateTextIds.Except(corpus.TrainOnTextIds ?? new()));

Sorry for beating the dead horse - but I thought that we arrived at this logic for pretranslating: #533 (comment) - On a per ParallelCorpus basis, if there is target training data that passes the filter, those verses will not be pretranslated. Therefore, the following would hold:

  • The training filters are not used at all when filtering for pretranslation
  • The filtered training target is aligned with the filtered source pretranslations and if the length of the target is > 0, don't pretranslate (which is already implemented in the action passed from PreprocessBuildJob).

@Enkidu93
Copy link
Collaborator Author

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 119 at r18 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Sorry for beating the dead horse - but I thought that we arrived at this logic for pretranslating: #533 (comment) - On a per ParallelCorpus basis, if there is target training data that passes the filter, those verses will not be pretranslated. Therefore, the following would hold:

  • The training filters are not used at all when filtering for pretranslation
  • The filtered training target is aligned with the filtered source pretranslations and if the length of the target is > 0, don't pretranslate (which is already implemented in the action passed from PreprocessBuildJob).

You're totally right, John. There's an inconsistency here. I think I misunderstood the #533 comment and missed the 'if there is data' part, so I was thinking that we were only using the filters to include/exclude, but I kept the old behavior with train on all. If that is what we want, then the changes I just made should make it function like that. This also addresses my comment elsewhere to @ddaspit regarding filtering behavior. This is more consistent.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 6 files at r11, 4 of 5 files at r12, 2 of 2 files at r14, 5 of 6 files at r15, 1 of 1 files at r16, 4 of 4 files at r17, 2 of 2 files at r19, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 63 at r10 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

At the moment, we're doing it the other way around: excluding training content from pretranslation, not pretranslation from training content. Regardless, is this the behavior we want? At the moment, I think the issue you are describing is 'covered' in the PreprocessBuildJob:

if (row.SourceSegment.Length > 0 && row.TargetSegment.Length == 0)
{
    pretranslateWriter.WriteStartObject();
    ...
    pretranslateCount++;
}

If trainOn is null but so is pretranslate or if pretranslateTextIds has a value, what should our behavior be? Do we only want to exclude things at the textId/book level or in these circumstances, wouldn't it be better to look at the row-level? Otherwise, how could you get the train-on-all-parallel-data-pretranslate-all-non-parallel data functionality (which seems like the most straightforward use)?

The code in PreprocessBuildJob should cover it. Yes, we need to check on a row-level if trainOn and pretranslate are not specified.


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 3 at r6 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This is something you fixed while I was gone, right?

Yes, you're correct.


src/ServiceToolkit/test/SIL.ServiceToolkit/SIL.ServiceToolkit.Tests.csproj line at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This is something you fixed while I was gone, right?

Yes, I fixed it.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r19.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@johnml1135 johnml1135 merged commit d851e0b into main Nov 26, 2024
2 checks passed
@johnml1135 johnml1135 deleted the move_preprocess_logic_to_toolkit branch November 26, 2024 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Echo translation engine doesn't work for Paratext projects
4 participants