Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serval-side mixed source support #497

Merged
merged 45 commits into from
Oct 9, 2024
Merged

Serval-side mixed source support #497

merged 45 commits into from
Oct 9, 2024

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Sep 23, 2024

THIS IS A DRAFT

There are still tests to add, refactoring to do, and documentation to add as well as a relatively small change to the Machine repo to allow it to handle multiple target files appropriately, but I think you can see the idea. I'll set this aside until I have your feedback and either your OK to wrap things up or your instructions to jump ship and use a different API design. That being said, I may work on the change to machine since I believe that will be needed regardless.


This change is Reviewable

@Enkidu93 Enkidu93 requested a review from ddaspit September 23, 2024 20:56
@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2024

Codecov Report

Attention: Patch coverage is 69.94144% with 462 lines in your changes missing coverage. Please review.

Project coverage is 56.62%. Comparing base (4382410) to head (7eef883).
Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
.../Serval.DataFiles/Controllers/CorporaController.cs 0.00% 81 Missing ⚠️
...l.Machine.Shared/Services/ClearMLMonitorService.cs 0.00% 64 Missing ⚠️
...l/src/Serval.Translation/Services/EngineService.cs 78.27% 43 Missing and 10 partials ⚠️
...c/Serval.Machine.Shared/Services/ClearMLService.cs 4.87% 39 Missing ⚠️
...lation/Controllers/TranslationEnginesController.cs 86.80% 26 Missing and 12 partials ⚠️
...hared/Services/ServalTranslationEngineServiceV1.cs 0.00% 24 Missing ⚠️
...ine.Shared/Services/DistributedReaderWriterLock.cs 81.73% 12 Missing and 7 partials ⚠️
...viceToolkit/src/SIL.ServiceToolkit/Utils/TaskEx.cs 45.16% 16 Missing and 1 partial ⚠️
.../Serval.Machine.Shared/Services/BuildJobService.cs 69.23% 10 Missing and 2 partials ⚠️
...rval.Machine.Shared/Services/TimeoutInterceptor.cs 0.00% 10 Missing ⚠️
... and 41 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #497      +/-   ##
==========================================
+ Coverage   56.59%   56.62%   +0.02%     
==========================================
  Files         275      299      +24     
  Lines       14174    15592    +1418     
  Branches     1908     2149     +241     
==========================================
+ Hits         8022     8829     +807     
- Misses       5563     6109     +546     
- Partials      589      654      +65     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Enkidu93
Copy link
Collaborator Author

Note: I'm realizing I'm accidentally mixing across corpus pairs (easy to fix) but in case someone gets to looking at this before I make that change, I'm aware.

@Enkidu93
Copy link
Collaborator Author

@ddaspit I pushed a proof-of-concept. If you look at the PreprocessBuildJob, I wrote a second WriteDataFilesAsync function that's public with the updated logic. I tested it in the PreprocessBuildJobTests. Let me know if this looks like the right direction, and I'll pursue it/clean it up/fix the grpc data models etc.

@ddaspit
Copy link
Contributor

ddaspit commented Sep 30, 2024

I took a quick look. This looks like a good direction. You should continue down this path.

… should be written 'TODO')

Note: With this new implementation, text marked for pretranslation will be pretranslated even if there's target text for the same verses. I believe this is not an issue (and maybe the previous implementation was a mistake) given the USFM source text preference options.
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not finished reviewing yet, but here is the first batch of suggestions. So far this is looking really good.

Reviewed 38 of 44 files at r1, 6 of 8 files at r2, 12 of 25 files at r4, all commit messages.
Reviewable status: 52 of 65 files reviewed, 18 unresolved discussions (waiting on @Enkidu93)


src/Serval/src/Serval.DataFiles/Configuration/IMongoDataAccessConfiguratorExtensions.cs line 27 at r4 (raw file):

    }

    public static IMongoDataAccessConfigurator AddCorporaRepository(this IMongoDataAccessConfigurator configurator)

This can be merged into the AddDataFilesRepositories.


src/Serval/src/Serval.DataFiles/Configuration/IServalBuilderExtensions.cs line 17 at r4 (raw file):

    }

    public static IServalBuilder AddCorpora(this IServalBuilder builder)

This can be merged into the AddDataFiles method.


src/Serval/src/Serval.Translation/Contracts/ParallelCorpusDto.cs line 0 at r4 (raw file):
I don't believe this file is used anywhere.


src/Serval/src/Serval.Translation/Contracts/ParallelCorpusFilterConfigDto.cs line 5 at r4 (raw file):

public record ParallelCorpusFilterConfigDto
{
    public required string CorpusRef { get; init; }

This should be named CorpusId.


src/Serval/src/Serval.Translation/Contracts/PretranslateCorpusConfigDto.cs line 13 at r4 (raw file):

    public string? ParallelCorpusId { get; init; }
    public IReadOnlyList<ParallelCorpusFilterConfigDto>? SourceFilters { get; init; }
    public IReadOnlyList<ParallelCorpusFilterConfigDto>? TargetFilters { get; init; }

The TargetFilters property isn't necessary.


src/Serval/src/Serval.Translation/Contracts/PretranslateCorpusDto.cs line 13 at r4 (raw file):

    public ResourceLinkDto? ParallelCorpus { get; init; }
    public IReadOnlyList<ParallelCorpusFilterDto>? SourceFilters { get; init; }
    public IReadOnlyList<ParallelCorpusFilterDto>? TargetFilters { get; init; }

The TargetFilters property isn't necessary.


src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusConfigDto.cs line 10 at r4 (raw file):

    public string? Name { get; init; }

    public required IReadOnlyList<string> SourceCorporaRefs { get; init; } = new List<string>();

I would rename this to SourceCorpusIds.


src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusDto.cs line 9 at r4 (raw file):

    public required ResourceLinkDto Engine { get; init; }

    public required IReadOnlyList<TranslationMonolingualCorpusDto> SourceCorpora { get; init; } =

This should be a list of ResourceLinkDtos.


src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusDto.cs line 10 at r4 (raw file):

    public required IReadOnlyList<TranslationMonolingualCorpusDto> SourceCorpora { get; init; } =
        new List<TranslationMonolingualCorpusDto>();

For required properties with init setters, it isn't necessary to supply a default value. It will just result in an unnecessary list allocation.


src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusUpdateDto.cs line 7 at r4 (raw file):

public record TranslationParallelCorpusUpdateConfigDto : IValidatableObject
{
    public IReadOnlyList<string>? SourceCorpusRefs { get; init; }

I would rename this to SourceCorpusIds.


src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 558 at r4 (raw file):

        await AuthorizeAsync(engine);
        Corpus corpus = await MapAsync(getDataFileClient, idGenerator.GenerateId(), corpusConfig, cancellationToken);
        await _engineService.AddCorpusAsync(id, corpus, cancellationToken);

Should this be adding a ParallelCorpus?


src/Serval/src/Serval.Translation/Models/PretranslateCorpus.cs line 11 at r4 (raw file):

    public string? ParallelCorpusRef { get; set; }
    public IReadOnlyList<ParallelCorpusFilter>? SourceFilters { get; set; }
    public IReadOnlyList<ParallelCorpusFilter>? TargetFilters { get; set; } //TODO is this needed?

We don't need this.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 419 at r4 (raw file):

    }

    public Task AddParallelCorpus(

This should be named AddParallelCorpusAsync.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 425 at r4 (raw file):

    )
    {
        return Entities.UpdateAsync(

Do we want to disallow adding both a ParallelCorpus and a Corpus to the same engine? I didn't see a check for this anywhere.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 597 at r4 (raw file):

    private V1.ParallelCorpus Map(Corpus source, TrainingCorpus? trainingCorpus, PretranslateCorpus? pretranslateCorpus)
    {
        var sourceFiles = source.SourceFiles.Select(Map);

Please specify the type when it isn't explicit elsewhere in the line.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 599 at r4 (raw file):

        var sourceFiles = source.SourceFiles.Select(Map);
        var targetFiles = source.TargetFiles.Select(Map);
        var sourceCorpus = new V1.MonolingualCorpus

I know our codebase is inconsistent about where the type is specified when creating a new object instance, but I've been trying to shift the codebase to specify the type on the left-hand side instead of using var whenever I change anything.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 731 at r4 (raw file):

    )
    {
        var trainOnChapters =

This is probably a bit too much to put on one line of code. Can you refactor this to use an if statement?


src/Serval/src/Serval.Translation/Services/EngineService.cs line 745 at r4 (raw file):

                : null;

        var pretranslateChapters =

See my comment above.

@Enkidu93 Enkidu93 marked this pull request as ready for review October 4, 2024 13:36
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 25 files at r4.
Reviewable status: all files reviewed, 20 unresolved discussions (waiting on @Enkidu93)


src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 134 at r4 (raw file):

                .ToArray();
            ITextCorpus[] sourceTrainingCorpora = sourceCorpora
                .Select(sc =>

This is really hard for me to follow. It would help readability if we merge the two Select calls into one and break up the complex conditional in the second Select into functions. There seems to be some overlap between the logic in the training corpora and the pretranslate corpora that could be shared by refactoring into functions.


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 811 at r4 (raw file):

        }

        public static ParallelCorpus TextFileCorpus(

You can move these static methods into the test fixture class.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thank you! Sorry it's taken so long.

Reviewable status: 48 of 65 files reviewed, 20 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 811 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can move these static methods into the test fixture class.

I can move these but it's a bit more complicated than just moving these functions since they rely on ParatextFile and TextFile which rely on the _tempDir and TestDataPath property. Do you still want me to do that?


src/Serval/src/Serval.DataFiles/Configuration/IMongoDataAccessConfiguratorExtensions.cs line 27 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This can be merged into the AddDataFilesRepositories.

Done.


src/Serval/src/Serval.DataFiles/Configuration/IServalBuilderExtensions.cs line 17 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This can be merged into the AddDataFiles method.

Done.


src/Serval/src/Serval.Translation/Contracts/ParallelCorpusFilterConfigDto.cs line 5 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be named CorpusId.

Done. In general, what's the pattern you're going for in Serval for when to use...Id vs. ...Ref? It didn't seem wholly consistent. My impression was that we were using Id when it's in reference to the identifier for that object itself but Refwhen it's referring to another object's id. Is that about it?


src/Serval/src/Serval.Translation/Contracts/PretranslateCorpusConfigDto.cs line 13 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The TargetFilters property isn't necessary.

Done.


src/Serval/src/Serval.Translation/Contracts/PretranslateCorpusDto.cs line 13 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The TargetFilters property isn't necessary.

Done.


src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusConfigDto.cs line 10 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would rename this to SourceCorpusIds.

Done.


src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusDto.cs line 9 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be a list of ResourceLinkDtos.

OK, done.


src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusDto.cs line 10 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

For required properties with init setters, it isn't necessary to supply a default value. It will just result in an unnecessary list allocation.

Oops, sorry - I think I previously didn't have it as required with an init and didn't change it. Thanks!


src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusUpdateDto.cs line 7 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would rename this to SourceCorpusIds.

Done.


src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 558 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Should this be adding a ParallelCorpus?

Oops. Done.


src/Serval/src/Serval.Translation/Models/PretranslateCorpus.cs line 11 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We don't need this.

Done.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 419 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be named AddParallelCorpusAsync.

Done.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 425 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Do we want to disallow adding both a ParallelCorpus and a Corpus to the same engine? I didn't see a check for this anywhere.

That's a good question: I didn't know what we wanted in that regard. Ultimately, I know the plan is to deprecate and then remove the old corpus functionality, but I wasn't sure if there would be a reason to allow for both on a single engine in the meantime. Maybe for testing/transitioning? I.e., make an engine that has each, run both jobs, and confirm that the behavior is the same? Or add the equivalent parallel corpus and rerun and confirm before switching things over? @johnml1135, do you have any thoughts? There will be some small changes needed either way.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 597 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Please specify the type when it isn't explicit elsewhere in the line.

Done.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 599 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I know our codebase is inconsistent about where the type is specified when creating a new object instance, but I've been trying to shift the codebase to specify the type on the left-hand side instead of using var whenever I change anything.

Done. If you notice more, let me know. I think I used some but mainly in the tests. I agree though that the type on the left is superior 😎, so I'm glad we're moving to that + new() as standard instead of var.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 731 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is probably a bit too much to put on one line of code. Can you refactor this to use an if statement?

Done. Yeah, it grew over time 😆.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 745 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

See my comment above.

Done.


src/Serval/src/Serval.Translation/Contracts/ParallelCorpusDto.cs line at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't believe this file is used anywhere.

Done.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 48 of 65 files reviewed, 20 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 134 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is really hard for me to follow. It would help readability if we merge the two Select calls into one and break up the complex conditional in the second Select into functions. There seems to be some overlap between the logic in the training corpora and the pretranslate corpora that could be shared by refactoring into functions.

Let me know if this is clear enough now.

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Oct 4, 2024

Also, I realize more testing is needed. Would you like that to be a separate PR or should I go ahead and do that now?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add the tests in this PR.

Reviewed 17 of 17 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)


src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 134 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Let me know if this is clear enough now.

This is much clearer. Good job.


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 811 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I can move these but it's a bit more complicated than just moving these functions since they rely on ParatextFile and TextFile which rely on the _tempDir and TestDataPath property. Do you still want me to do that?

It looks like you can move TextFileCorpus, TextFile, and TestDataPath, then you don't have to reference them by the TestEnvironment class name. This isn't really a big deal, so it is fine if you don't want to bother with it.


src/Serval/src/Serval.Translation/Contracts/ParallelCorpusFilterConfigDto.cs line 5 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Done. In general, what's the pattern you're going for in Serval for when to use...Id vs. ...Ref? It didn't seem wholly consistent. My impression was that we were using Id when it's in reference to the identifier for that object itself but Refwhen it's referring to another object's id. Is that about it?

We use Id in DTOs and Ref in database models. Mongo doesn't really have the concept of foreign keys like you see in SQL DBs, so I use Ref as a way to indicate that a field is a foreign key. There is also a class map convention that ensures that the type of Ref properties is an ObjectId.


src/Serval/src/Serval.Translation/Services/EngineService.cs line 425 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

That's a good question: I didn't know what we wanted in that regard. Ultimately, I know the plan is to deprecate and then remove the old corpus functionality, but I wasn't sure if there would be a reason to allow for both on a single engine in the meantime. Maybe for testing/transitioning? I.e., make an engine that has each, run both jobs, and confirm that the behavior is the same? Or add the equivalent parallel corpus and rerun and confirm before switching things over? @johnml1135, do you have any thoughts? There will be some small changes needed either way.

SF will need to migrate engines from the old API to the new API. During this transition, it might be simpler if engines can have both, so maybe we should just allow it for now.

* Call 'GetAsksById' once per DoWork
Fix bug (? no testing to verify, but I think some build jobs were getting double counted)

* Get queue by ID - take time from 3-5 seconds per query to < 100ms.

* Address review comments

* Address reviewer comments

---------

Co-authored-by: John Lambert <[email protected]>
@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Oct 9, 2024

Reviewed 3 of 3 files at r7, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

src/Serval/test/Serval.ApiServer.IntegrationTests/TranslationEngineTests.cs line 808 at r7 (raw file):

    [TestCase(new[] { Scopes.UpdateTranslationEngines }, 404, DOES_NOT_EXIST_ENGINE_ID)]
    [TestCase(new[] { Scopes.ReadFiles }, 403, ECHO_ENGINE1_ID)] //Arbitrary unrelated privilege
    public async Task AddParallelCorpusToEngineByIdAsync(

I find using test cases in this way makes the tests difficult to follow. Can you refactor these test cases into separate tests?

I know you do; I'm sorry. I only wrote them like this for consistency with the existing corpus tests - this way, they're totally parallel and we can be certain we're testing the same breadth of things. But sure, given that the non-parallel-corpus-style corpus tests will eventually be removed, I can refactor these.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 24 of 44 files at r1, 6 of 8 files at r2, 18 of 25 files at r4, 17 of 17 files at r5.
Reviewable status: 62 of 68 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


src/Serval/test/Serval.ApiServer.IntegrationTests/TranslationEngineTests.cs line 808 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I know you do; I'm sorry. I only wrote them like this for consistency with the existing corpus tests - this way, they're totally parallel and we can be certain we're testing the same breadth of things. But sure, given that the non-parallel-corpus-style corpus tests will eventually be removed, I can refactor these.

Alrighty, done! I also added a few more tests to improve coverage. At some point, I'd like to go back and add additional testing in general.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 62 of 68 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


src/Serval/src/Serval.Translation/Services/EngineService.cs line 425 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Since this will only be temporary functionality, I don't see why extra checks are needed.

Alright, we should be good to go then.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work.

Reviewed 4 of 6 files at r8, 3 of 3 files at r9, 2 of 2 files at r10, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@johnml1135
Copy link
Collaborator

@Enkidu93 - Can you rebase the files with the recent updates?

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Sorry it was so long coming.

Yes, I'll go ahead and do that, John!

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93 Enkidu93 merged commit 93ca485 into main Oct 9, 2024
2 of 3 checks passed
@Enkidu93 Enkidu93 deleted the mixed_source_support branch October 9, 2024 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants