-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serval-side mixed source support #497
Conversation
Note: I'm realizing I'm accidentally mixing across corpus pairs (easy to fix) but in case someone gets to looking at this before I make that change, I'm aware. |
@ddaspit I pushed a proof-of-concept. If you look at the |
I took a quick look. This looks like a good direction. You should continue down this path. |
… should be written 'TODO') Note: With this new implementation, text marked for pretranslation will be pretranslated even if there's target text for the same verses. I believe this is not an issue (and maybe the previous implementation was a mistake) given the USFM source text preference options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not finished reviewing yet, but here is the first batch of suggestions. So far this is looking really good.
Reviewed 38 of 44 files at r1, 6 of 8 files at r2, 12 of 25 files at r4, all commit messages.
Reviewable status: 52 of 65 files reviewed, 18 unresolved discussions (waiting on @Enkidu93)
src/Serval/src/Serval.DataFiles/Configuration/IMongoDataAccessConfiguratorExtensions.cs
line 27 at r4 (raw file):
} public static IMongoDataAccessConfigurator AddCorporaRepository(this IMongoDataAccessConfigurator configurator)
This can be merged into the AddDataFilesRepositories
.
src/Serval/src/Serval.DataFiles/Configuration/IServalBuilderExtensions.cs
line 17 at r4 (raw file):
} public static IServalBuilder AddCorpora(this IServalBuilder builder)
This can be merged into the AddDataFiles
method.
src/Serval/src/Serval.Translation/Contracts/ParallelCorpusDto.cs
line 0 at r4 (raw file):
I don't believe this file is used anywhere.
src/Serval/src/Serval.Translation/Contracts/ParallelCorpusFilterConfigDto.cs
line 5 at r4 (raw file):
public record ParallelCorpusFilterConfigDto { public required string CorpusRef { get; init; }
This should be named CorpusId
.
src/Serval/src/Serval.Translation/Contracts/PretranslateCorpusConfigDto.cs
line 13 at r4 (raw file):
public string? ParallelCorpusId { get; init; } public IReadOnlyList<ParallelCorpusFilterConfigDto>? SourceFilters { get; init; } public IReadOnlyList<ParallelCorpusFilterConfigDto>? TargetFilters { get; init; }
The TargetFilters
property isn't necessary.
src/Serval/src/Serval.Translation/Contracts/PretranslateCorpusDto.cs
line 13 at r4 (raw file):
public ResourceLinkDto? ParallelCorpus { get; init; } public IReadOnlyList<ParallelCorpusFilterDto>? SourceFilters { get; init; } public IReadOnlyList<ParallelCorpusFilterDto>? TargetFilters { get; init; }
The TargetFilters
property isn't necessary.
src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusConfigDto.cs
line 10 at r4 (raw file):
public string? Name { get; init; } public required IReadOnlyList<string> SourceCorporaRefs { get; init; } = new List<string>();
I would rename this to SourceCorpusIds
.
src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusDto.cs
line 9 at r4 (raw file):
public required ResourceLinkDto Engine { get; init; } public required IReadOnlyList<TranslationMonolingualCorpusDto> SourceCorpora { get; init; } =
This should be a list of ResourceLinkDto
s.
src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusDto.cs
line 10 at r4 (raw file):
public required IReadOnlyList<TranslationMonolingualCorpusDto> SourceCorpora { get; init; } = new List<TranslationMonolingualCorpusDto>();
For required
properties with init
setters, it isn't necessary to supply a default value. It will just result in an unnecessary list allocation.
src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusUpdateDto.cs
line 7 at r4 (raw file):
public record TranslationParallelCorpusUpdateConfigDto : IValidatableObject { public IReadOnlyList<string>? SourceCorpusRefs { get; init; }
I would rename this to SourceCorpusIds
.
src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs
line 558 at r4 (raw file):
await AuthorizeAsync(engine); Corpus corpus = await MapAsync(getDataFileClient, idGenerator.GenerateId(), corpusConfig, cancellationToken); await _engineService.AddCorpusAsync(id, corpus, cancellationToken);
Should this be adding a ParallelCorpus
?
src/Serval/src/Serval.Translation/Models/PretranslateCorpus.cs
line 11 at r4 (raw file):
public string? ParallelCorpusRef { get; set; } public IReadOnlyList<ParallelCorpusFilter>? SourceFilters { get; set; } public IReadOnlyList<ParallelCorpusFilter>? TargetFilters { get; set; } //TODO is this needed?
We don't need this.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 419 at r4 (raw file):
} public Task AddParallelCorpus(
This should be named AddParallelCorpusAsync
.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 425 at r4 (raw file):
) { return Entities.UpdateAsync(
Do we want to disallow adding both a ParallelCorpus
and a Corpus
to the same engine? I didn't see a check for this anywhere.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 597 at r4 (raw file):
private V1.ParallelCorpus Map(Corpus source, TrainingCorpus? trainingCorpus, PretranslateCorpus? pretranslateCorpus) { var sourceFiles = source.SourceFiles.Select(Map);
Please specify the type when it isn't explicit elsewhere in the line.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 599 at r4 (raw file):
var sourceFiles = source.SourceFiles.Select(Map); var targetFiles = source.TargetFiles.Select(Map); var sourceCorpus = new V1.MonolingualCorpus
I know our codebase is inconsistent about where the type is specified when creating a new object instance, but I've been trying to shift the codebase to specify the type on the left-hand side instead of using var
whenever I change anything.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 731 at r4 (raw file):
) { var trainOnChapters =
This is probably a bit too much to put on one line of code. Can you refactor this to use an if
statement?
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 745 at r4 (raw file):
: null; var pretranslateChapters =
See my comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 25 files at r4.
Reviewable status: all files reviewed, 20 unresolved discussions (waiting on @Enkidu93)
src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs
line 134 at r4 (raw file):
.ToArray(); ITextCorpus[] sourceTrainingCorpora = sourceCorpora .Select(sc =>
This is really hard for me to follow. It would help readability if we merge the two Select
calls into one and break up the complex conditional in the second Select
into functions. There seems to be some overlap between the logic in the training corpora and the pretranslate corpora that could be shared by refactoring into functions.
src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs
line 811 at r4 (raw file):
} public static ParallelCorpus TextFileCorpus(
You can move these static methods into the test fixture class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Thank you! Sorry it's taken so long.
Reviewable status: 48 of 65 files reviewed, 20 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs
line 811 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You can move these static methods into the test fixture class.
I can move these but it's a bit more complicated than just moving these functions since they rely on ParatextFile
and TextFile
which rely on the _tempDir
and TestDataPath
property. Do you still want me to do that?
src/Serval/src/Serval.DataFiles/Configuration/IMongoDataAccessConfiguratorExtensions.cs
line 27 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This can be merged into the
AddDataFilesRepositories
.
Done.
src/Serval/src/Serval.DataFiles/Configuration/IServalBuilderExtensions.cs
line 17 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This can be merged into the
AddDataFiles
method.
Done.
src/Serval/src/Serval.Translation/Contracts/ParallelCorpusFilterConfigDto.cs
line 5 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This should be named
CorpusId
.
Done. In general, what's the pattern you're going for in Serval for when to use...Id
vs. ...Ref
? It didn't seem wholly consistent. My impression was that we were using Id
when it's in reference to the identifier for that object itself but Ref
when it's referring to another object's id. Is that about it?
src/Serval/src/Serval.Translation/Contracts/PretranslateCorpusConfigDto.cs
line 13 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
The
TargetFilters
property isn't necessary.
Done.
src/Serval/src/Serval.Translation/Contracts/PretranslateCorpusDto.cs
line 13 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
The
TargetFilters
property isn't necessary.
Done.
src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusConfigDto.cs
line 10 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would rename this to
SourceCorpusIds
.
Done.
src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusDto.cs
line 9 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This should be a list of
ResourceLinkDto
s.
OK, done.
src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusDto.cs
line 10 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
For
required
properties withinit
setters, it isn't necessary to supply a default value. It will just result in an unnecessary list allocation.
Oops, sorry - I think I previously didn't have it as required with an init and didn't change it. Thanks!
src/Serval/src/Serval.Translation/Contracts/TranslationParallelCorpusUpdateDto.cs
line 7 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would rename this to
SourceCorpusIds
.
Done.
src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs
line 558 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Should this be adding a
ParallelCorpus
?
Oops. Done.
src/Serval/src/Serval.Translation/Models/PretranslateCorpus.cs
line 11 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We don't need this.
Done.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 419 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This should be named
AddParallelCorpusAsync
.
Done.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 425 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Do we want to disallow adding both a
ParallelCorpus
and aCorpus
to the same engine? I didn't see a check for this anywhere.
That's a good question: I didn't know what we wanted in that regard. Ultimately, I know the plan is to deprecate and then remove the old corpus functionality, but I wasn't sure if there would be a reason to allow for both on a single engine in the meantime. Maybe for testing/transitioning? I.e., make an engine that has each, run both jobs, and confirm that the behavior is the same? Or add the equivalent parallel corpus and rerun and confirm before switching things over? @johnml1135, do you have any thoughts? There will be some small changes needed either way.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 597 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Please specify the type when it isn't explicit elsewhere in the line.
Done.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 599 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I know our codebase is inconsistent about where the type is specified when creating a new object instance, but I've been trying to shift the codebase to specify the type on the left-hand side instead of using
var
whenever I change anything.
Done. If you notice more, let me know. I think I used some but mainly in the tests. I agree though that the type on the left is superior 😎, so I'm glad we're moving to that + new()
as standard instead of var
.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 731 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This is probably a bit too much to put on one line of code. Can you refactor this to use an
if
statement?
Done. Yeah, it grew over time 😆.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 745 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
See my comment above.
Done.
src/Serval/src/Serval.Translation/Contracts/ParallelCorpusDto.cs
line at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I don't believe this file is used anywhere.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 48 of 65 files reviewed, 20 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs
line 134 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This is really hard for me to follow. It would help readability if we merge the two
Select
calls into one and break up the complex conditional in the secondSelect
into functions. There seems to be some overlap between the logic in the training corpora and the pretranslate corpora that could be shared by refactoring into functions.
Let me know if this is clear enough now.
Also, I realize more testing is needed. Would you like that to be a separate PR or should I go ahead and do that now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add the tests in this PR.
Reviewed 17 of 17 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs
line 134 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Let me know if this is clear enough now.
This is much clearer. Good job.
src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs
line 811 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
I can move these but it's a bit more complicated than just moving these functions since they rely on
ParatextFile
andTextFile
which rely on the_tempDir
andTestDataPath
property. Do you still want me to do that?
It looks like you can move TextFileCorpus
, TextFile
, and TestDataPath
, then you don't have to reference them by the TestEnvironment
class name. This isn't really a big deal, so it is fine if you don't want to bother with it.
src/Serval/src/Serval.Translation/Contracts/ParallelCorpusFilterConfigDto.cs
line 5 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Done. In general, what's the pattern you're going for in Serval for when to use
...Id
vs....Ref
? It didn't seem wholly consistent. My impression was that we were usingId
when it's in reference to the identifier for that object itself butRef
when it's referring to another object's id. Is that about it?
We use Id
in DTOs and Ref
in database models. Mongo doesn't really have the concept of foreign keys like you see in SQL DBs, so I use Ref
as a way to indicate that a field is a foreign key. There is also a class map convention that ensures that the type of Ref
properties is an ObjectId
.
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 425 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
That's a good question: I didn't know what we wanted in that regard. Ultimately, I know the plan is to deprecate and then remove the old corpus functionality, but I wasn't sure if there would be a reason to allow for both on a single engine in the meantime. Maybe for testing/transitioning? I.e., make an engine that has each, run both jobs, and confirm that the behavior is the same? Or add the equivalent parallel corpus and rerun and confirm before switching things over? @johnml1135, do you have any thoughts? There will be some small changes needed either way.
SF will need to migrate engines from the old API to the new API. During this transition, it might be simpler if engines can have both, so maybe we should just allow it for now.
* Call 'GetAsksById' once per DoWork Fix bug (? no testing to verify, but I think some build jobs were getting double counted) * Get queue by ID - take time from 3-5 seconds per query to < 100ms. * Address review comments * Address reviewer comments --------- Co-authored-by: John Lambert <[email protected]>
I know you do; I'm sorry. I only wrote them like this for consistency with the existing corpus tests - this way, they're totally parallel and we can be certain we're testing the same breadth of things. But sure, given that the non-parallel-corpus-style corpus tests will eventually be removed, I can refactor these. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 24 of 44 files at r1, 6 of 8 files at r2, 18 of 25 files at r4, 17 of 17 files at r5.
Reviewable status: 62 of 68 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
src/Serval/test/Serval.ApiServer.IntegrationTests/TranslationEngineTests.cs
line 808 at r7 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
I know you do; I'm sorry. I only wrote them like this for consistency with the existing corpus tests - this way, they're totally parallel and we can be certain we're testing the same breadth of things. But sure, given that the non-parallel-corpus-style corpus tests will eventually be removed, I can refactor these.
Alrighty, done! I also added a few more tests to improve coverage. At some point, I'd like to go back and add additional testing in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 62 of 68 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 425 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Since this will only be temporary functionality, I don't see why extra checks are needed.
Alright, we should be good to go then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work.
Reviewed 4 of 6 files at r8, 3 of 3 files at r9, 2 of 2 files at r10, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
@Enkidu93 - Can you rebase the files with the recent updates? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Sorry it was so long coming.
Yes, I'll go ahead and do that, John!
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
… should be written 'TODO') Note: With this new implementation, text marked for pretranslation will be pretranslated even if there's target text for the same verses. I believe this is not an issue (and maybe the previous implementation was a mistake) given the USFM source text preference options.
…rval into mixed_source_support
THIS IS A DRAFT
There are still tests to add, refactoring to do, and documentation to add as well as a relatively small change to the Machine repo to allow it to handle multiple target files appropriately, but I think you can see the idea. I'll set this aside until I have your feedback and either your OK to wrap things up or your instructions to jump ship and use a different API design. That being said, I may work on the change to machine since I believe that will be needed regardless.
This change is