Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use terms localizations #233

Merged
merged 7 commits into from
Aug 22, 2024
Merged

Use terms localizations #233

merged 7 commits into from
Aug 22, 2024

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Aug 15, 2024

Fixes sillsdev/serval#240

Parallel PR: sillsdev/serval#454


This change is Reviewable

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 48 at r1 (raw file):

                bool useTermsRenderingXml = !preferTermsLocalization && termsFileEntry != null;

                if (!SupportedLanguageTermsLocalizationXmls.TryGetValue(languageCode, out string resourceName))

If languageCode is null, we should use the language code specified in the PT settings file.

Also, the localizations only apply to the Major Biblical Terms list.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 171 at r1 (raw file):

        {
            //If entire term rendering is surrounded in square brackets, remove them
            Regex rx = new Regex(@"^\[(.+?)\]$", RegexOptions.Compiled);

We should take the opportunity to make this regex a constant.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 13 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 48 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

If languageCode is null, we should use the language code specified in the PT settings file.

Also, the localizations only apply to the Major Biblical Terms list.

I've gone ahead and just used the settings language code. Let me know if you think I should also provide a route to specify a code. It seemed to me that one or the other would simplify things and be cleaner.

Thank you, yes! Done.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 171 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should take the opportunity to make this regex a constant.

Done. Also, with the other. Also, uncommented a line that I accidentally left commented..

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to rework the testing a little bit since it's awkward to test everything I need to. I'll ping you know when I'm ready for more review.

Reviewable status: 8 of 13 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)

@codecov-commenter
Copy link

codecov-commenter commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 86.52174% with 31 lines in your changes missing coverage. Please review.

Project coverage is 69.80%. Comparing base (a15d24e) to head (45c7682).

Files Patch % Lines
....Machine/Corpora/ParatextProjectTermsParserBase.cs 85.94% 20 Missing and 6 partials ⚠️
...L.Machine/Corpora/ZipParatextProjectTermsParser.cs 76.92% 1 Missing and 2 partials ⚠️
...chine/Corpora/ParatextProjectSettingsParserBase.cs 84.61% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #233      +/-   ##
==========================================
+ Coverage   69.66%   69.80%   +0.13%     
==========================================
  Files         377      379       +2     
  Lines       31377    31478     +101     
  Branches     4391     4399       +8     
==========================================
+ Hits        21860    21974     +114     
+ Misses       8498     8483      -15     
- Partials     1019     1021       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit, here's one solution. Let me know what you think. There are other options to make testing easier, but this seemed like the cleanest. I changed the name for the terms corpus to make it more consistent with the newer classes operating on Paratext projects. Do you think that's appropriate? If so, should I change the text corpus as well? A similar strategy could be used there to make testing more flexible. If that does happen, there might also be a way to change the inheritance/implementing to make it overall cleaner across classes that interact with Paratext projects.

Reviewable status: 8 of 16 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 5 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/MemoryParatextProjectTermsCorpus.cs line 7 at r3 (raw file):

namespace SIL.Machine.Corpora
{
    public class MemoryParatextProjectTermsCorpus : ParatextProjectTermsCorpusBase

This feels like a test mock and not a practically useful class. I would move it to the test library.


src/SIL.Machine/Corpora/ZipParatextProjectTermsCorpus.cs line 7 at r3 (raw file):

namespace SIL.Machine.Corpora
{
    public class ZipParatextProjectTermsCorpus : ParatextProjectTermsCorpusBase

I would prefer to keep the same name with the same constructor parameters so that it isn't a breaking change. The base class can be renamed to ParatextTermsCorpusBase to stay consistent.


src/SIL.Machine/Corpora/ParatextProjectTermsCorpusBase.cs line 37 at r3 (raw file):

        private static readonly Regex NumericalInformationRegex = new Regex(@"\s+\d+(\.\d+)*$", RegexOptions.Compiled);

        public ParatextProjectTermsCorpusBase(ParatextProjectSettings settings)

Constructors for abstract base classes should be protected.


src/SIL.Machine/Corpora/ParatextProjectTermsCorpusBase.cs line 42 at r3 (raw file):

        }

        protected void AddTexts(IEnumerable<string> termCategories, bool preferTermsLocalization = false)

It would be cleaner to move the logic in this method to the constructor.

I would also change the behavior, so that it uses the rendering if it exists for a term and falls back to the gloss if it doesn't. I would rename preferTermsLocalization to fallbackToTermGloss to match the behavior change.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit , I'm assuming from your lack of general comments that you do think this an appropriate way to enable better testing; is that correct?

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ZipParatextProjectTermsCorpus.cs line 7 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would prefer to keep the same name with the same constructor parameters so that it isn't a breaking change. The base class can be renamed to ParatextTermsCorpusBase to stay consistent.

OK, what about the ZipParatext... classes? Should we perhaps use the 'Backup' naming scheme there too?


src/SIL.Machine/Corpora/ParatextProjectTermsCorpusBase.cs line 42 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It would be cleaner to move the logic in this method to the constructor.

I would also change the behavior, so that it uses the rendering if it exists for a term and falls back to the gloss if it doesn't. I would rename preferTermsLocalization to fallbackToTermGloss to match the behavior change.

It was previously in the constructor. The issue is that in the zip child class, the _archive field needs to be set before running the constructor logic. Thoughts?

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine/Corpora/ParatextProjectTermsCorpusBase.cs line 42 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

It was previously in the constructor. The issue is that in the zip child class, the _archive field needs to be set before running the constructor logic. Thoughts?

Also, regarding the functionality change, I get what you're saying (move things to the term-level basically) but I don't understand the name. Wouldn't you always want to fall back to the gloss if there isn't a rendering and there is a gloss? If not, then would it make more sense to control it from a variable with a name more like useTermsLocalizations?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 113 at r3 (raw file):

                );
            }
            string languageCode = languageIsoCodeSettingParts[0];

So we want to crash if there is no language code? It makes general sense - but I wonder about all paratext projects. Can we test a bunch to make sure that we are not excluding some DBL resources that don't have the code set? Can we set it to a meaningless value instead? Should we?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your approach for testing is sound.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ZipParatextProjectTermsCorpus.cs line 7 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK, what about the ZipParatext... classes? Should we perhaps use the 'Backup' naming scheme there too?

Yes, this class should be called ParatextBackupTermsCorpus, so that it has the same name as before.


src/SIL.Machine/Corpora/ParatextProjectTermsCorpusBase.cs line 42 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Also, regarding the functionality change, I get what you're saying (move things to the term-level basically) but I don't understand the name. Wouldn't you always want to fall back to the gloss if there isn't a rendering and there is a gloss? If not, then would it make more sense to control it from a variable with a name more like useTermsLocalizations?

Of course, I see the issue now. I would just get rid of the constructor altogether and pass the settings directly into the AddTexts method.

I do think there are cases where you wouldn't want to use the term glosses. The name is fine. I would just refer to the "localizations" as "glosses", i.e. useTermGlosses.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 113 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So we want to crash if there is no language code? It makes general sense - but I wonder about all paratext projects. Can we test a bunch to make sure that we are not excluding some DBL resources that don't have the code set? Can we set it to a meaningless value instead? Should we?

A Paratext project should have a defined language code setting. Of course, we have learned that anything can happen with a Paratext project, so I don't have a problem with setting the language code to null if it isn't defined. We will need to check for null when we use it.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 10 files at r1, 3 of 5 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 113 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

A Paratext project should have a defined language code setting. Of course, we have learned that anything can happen with a Paratext project, so I don't have a problem with setting the language code to null if it isn't defined. We will need to check for null when we use it.

And add a test to make sure it is handled gracefully. I believe it would be better to continue on than to crash in this scenario.

@Enkidu93 Enkidu93 requested a review from johnml1135 August 19, 2024 18:53
Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 113 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

And add a test to make sure it is handled gracefully. I believe it would be better to continue on than to crash in this scenario.

Alright.


src/SIL.Machine/Corpora/ZipParatextProjectTermsCorpus.cs line 7 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, this class should be called ParatextBackupTermsCorpus, so that it has the same name as before.

I understand that. I'm asking about the inconsistency of this name with ZipParatextProjectTextUpdater, ZipParatextProjectSettingsParser, and all the related classes. The reason I changed the name was to make it more consistent with them. Happy to change it back, but I'm wondering if we should also change the above classes to ParatextBackupTextUpdater etc. We kinda have two naming schemes going.


src/SIL.Machine/Corpora/ParatextProjectTermsCorpusBase.cs line 42 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Of course, I see the issue now. I would just get rid of the constructor altogether and pass the settings directly into the AddTexts method.

I do think there are cases where you wouldn't want to use the term glosses. The name is fine. I would just refer to the "localizations" as "glosses", i.e. useTermGlosses.

OK.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ZipParatextProjectTermsCorpus.cs line 7 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I understand that. I'm asking about the inconsistency between ZipParatextProjectTextUpdater, ZipParatextProjectSettingsParser, and all the related classes. The reason I changed the name was to make it more consistent with them. Happy to change it back, but I'm wondering if we should also change the above classes to ParatextBackupTextUpdater etc. We kinda have two naming schemes going.

You can leave the names the way they are. It is unfortunate that they follow different naming schemes, but I do think of the corpus classes and the parser/updater classes differently. The corpus classes are higher-level classes that a developer/researcher would consume when creating an NLP pipeline, so their names are intended to clearly communicate the type of corpus. For example, the name ParatextBackupTextCorpus indicates to the user that this is the class to use when he wants to read a Paratext project backup file. The parser/updater classes are lower-level building blocks that are used to implement the higher-level classes, so their names reflect that.

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine/Corpora/ZipParatextProjectTermsCorpus.cs line 7 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can leave the names the way they are. It is unfortunate that they follow different naming schemes, but I do think of the corpus classes and the parser/updater classes differently. The corpus classes are higher-level classes that a developer/researcher would consume when creating an NLP pipeline, so their names are intended to clearly communicate the type of corpus. For example, the name ParatextBackupTextCorpus indicates to the user that this is the class to use when he wants to read a Paratext project backup file. The parser/updater classes are lower-level building blocks that are used to implement the higher-level classes, so their names reflect that.

OK, that's fair. Backup is more user-friendly maybe. I do wonder what harm there is in changing the Zip... classes to use the Backup naming scheme, but it's certainly not pressing. There is some marked similarity at this point though now between the structure of those classes and these.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 113 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Alright.

@ddaspit - you understand the range of Paratext projects better. What do you think the default behavior should be? We have no back communication (right now) other than crashing. We could also add a step in our onboarding procedure to make sure that the keyterms are set. We could also add an endpoint to do a "pre-build check" that would verify all of these things.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r4, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 113 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

@ddaspit - you understand the range of Paratext projects better. What do you think the default behavior should be? We have no back communication (right now) other than crashing. We could also add a step in our onboarding procedure to make sure that the keyterms are set. We could also add an endpoint to do a "pre-build check" that would verify all of these things.

Right now, we are only using it for term glosses. So, if the language code isn't set, we don't use glosses, which is what Eli implemented.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 12 at r4 (raw file):

        public ParatextBackupTermsCorpus(
            ZipArchive archive,

The constructor should accept the path to the zip file.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 14 at r4 (raw file):

            ZipArchive archive,
            IEnumerable<string> termCategories,
            bool preferTermsLocalization = false

This parameter should be called useTermGlosses.


src/SIL.Machine/Corpora/ParatextTermsCorpusBase.cs line 169 at r4 (raw file):

            //Prefer renderings to gloss localizations
            IDictionary<string, IReadOnlyList<string>> glosses = termsRenderings
                .Concat(termsGlosses.Where(kvp => !termsGlosses.ContainsKey(kvp.Key)))

I don't think the Where predicate is correct. Shouldn't it be !termRenderings.ContainsKey(kvp.Key).

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More fixes to come. Just wanted to commit what I had so far.

Reviewable status: 12 of 16 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 113 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

@ddaspit - you understand the range of Paratext projects better. What do you think the default behavior should be? We have no back communication (right now) other than crashing. We could also add a step in our onboarding procedure to make sure that the keyterms are set. We could also add an endpoint to do a "pre-build check" that would verify all of these things.

(For now, I'll allow it not to be set, barring a new development, but yes, if we do allow such a thing, it might be helpful to have some artifact that things weren't set - even just logging something)


src/SIL.Machine/Corpora/MemoryParatextProjectTermsCorpus.cs line 7 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This feels like a test mock and not a practically useful class. I would move it to the test library.

Done.


src/SIL.Machine/Corpora/ParatextProjectTermsCorpusBase.cs line 37 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Constructors for abstract base classes should be protected.

Done - in a sense haha

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, @ddaspit , that was a mid fix commit like I said above. I think I addressed most of your review comments besides the archive vs directory name issue. Let me know.

Reviewable status: 12 of 16 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Double sorry - apparently the message to wait on reviewing didn't go through earlier)

Reviewable status: 12 of 16 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 16 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 12 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The constructor should accept the path to the zip file.

What's the safest/most appropriate way to handle this? Make this class disposable or just open, close, and reopen the archive as needed when Exist and Open are called?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 16 files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 12 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

What's the safest/most appropriate way to handle this? Make this class disposable or just open, close, and reopen the archive as needed when Exist and Open are called?

Hmm, that is an issue. I would prefer that the class isn't disposable. You could refactor ParatextTermsCorpusBase into ParatextTermsParserBase. This would be patterned after the ParatextProjectSettingsParserBase. It would provide a Parse method that returns IEnumerable<(string, string)> You would then implement a ZipParatextTermsParserclass that is used byParatextBackupTermsCorpus`.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 12 of 16 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r6, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 17 at r6 (raw file):

            using (var archive = ZipFile.OpenRead(fileName))
            {
                ParatextProjectSettings settings = new ZipParatextProjectSettingsParser(archive).Parse();

I would just perform the settings parsing in the constructor of ZipParatextTermsParser. See ParatextProjectTextUpdaterBase for an example.


src/SIL.Machine/Corpora/ParatextTermsParserBase.cs line 37 at r6 (raw file):

        private static readonly Regex NumericalInformationRegex = new Regex(@"\s+\d+(\.\d+)*$", RegexOptions.Compiled);

        public IEnumerable<(string, IEnumerable<string>)> Parse(

I would make the return type IEnumerable<(string TermId, IReadOnlyList<string> Glosses)>.


tests/SIL.Machine.Tests/Corpora/ParatextTermsCorpusTests.cs line 8 at r6 (raw file):

[TestFixture]
public class ParatextTermsCorpusTests

Really what you are testing here is ParatextTermsParserBase. It would be better if you tested the MemoryParatextTermsParser class directly and just got rid of the ParatextProjectTermsCorpus class entirely.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 12 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Hmm, that is an issue. I would prefer that the class isn't disposable. You could refactor ParatextTermsCorpusBase into ParatextTermsParserBase. This would be patterned after the ParatextProjectSettingsParserBase. It would provide a Parse method that returns IEnumerable<(string, string)> You would then implement a ZipParatextTermsParserclass that is used byParatextBackupTermsCorpus`.

Done.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 17 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would just perform the settings parsing in the constructor of ZipParatextTermsParser. See ParatextProjectTextUpdaterBase for an example.

But the settings is needed here to create the text id. Related to my other comment, could we just return the text rows themselves? That would solve this as well.


tests/SIL.Machine.Tests/Corpora/ParatextTermsCorpusTests.cs line 8 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Really what you are testing here is ParatextTermsParserBase. It would be better if you tested the MemoryParatextTermsParser class directly and just got rid of the ParatextProjectTermsCorpus class entirely.

That's fair. It means that those few lines in the ParatextBackupTermsCorpus won't be tested. Is there a reason I couldn't just return the TextRows instead of the list of ids and glosses?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 17 at r6 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

But the settings is needed here to create the text id. Related to my other comment, could we just return the text rows themselves? That would solve this as well.

Good point. I missed that. You can provide the option to pass the settings to the constructor of ZipParatextTermsParser. If the settings is null, then you can use the settings parser, similar to what we have in ParatextProjectTextUpdaterBase.


src/SIL.Machine/Corpora/ParatextTermsParserBase.cs line 12 at r6 (raw file):

namespace SIL.Machine.Corpora
{
    public abstract class ParatextTermsParserBase

You should rename this to ParatextProjectTermsParserBase to match the other PT project classes.


tests/SIL.Machine.Tests/Corpora/ParatextTermsCorpusTests.cs line 8 at r6 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

That's fair. It means that those few lines in the ParatextBackupTermsCorpus won't be tested. Is there a reason I couldn't just return the TextRows instead of the list of ids and glosses?

This test fixture isn't testing the ParatextBackupTermsCorpus class. It is testing the mock. If you want to test the constructor of the ParatextBackupTermsCorpus class, then create a separate test fixture for that class. You should be able to use one of the existing test projects for those tests.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit)


tests/SIL.Machine.Tests/Corpora/ParatextTermsCorpusTests.cs line 8 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This test fixture isn't testing the ParatextBackupTermsCorpus class. It is testing the mock. If you want to test the constructor of the ParatextBackupTermsCorpus class, then create a separate test fixture for that class. You should be able to use one of the existing test projects for those tests.

True, I know it's not really testing it. I was just exploring the idea of returning the rows themselves to consolidate the logic and also simplify the issue above, but I can just do this.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 19 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 17 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Good point. I missed that. You can provide the option to pass the settings to the constructor of ZipParatextTermsParser. If the settings is null, then you can use the settings parser, similar to what we have in ParatextProjectTextUpdaterBase.

OK. Done.


src/SIL.Machine/Corpora/ParatextTermsParserBase.cs line 12 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should rename this to ParatextProjectTermsParserBase to match the other PT project classes.

Done. + child classes


src/SIL.Machine/Corpora/ParatextTermsParserBase.cs line 37 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would make the return type IEnumerable<(string TermId, IReadOnlyList<string> Glosses)>.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 7 of 7 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93 Enkidu93 merged commit c93f8dc into master Aug 22, 2024
3 of 4 checks passed
@Enkidu93 Enkidu93 deleted the key_terms branch August 22, 2024 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SF integration: use of Key Biblical Terms as training data
4 participants