Refactor ClearML NMT build job #101

Merged
merged 1 commit into from
Oct 6, 2023

Contributor

@ddaspit ddaspit commented Oct 3, 2023

  • add support for multiple build stages
  • add support for running build jobs on Hangfire or ClearML
  • add BuildJobService
  • categorize build jobs into CPU or GPU jobs
  • decouple build job runners from translation engines
  • fix issues with S3FileStorage
  • fix issues with ClearMLService

Collaborator

@Enkidu93 Enkidu93 left a comment

Reviewed 66 of 66 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

@johnml1135
Collaborator

src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs line 72 at r1 (raw file):

            return builder.AddThotSmtModel(o => { });
        else
            return builder.AddThotSmtModel(builder.Configuration.GetSection(ThotSmtModelOptions.Key));

Damien - can you explain what is happening here? Modifying the SMT models is not specified in the commit messages.

@johnml1135

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 9 at r1 (raw file):

    public string EngineId { get; set; } = default!;
    public string SourceLanguage { get; set; } = default!;
    public string TargetLanguage { get; set; } = default!;

Won't this break all existing builds in the Mongo DB? Should we just clear everything out, or do we need a migration plan?

@johnml1135

src/SIL.Machine.AspNetCore/Services/ClearMLAuthenticationService.cs line 44 at r1 (raw file):

    protected override void Started()
    {
        _logger.LogInformation("ClearML authentication service started.");

These functions do nothing. Should they be removed?

@johnml1135 johnml1135 left a comment

Reviewed 14 of 66 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit)

@johnml1135

src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 32 at r1 (raw file):

    protected override void Started()
    {
        _logger.LogInformation("ClearML monitor service started.");

Are these supposed to do more than just log?

@johnml1135

src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 32 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Are these supposed to do more than just log?

Could this be defined once in RecurrentTask and only overridden if necessary?
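
The suggestion could look roughly like the sketch below. `RecurrentTaskSketch` is a hypothetical stand-in, not the real RecurrentTask class, whose members are not shown in this thread. An `Action<string>` replaces the ILogger so the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a recurring-task base class. The real RecurrentTask
// in SIL.Machine.AspNetCore may differ in every detail.
public abstract class RecurrentTaskSketch
{
    private readonly string _serviceName;
    private readonly Action<string> _log;

    protected RecurrentTaskSketch(string serviceName, Action<string> log)
    {
        _serviceName = serviceName;
        _log = log;
    }

    // Default hooks log once here, so subclasses do not have to repeat the logging.
    protected virtual void Started() => _log($"{_serviceName} started.");

    protected virtual void Stopped() => _log($"{_serviceName} stopped.");

    public void Start() => Started();

    public void Stop() => Stopped();
}

// Relies entirely on the default hooks.
public sealed class MonitorServiceSketch : RecurrentTaskSketch
{
    public MonitorServiceSketch(Action<string> log)
        : base("ClearML monitor service", log) { }
}

// Overrides a hook only because it needs extra behaviour on start.
public sealed class AuthServiceSketch : RecurrentTaskSketch
{
    public bool TokenRefreshed { get; private set; }

    public AuthServiceSketch(Action<string> log)
        : base("ClearML authentication service", log) { }

    protected override void Started()
    {
        base.Started();
        TokenRefreshed = true; // placeholder for real start-up work
    }
}
```

With this layout a service that only needs the log line carries no override at all.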

@johnml1135

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 23 at r1 (raw file):

    }

    [Queue("nmt")]

Can this be thread limited to 1.5 threads? Or configurable from app settings as a float?

@johnml1135

src/SIL.Machine.AspNetCore/SIL.Machine.AspNetCore.csproj line 37 at r1 (raw file):

		<PackageReference Include="Microsoft.AspNetCore.Mvc.NewtonsoftJson" Version="6.0.16" />
		<PackageReference Include="Microsoft.Extensions.Http.Polly" Version="6.0.14" />
		<PackageReference Include="Python.Included" Version="3.11.4" />

Can you add an issue for completing this Python code? If it is incomplete, it should have an attached issue.

@ddaspit ddaspit force-pushed the refactor-clearml-job branch from 957a85d to e3dbe61 Compare October 4, 2023 22:36

@ddaspit ddaspit left a comment

Reviewable status: 54 of 66 files reviewed, 6 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine.AspNetCore/SIL.Machine.AspNetCore.csproj line 37 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Can you add an issue for completing this Python code? If it is incomplete, it should have an attached issue.

Done.


src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs line 72 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Damien - can you explain what is happening here? Modifying the SMT models is not specified in the commit messages.

This is just a refactoring of the configuration for the Thot SMT model to make it more consistent with the way that other services are configured.


src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 9 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Won't this break all existing builds in the Mongo DB? Should we just clear everything out, or do we need a migration plan?

That is a good question. Migrating would be simple. I think it would only require us to remove the BuildState and IsCanceled properties from all engines.


src/SIL.Machine.AspNetCore/Services/ClearMLAuthenticationService.cs line 44 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

These functions do nothing. Should they be removed?

Done.


src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 32 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Could this be defined once in RecurrentTask and only overridden if necessary?

Done.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 23 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Can this be thread limited to 1.5 threads? Or configurable from app settings as a float?

You can configure the number of workers that are available for Hangfire. The default is Math.Min(Environment.ProcessorCount * 5, MaxDefaultWorkerCount).
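
For reference, the default quoted above can be computed as in this sketch. It assumes that `MaxDefaultWorkerCount` is 20; treat that number as an assumption and check it against the Hangfire version in use. The `HangfireDefaults` class and its `ResolveWorkerCount` helper are hypothetical names. The worker count is an integer, so a fractional limit such as 1.5 cannot be expressed directly.

```csharp
using System;

public static class HangfireDefaults
{
    // Assumption: Hangfire caps its default worker count at 20.
    public const int AssumedMaxDefaultWorkerCount = 20;

    // Mirrors the formula quoted in the reply above.
    public static int DefaultWorkerCount(int processorCount) =>
        Math.Min(processorCount * 5, AssumedMaxDefaultWorkerCount);

    // Hypothetical helper: prefer a configured value, otherwise fall back to the default.
    public static int ResolveWorkerCount(int? configured, int processorCount)
    {
        if (configured is int value)
        {
            if (value < 1)
                throw new ArgumentOutOfRangeException(nameof(configured), "Worker count must be at least 1.");
            return value;
        }
        return DefaultWorkerCount(processorCount);
    }
}
```

The resolved value would typically be assigned to `BackgroundJobServerOptions.WorkerCount` when the Hangfire server is registered.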

@codecov-commenter

codecov-commenter commented Oct 4, 2023

Codecov Report

Attention: 674 lines in your changes are missing coverage. Please review.

| Files | Coverage Δ |
|---|---|
| ...achine.AspNetCore/Configuration/BuildJobOptions.cs | 100.00% <100.00%> (ø) |
| ...Machine.AspNetCore/Configuration/ClearMLOptions.cs | 100.00% <100.00%> (ø) |
| ...pNetCore/Configuration/SmtTransferEngineOptions.cs | 0.00% <ø> (-80.00%) ⬇️ |
| ...ne.AspNetCore/Configuration/ThotSmtModelOptions.cs | 0.00% <ø> (ø) |
| src/SIL.Machine.AspNetCore/Models/Build.cs | 100.00% <100.00%> (ø) |
| ...SIL.Machine.AspNetCore/Models/TranslationEngine.cs | 100.00% <100.00%> (ø) |
| ...SIL.Machine.AspNetCore/Services/InMemoryStorage.cs | 85.89% <100.00%> (-2.41%) ⬇️ |
| ...rc/SIL.Machine.AspNetCore/Services/LocalStorage.cs | 100.00% <100.00%> (+2.73%) ⬆️ |
| ...pNetCore/Services/SmtTransferEngineStateService.cs | 88.88% <100.00%> (+10.51%) ⬆️ |
| src/SIL.Machine/Utils/TempDirectory.cs | 100.00% <ø> (ø) |
| ... and 27 more | |

... and 4 files with indirect coverage changes

@johnml1135

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 169 at r2 (raw file):

        if (metrics is null)
            throw new InvalidOperationException("Malformed response from ClearML server.");
        var performanceMetrics = (JsonObject?)metrics.FirstOrDefault(m => (string?)m?["name"] == "metrics");

So, you have the new metrics code - does it do the same thing? Where is all the parsing of the results? I can't find it.

@ddaspit ddaspit left a comment

Reviewable status: 54 of 66 files reviewed, 7 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 169 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, you have the new metrics code - does it do the same thing? Where is all the parsing of the results? I can't find it.

There is a GetMetric method in the ClearMLMonitorService class that reads a metric from a task.
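
As a rough illustration of the kind of lookup in the quoted snippet, the sketch below finds a named entry in a JSON array with System.Text.Json.Nodes. The payload shape and the `MetricLookup.FindByName` helper are assumptions for illustration only; the actual ClearML response format and the real GetMetric implementation are not shown in this thread.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Text.Json.Nodes;

public static class MetricLookup
{
    // Returns the first object in the array whose "name" property is a string
    // equal to the requested name, or null when no such object exists.
    public static JsonObject? FindByName(JsonArray? items, string name)
    {
        if (items is null)
            throw new InvalidOperationException("Malformed response: expected a JSON array.");

        return items
            .OfType<JsonObject>()
            .FirstOrDefault(o =>
                o["name"] is JsonValue v
                && v.TryGetValue(out string? candidate)
                && candidate == name);
    }
}
```

Filtering with `OfType<JsonObject>()` skips entries that are not objects instead of throwing on them, which is a design choice of this sketch rather than something the thread specifies.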

@johnml1135

src/SIL.Machine.AspNetCore/Services/HangfireBuildJobRunner.cs line 37 at r2 (raw file):

    public Task CreateEngineAsync(string engineId, string? name = null, CancellationToken cancellationToken = default)
    {
        return Task.CompletedTask;

How is this creating an engine?

@johnml1135

src/SIL.Machine.AspNetCore/Services/HangfireBuildJobRunner.cs line 37 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

How is this creating an engine?

Ah, it's for ClearML; it doesn't exist for Hangfire.

@johnml1135

src/SIL.Machine.AspNetCore/Services/NmtEngineService.cs line 64 at r2 (raw file):

    public async Task DeleteAsync(string engineId, CancellationToken cancellationToken = default)
    {
        await _dataAccessContext.BeginTransactionAsync(cancellationToken);

Should we first check to see if we need to cancel any jobs? Is that done somewhere else?

@johnml1135

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 23 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can configure the number of workers that are available for Hangfire. The default is Math.Min(Environment.ProcessorCount * 5, MaxDefaultWorkerCount).

sillsdev/serval#154 - opened an issue to see if this will be a problem.

@johnml1135

src/SIL.Machine.AspNetCore/Services/NmtTrainBuildJob.cs line 60 at r2 (raw file):

            using (Py.GIL())
            {
                PythonEngine.Exec(

We need to finish this here: #105.

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 69 at r2 (raw file):

    public async Task DeleteAsync(string engineId, CancellationToken cancellationToken = default)
    {

There is no code coverage in CI for this. Probably a test or two could be added to cover deleting an engine with a build running.

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 118 at r2 (raw file):

    }

    public async Task<WordGraph> GetWordGraphAsync(

There is no CI coverage for this function.

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 150 at r2 (raw file):

            TranslationEngine engine = await GetEngineAsync(engineId, cancellationToken);
            if (engine.CurrentBuild?.JobState is BuildJobState.Active)
            {

I know there are almost no changes, but there are no CI tests for training a segment while the engine is building.

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 223 at r2 (raw file):

    )
    {
        using ISubscription<TranslationEngine> sub = await _engines.SubscribeAsync(

There are no tests to cover this function.

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 223 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There are no tests to cover this function.

Though it should be covered in CancelBuildAsync. What gives?

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 223 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Though it should be covered in CancelBuildAsync. What gives?

Never mind - it's covered in DeleteAsync: when a job is currently running, it is cancelled, and then we need to wait for it to finish up. A test covering that should suffice.

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineStateService.cs line 51 at r2 (raw file):

                if (
                    engine is not null
                    && (engine.CurrentBuild is null || engine.CurrentBuild.JobState is BuildJobState.Pending)

Any particular reason for the change in logic here? Why check that the current build is null or the job state is pending, instead of just checking that it is not active?

@johnml1135

src/SIL.Machine.Serval.EngineServer/appsettings.json line 21 at r2 (raw file):

  "ClearML": {
    "ApiServer": "https://api.sil.hosted.allegro.ai",
    "Queue": "production",

This should be overridden in either development app settings or in env variables. I would remove queue and docker image - let the program fail if they are not defined.

@johnml1135

tests/SIL.Machine.AspNetCore.Tests/Services/NmtEngineServiceTests.cs line 49 at r2 (raw file):

        Assert.That(engine.BuildRevision, Is.EqualTo(1));
    }

There should be a test to delete an engine after a build has started.

@johnml1135

src/SIL.Machine.AspNetCore/SIL.Machine.AspNetCore.csproj line 37 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Done.

I started an issue here - #105. Where is your issue?

@johnml1135

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 9 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

That is a good question. Migrating would be simple. I think it would only require us to remove the BuildState and IsCanceled properties from all engines.

Or, would the states just be ignored? Should they be transitioned? Maybe keep the old parameters and check at initialization if the deprecated fields are there - and then transition them?

@johnml1135

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 169 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

There is a GetMetric method in the ClearMLMonitorService class that reads a metric from a task.

Whatever it is, it looks more elegant. I am assuming it will get tested in E2E testing.

@johnml1135 johnml1135 left a comment

Reviewed 42 of 66 files at r1, 12 of 12 files at r2, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @ddaspit)

@ddaspit ddaspit force-pushed the refactor-clearml-job branch from e3dbe61 to dfcce0e Compare October 5, 2023 22:17

@ddaspit ddaspit left a comment

Reviewable status: 58 of 66 files reviewed, 12 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine.AspNetCore/SIL.Machine.AspNetCore.csproj line 37 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I started an issue here - #105. Where is your issue?

The issue is #103.


src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 9 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Or, would the states just be ignored? Should they be transitioned? Maybe keep the old parameters and check at initialization if the deprecated fields are there - and then transition them?

I don't think it is worth migrating them. They are only used when a build is running. When we upgrade, we will shut everything down, so there won't be any running builds. If we don't remove them, the Mongo driver will throw an exception when it tries to deserialize a translation engine doc into a model. I added SetIgnoreExtraElements(true) to the TranslationEngine map. This will cause the Mongo driver to ignore the old properties. We won't need to migrate or remove the properties.


src/SIL.Machine.AspNetCore/Services/NmtEngineService.cs line 64 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Should we first check to see if we need to cancel any jobs? Is that done somewhere else?

Done.


src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 69 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There is no code coverage in CI for this. Probably a test or two could be added to cover deleting an engine with a build running.

Done.


src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 118 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There is no CI coverage for this function.

Done.


src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 150 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I know there are almost no changes, but there are no CI tests for training a segment while the engine is building.

Done.


src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 223 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Never mind - it's covered in DeleteAsync: when a job is currently running, it is cancelled, and then we need to wait for it to finish up. A test covering that should suffice.

Done.


src/SIL.Machine.AspNetCore/Services/SmtTransferEngineStateService.cs line 51 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Any particular reason for the change in logic here? Why check that the current build is null or the job state is pending, instead of just checking that it is not active?

This was updated because of the addition of the canceling state.
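
To make the difference concrete, here is a small sketch comparing the two predicates. The enum members are the state names that appear in this thread (None, Pending, Active, Canceling); their order, the `BuildJobStateSketch` and `BuildStateChecks` names, and the use of a null state to stand for "no current build" are all assumptions, and the real BuildJobState may have other members.

```csharp
#nullable enable
using System;

// Hypothetical mirror of the states named in this thread.
public enum BuildJobStateSketch { None, Pending, Active, Canceling }

public static class BuildStateChecks
{
    // Older-style check as described by the reviewer: anything that is not
    // Active passes, which would include a build that is still canceling.
    public static bool IsNotActive(BuildJobStateSketch? currentBuildState) =>
        currentBuildState is not BuildJobStateSketch.Active;

    // Check from the quoted snippet: only "no current build" (null here)
    // or a Pending job passes.
    public static bool IsNullOrPending(BuildJobStateSketch? currentBuildState) =>
        currentBuildState is null or BuildJobStateSketch.Pending;
}
```

The two predicates agree for a missing build, a pending build, and an active build; they differ for a build in the canceling state, which only the older check lets through.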


src/SIL.Machine.Serval.EngineServer/appsettings.json line 21 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

And remove s3 uri - that changes based upon environment.

Done.


src/SIL.Machine.Serval.JobServer/appsettings.json line 22 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Remove queue, docker image and sharedfile.

Done.


src/SIL.Machine.Serval.JobServer/appsettings.json line 23 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Actually, it will throw an exception when making the periodic timer: https://learn.microsoft.com/en-us/dotnet/api/system.threading.periodictimer.-ctor?view=net-7.0

This disables the ClearMLMonitorService on the job server. I put in a check for a zero TimeSpan. There is no need for it to run on both the engine server and the job server.
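
A minimal sketch of such a guard, assuming .NET 6 or later where PeriodicTimer is available. `MonitorTimerFactory` and its `TryCreate` method are hypothetical names, not the actual ClearMLMonitorService code. The point is only that a zero period is treated as "disabled" before the PeriodicTimer constructor, which rejects non-positive periods, is ever called.

```csharp
#nullable enable
using System;
using System.Threading;

public static class MonitorTimerFactory
{
    // Returns null when polling is disabled (zero or negative period);
    // otherwise returns a timer that the caller is responsible for disposing.
    public static PeriodicTimer? TryCreate(TimeSpan period)
    {
        if (period <= TimeSpan.Zero)
            return null; // new PeriodicTimer(TimeSpan.Zero) would throw ArgumentOutOfRangeException

        return new PeriodicTimer(period);
    }
}
```

A caller would skip its polling loop entirely when `TryCreate` returns null.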


tests/SIL.Machine.AspNetCore.Tests/Services/NmtEngineServiceTests.cs line 49 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There should be a test to delete an engine after a build has started.

Done.

@johnml1135

src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs line 224 at r3 (raw file):

                o.AddRepository<TranslationEngine>(
                    "translation_engines",
                    mapSetup: m => m.SetIgnoreExtraElements(true),

If I am correct, this should allow it to not crash with upgrading - though it will still remove all builds...

@johnml1135

src/SIL.Machine.AspNetCore/Services/NmtEngineService.cs line 151 at r3 (raw file):

        );
        if (buildId is not null && jobState is BuildJobState.None)
            await _platformService.BuildCanceledAsync(buildId, CancellationToken.None);

I am confused why it is broken up. Would we ever cancel a build job instead of a build? What does cancelling a build do over and above just cancelling the job?

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 39 at r3 (raw file):

    public TranslationEngineType Type => TranslationEngineType.SmtTransfer;

    public async Task CreateAsync(

Now CreateAsync is showing as not covered by tests - weird: https://app.codecov.io/gh/sillsdev/machine/blob/refactor-clearml-job/src%2FSIL.Machine.AspNetCore%2FServices%2FSmtTransferEngineService.cs

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 197 at r3 (raw file):

        await using (await @lock.WriterLockAsync(cancellationToken: cancellationToken))
        {
            await CancelBuildJobAsync(engineId, cancellationToken);

Ok the same question for breaking out the cancel build and cancel build job. The only difference is the lock and updating the state? Why? Why would I want to call one or the other?

@johnml1135

src/SIL.Machine.Serval.JobServer/appsettings.json line 23 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This disables the ClearMLMonitorService on the job server. I put in a check for a zero TimeSpan. There is no need for it to run on both the engine server and the job server.

Can we have a parameter "enable polling" defaulted to false and instead place this on the engine server? I believe that would be clearer.

@johnml1135

tests/SIL.Machine.AspNetCore.Tests/Services/NmtEngineServiceTests.cs line 49 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Done.

Sorry to be finicky - can the test be called DeleteWhileBuildingAsync or similar?

@johnml1135

tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 98 at r3 (raw file):

    [Test]
    public async Task DeleteAsync()

DeleteWhileBuildingAsync?

@johnml1135

src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 223 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Done.

How did you get around this function (to be able to delete it)?

@johnml1135 johnml1135 left a comment

Reviewed 9 of 9 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit)

@ddaspit ddaspit force-pushed the refactor-clearml-job branch from ecc166f to a34ec44 Compare October 6, 2023 16:00

@ddaspit ddaspit left a comment

Reviewable status: 59 of 66 files reviewed, 7 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs line 224 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

If I am correct, this should allow it to not crash with upgrading - though it will still remove all builds...

Yes, Mongo won't throw an exception when an engine doc contains unrecognized properties. The only engines that will be in a bad state are those that had running builds when the servers were restarted. We should ensure that there aren't any active builds when we perform the upgrade, and we should be fine.


src/SIL.Machine.AspNetCore/Services/NmtEngineService.cs line 151 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I am confused why it is broken up. Would we ever cancel a build job instead of a build? What does cancelling a build do over and above just cancelling the job?

This private method allows me to reuse the same code in the CancelBuildAsync and DeleteAsync methods.
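
The layout being described can be sketched as follows. `EngineServiceSketch` is a hypothetical in-memory class: only the split between two public entry points and one shared private step mirrors the discussion. The real methods also deal with locks, build state, and platform notifications that are not reproduced here, and the boolean return values are an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical in-memory service used only to show the method layout.
public sealed class EngineServiceSketch
{
    private readonly HashSet<string> _engines = new HashSet<string>();
    private readonly HashSet<string> _enginesWithRunningBuild = new HashSet<string>();

    public void AddEngine(string engineId, bool hasRunningBuild)
    {
        _engines.Add(engineId);
        if (hasRunningBuild)
            _enginesWithRunningBuild.Add(engineId);
    }

    public bool Exists(string engineId) => _engines.Contains(engineId);

    // Public entry point: cancel the running build, reporting whether one existed.
    public Task<bool> CancelBuildAsync(string engineId) => CancelBuildJobAsync(engineId);

    // Public entry point: cancel any running build first, then remove the engine.
    public async Task DeleteAsync(string engineId)
    {
        await CancelBuildJobAsync(engineId);
        _engines.Remove(engineId);
    }

    // Shared private step reused by both public methods.
    private Task<bool> CancelBuildJobAsync(string engineId) =>
        Task.FromResult(_enginesWithRunningBuild.Remove(engineId));
}
```

Callers only ever see the two public methods; the private step exists so the cancellation logic is written once.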


src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 223 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

How did you get around this function (to be able to delete it)?

The distributed lock ensures that we are in a good state to delete the engine, so we don't need to wait until the build is completely finished. I never really liked waiting, since it could cause the Serval delete request to hang.


src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 39 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Now CreateAsync is showing as not covered by tests - weird: https://app.codecov.io/gh/sillsdev/machine/blob/refactor-clearml-job/src%2FSIL.Machine.AspNetCore%2FServices%2FSmtTransferEngineService.cs

Done.


src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs line 197 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok the same question for breaking out the cancel build and cancel build job. The only difference is the lock and updating the state? Why? Why would I want to call one or the other?

This private method allows me to reuse the same code in the CancelBuildAsync and DeleteAsync methods.


src/SIL.Machine.Serval.JobServer/appsettings.json line 23 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Can we have a parameter "enable polling" defaulted to false and instead place this on the engine server? I believe that would be clearer.

Done.


tests/SIL.Machine.AspNetCore.Tests/Services/NmtEngineServiceTests.cs line 49 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Sorry to be finicky - can the test be called DeleteWhileBuildingAsync or similar?

Done.


tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 98 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

DeleteWhileBuildingAsync?

Done.

@johnml1135

src/SIL.Machine.AspNetCore/Services/NmtEngineService.cs line 151 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This private method allows me to reuse the same code in the CancelBuildAsync and DeleteAsync methods.

Ah, one is private, one is public. I see.

@johnml1135 johnml1135 left a comment

Reviewed 7 of 7 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

@johnml1135

Some failing builds - https://github.com/sillsdev/machine/actions/runs/6434169750/job/17472827308. We should really run e2e on pull requests.

@ddaspit ddaspit force-pushed the refactor-clearml-job branch from a34ec44 to c8a27cc Compare October 6, 2023 18:05

@johnml1135 johnml1135 left a comment

Reviewed 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

@ddaspit ddaspit left a comment

I figured out why the E2E tests are failing. The appsettings changed and need to be updated in the Serval repo. In the future, it would be great if we could find a way to keep the appsettings for Machine in the Machine repo.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

@ddaspit ddaspit merged commit fe5b499 into master Oct 6, 2023
3 of 4 checks passed
@ddaspit ddaspit deleted the refactor-clearml-job branch October 6, 2023 19:53