Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smt on clearml #200

Merged
merged 4 commits into from
Jun 11, 2024
Merged

Smt on clearml #200

merged 4 commits into from
Jun 11, 2024

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented May 17, 2024

There are a still a bug or two to work out - but it should be good enough to review. There will likely be more churn in review at this point than from bugs.


This change is Reviewable

@johnml1135 johnml1135 requested a review from ddaspit May 17, 2024 20:03
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good.

Reviewed 46 of 46 files at r1, all commit messages.
Reviewable status: all files reviewed, 19 unresolved discussions (waiting on @johnml1135)


src/SIL.Machine.AspNetCore/Configuration/ClearMLBuildJobOptions.cs line 3 at r1 (raw file):

namespace SIL.Machine.AspNetCore.Configuration;

public class ClearMLBuildJobOptions

I would rename this. The Options suffix usually indicates that it is a config model that can be injected into a service. Maybe ClearMLBuildQueue.


src/SIL.Machine.AspNetCore/Configuration/ClearMLBuildJobOptions.cs line 5 at r1 (raw file):

public class ClearMLBuildJobOptions
{
    public const string Key = "ClearMLBuildJob";

I don't think this key is necessary.


src/SIL.Machine.AspNetCore/Configuration/ClearMLEngineTypeOptions.cs line 3 at r1 (raw file):

namespace SIL.Machine.AspNetCore.Configuration;

public class ClearMLEngineTypeOptions

Is this class used anywhere?


src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs line 361 at r1 (raw file):

    public static IMachineBuilder AddBuildJobService(this IMachineBuilder builder)
    {
        if (builder.Configuration is not null)

This method should support with and without the Configuration property being set, just like it did before.


src/SIL.Machine.AspNetCore/Models/Build.cs line 11 at r1 (raw file):

}

public enum JobRunnerType

I would rename this to BuildJobRunnerType to make it consistent with the other enums.


src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r1 (raw file):

    public int Revision { get; set; } = 1;
    public required string EngineId { get; init; }
    public required TranslationEngineType Type { get; init; } = TranslationEngineType.Nmt;

A default value isn't necessary.


src/SIL.Machine.AspNetCore/Services/BuildJobService.cs line 81 at r1 (raw file):

            return false;
        }
        TranslationEngine engine = (await _engines.GetAsync(e => e.EngineId == engineId, cancellationToken))!;

You should update the preceding ExistsAsync call to GetAsync, so you don't have to hit the database twice.


src/SIL.Machine.AspNetCore/Services/ClearMLHealthCheck.cs line 11 at r1 (raw file):

    private readonly HttpClient _httpClient = httpClientFactory.CreateClient("ClearML-NoRetry");
    private readonly IClearMLAuthenticationService _clearMLAuthenticationService = clearMLAuthenticationService;
    private readonly IReadOnlyList<string> _queuesMonitored = buildJobOptions

It would probably be easier just to make this a hash set.


src/SIL.Machine.AspNetCore/Services/ClearMLHealthCheck.cs line 28 at r1 (raw file):

            if (!await PingAsync(cancellationToken))
                return HealthCheckResult.Unhealthy("ClearML is unresponsive");
            IEnumerable<string> queuesWithoutWorkers = await QueuesWithoutWorkers(cancellationToken);

I would return a IReadOnlySet and check the Count. LINQ functions can often be inefficient if the underlying type is a collection.


src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 34 at r1 (raw file):

        buildJobOptions.CurrentValue.ClearML.ToDictionary(x => x.TranslationEngineType, x => x.Queue);

    public IDictionary<TranslationEngineType, int> QueueSizePerEngineType { get; private set; } =

I would add a separate interface to expose the queue sizes. Maybe something like IClearMLQueueService. The dictionary could have issues with race conditions. I would use ConcurrentDictionary and just expose the queue size using a method GetQueueSize.


src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 66 at r1 (raw file):

                    .ToDictionary(t => t.Id);
                // add new keys to dictionary
                tasksPerEngineType.ToList().ForEach(x => tasks.TryAdd(x.Key, x.Value));

The ToList() isn't necessary.


src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 74 at r1 (raw file):

                    .ToDictionary(e => e.Task.Name, e => e.Position);
                // add new keys to dictionary
                queuePositionsPerEngineType.ToList().ForEach(x => queuePositions.TryAdd(x.Key, x.Value));

The ToList() isn't necessary.


src/SIL.Machine.AspNetCore/Services/PostprocessBuildJob.cs line 14 at r1 (raw file):

    protected readonly ISharedFileService SharedFileService = sharedFileService;

    public abstract bool GetPretranslationEnabled();

I would just make this a protected property. You could also pass in the value for this to the constructor, so it doesn't have to be overridden.


src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 13 at r1 (raw file):

    private Random _random;

    public abstract TranslationEngineType GetEngineType();

I would make these protected properties. You could also pass in the values for this to the constructor, so they don't have to be overridden.


src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 85 at r1 (raw file):

        {
            bool canceling = !await BuildJobService.StartBuildJobAsync(
                TrainJobRunnerType,

I think it would be clearer to pass JobRunnerType.ClearML.


src/SIL.Machine.AspNetCore/Services/SmtTransferPostprocessBuildJob.cs line 31 at r1 (raw file):

    )
    {
        await DownloadBuiltEngineAsync(engineId, cancellationToken);

All of the following should happen in single lock:

  • The model updated on disk.
  • The model trained on new segments.
  • The local state updated.
  • The Serval state updated.

src/SIL.Machine.AspNetCore/Services/SmtTransferPostprocessBuildJob.cs line 80 at r1 (raw file):

                return segmentPairs.Count;

            SmtTransferEngineState state = _stateService.Get(engineId);

The SmtTransferEngineStateService was designed for the live inferencing in the engine server. I wouldn't use it here. You can just load the model directly and train the segments.


tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 126 at r1 (raw file):

        Assert.That(engine.CurrentBuild, Is.Not.Null);
        Assert.That(engine.CurrentBuild.JobState, Is.EqualTo(BuildJobState.Active));
        await Task.Delay(200);

What is the delay for? This usually indicates that the test might not be deterministic and therefore flaky.


tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 274 at r1 (raw file):

            TruecaserTrainer = Substitute.For<ITrainer>();
            EngineOptions = Substitute.For<IOptionsMonitor<SmtTransferEngineOptions>>();
            DirectoryInfo tempDir = Directory.CreateTempSubdirectory();

It is best to avoid hitting the file system in a unit test. What is this for?

@codecov-commenter
Copy link

codecov-commenter commented May 20, 2024

Codecov Report

Attention: Patch coverage is 69.68026% with 275 lines in your changes missing coverage. Please review.

Project coverage is 67.28%. Comparing base (bf9bb16) to head (960ee4b).

Files Patch % Lines
...chine.AspNetCore/Services/ClearMLMonitorService.cs 15.62% 54 Missing ⚠️
...Machine.AspNetCore/Services/ThotSmtModelFactory.cs 0.00% 38 Missing ⚠️
...ne.AspNetCore/Services/SmtTransferTrainBuildJob.cs 77.91% 32 Missing and 4 partials ⚠️
...hine.AspNetCore/Utils/DictionaryStringConverter.cs 0.00% 36 Missing ⚠️
...NetCore/Configuration/IMachineBuilderExtensions.cs 0.00% 34 Missing ⚠️
....Machine.AspNetCore/Services/PreprocessBuildJob.cs 92.16% 19 Missing and 7 partials ⚠️
....Machine.AspNetCore/Services/ClearMLHealthCheck.cs 0.00% 13 Missing ⚠️
...ine.AspNetCore/Services/UnigramTruecaserFactory.cs 0.00% 7 Missing ⚠️
...c/SIL.Machine.AspNetCore/Services/S3WriteStream.cs 0.00% 6 Missing ⚠️
...Core/Services/SmtTransferClearMLBuildJobFactory.cs 83.33% 3 Missing and 2 partials ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #200      +/-   ##
==========================================
+ Coverage   67.22%   67.28%   +0.05%     
==========================================
  Files         441      445       +4     
  Lines       35281    35415     +134     
  Branches     4719     4736      +17     
==========================================
+ Hits        23718    23828     +110     
- Misses      10471    10496      +25     
+ Partials     1092     1091       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Configuration/ClearMLBuildJobOptions.cs line 3 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would rename this. The Options suffix usually indicates that it is a config model that can be injected into a service. Maybe ClearMLBuildQueue.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Configuration/ClearMLBuildJobOptions.cs line 5 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think this key is necessary.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Configuration/ClearMLEngineTypeOptions.cs line 3 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Is this class used anywhere?

Nice catch - was made by accident.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs line 361 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should support with and without the Configuration property being set, just like it did before.

I believe this is the only change needed.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Models/Build.cs line 11 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would rename this to BuildJobRunnerType to make it consistent with the other enums.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

A default value isn't necessary.

Ok - especially if we will delete all Machine repo entries.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/BuildJobService.cs line 81 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should update the preceding ExistsAsync call to GetAsync, so you don't have to hit the database twice.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLHealthCheck.cs line 11 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It would probably be easier just to make this a hash set.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLHealthCheck.cs line 28 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would return a IReadOnlySet and check the Count. LINQ functions can often be inefficient if the underlying type is a collection.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 34 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would add a separate interface to expose the queue sizes. Maybe something like IClearMLQueueService. The dictionary could have issues with race conditions. I would use ConcurrentDictionary and just expose the queue size using a method GetQueueSize.

I think this is what you are looking for.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 66 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The ToList() isn't necessary.

It did need it - but I refactored to be a little clearer. What do you think?

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 74 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The ToList() isn't necessary.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/PostprocessBuildJob.cs line 14 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would just make this a protected property. You could also pass in the value for this to the constructor, so it doesn't have to be overridden.

Ok - removed the primary constructors.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 13 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would make these protected properties. You could also pass in the values for this to the constructor, so they don't have to be overridden.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 85 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think it would be clearer to pass JobRunnerType.ClearML.

This can be overridden to do hangfire for SMT for testing.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/SmtTransferPostprocessBuildJob.cs line 31 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

All of the following should happen in single lock:

  • The model updated on disk.
  • The model trained on new segments.
  • The local state updated.
  • The Serval state updated.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/SmtTransferPostprocessBuildJob.cs line 80 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The SmtTransferEngineStateService was designed for the live inferencing in the engine server. I wouldn't use it here. You can just load the model directly and train the segments.

How do I load it directly?

@johnml1135
Copy link
Collaborator Author

tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 126 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What is the delay for? This usually indicates that the test might not be deterministic and therefore flaky.

It was on windows...

@johnml1135
Copy link
Collaborator Author

tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 274 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is best to avoid hitting the file system in a unit test. What is this for?

I believe we could not use a memory stream - I could be wrong...

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r2, 27 of 27 files at r3, all commit messages.
Reviewable status: all files reviewed, 18 unresolved discussions (waiting on @johnml1135)


src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs line 361 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I believe this is the only change needed.

This should be what we want:

    public static IMachineBuilder AddBuildJobService(this IMachineBuilder builder, string? smtTransferEngineDir = null)
    {
        builder.Services.AddScoped<IBuildJobService, BuildJobService>();

        builder.Services.AddScoped<IBuildJobRunner, ClearMLBuildJobRunner>();
        builder.Services.AddScoped<IClearMLBuildJobFactory, NmtClearMLBuildJobFactory>();
        builder.Services.AddScoped<IClearMLBuildJobFactory, SmtTransferClearMLBuildJobFactory>();
        builder.Services.AddSingleton<IClearMLQueueService, ClearMLMonitorService>();
        builder.Services.AddHostedService(p => p.GetRequiredService<ClearMLMonitorService>());

        builder.Services.AddScoped<IBuildJobRunner, HangfireBuildJobRunner>();
        builder.Services.AddScoped<IHangfireBuildJobFactory, NmtHangfireBuildJobFactory>();
        builder.Services.AddScoped<IHangfireBuildJobFactory, SmtTransferHangfireBuildJobFactory>();

        if (smtTransferEngineDir is null)
        {
            var smtTransferEngineOptions = new SmtTransferEngineOptions();
            builder.Configuration?.GetSection(SmtTransferEngineOptions.Key).Bind(smtTransferEngineOptions);
            smtTransferEngineDir = smtTransferEngineOptions.EnginesDir;
        }
        string? driveLetter = Path.GetPathRoot(smtTransferEngineDir)?[..1];
        if (driveLetter is null)
            throw new InvalidOperationException("SMT Engine directory is required");
        // add health check for disk storage capacity
        builder
            .Services.AddHealthChecks()
            .AddDiskStorageHealthCheck(
                x => x.AddDrive(driveLetter, 1_000), // 1GB
                "SMT Engine Storage Capacity",
                HealthStatus.Degraded
            );

        return builder;
    }

src/SIL.Machine.AspNetCore/Services/ClearMLHealthCheck.cs line 15 at r3 (raw file):

    private readonly ISet<string> _queuesMonitored = buildJobOptions
        .CurrentValue.ClearML.Select(x => x.Queue)
        .Distinct()

The Distinct call is unnecessary.


src/SIL.Machine.AspNetCore/Services/ClearMLHealthCheck.cs line 82 at r3 (raw file):

    public async Task<IReadOnlySet<string>> QueuesWithoutWorkers(CancellationToken cancellationToken = default)
    {
        ISet<string> queuesWithoutWorkers = _queuesMonitored.ToHashSet();

If you change the type to HashSet<string>, then you don't need to convert to an ImmutableHashSet before returning. HashSet implements the IReadOnlySet interface.


src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 34 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I think this is what you are looking for.

I suggested exposing GetQueueSize instead of the dictionary, because we don't want to expose a mutable dictionary. GetQueueSize is also a simpler API.


src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 66 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

It did need it - but I refactored to be a little clearer. What do you think?

It isn't safe to iterate over a collection while mutating it. I think the following would work:

foreach (KeyValuePair<string, ClearMLTask> kvp in tasksPerEngineType)
    tasks.TryAdd(kvp.Key, kvp.Value);

src/SIL.Machine.AspNetCore/Services/ModelCleanupService.cs line 29 at r3 (raw file):

        // Get all engine ids from the database
        IReadOnlyList<TranslationEngine>? allEngines = await _engines.GetAllAsync(cancellationToken: cancellationToken);
        IEnumerable<string> validFilenames = allEngines

It would probably be good to rename these variables to indicate that these are for NMT models. Maybe validNmtModelFilenames.


src/SIL.Machine.AspNetCore/Services/ModelCleanupService.cs line 38 at r3 (raw file):

            .Select(e => NmtEngineService.GetModelPath(e.EngineId, e.BuildRevision + 1));
        // Add SMT engines
        IEnumerable<string> validSMTModels = allEngines

I don't think this is being used anywhere.


src/SIL.Machine.AspNetCore/Services/NmtEngineService.cs line 172 at r3 (raw file):

    }

    public Task<int> GetQueueSizeAsync(CancellationToken cancellationToken = default)

This method doesn't need to be async anymore.


src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 13 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Done.

These properties aren't used outside of this class and the classes that inherit from it, so the properties don't need to be public.


src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 7 at r3 (raw file):

    private static readonly JsonWriterOptions PretranslateWriterOptions = new() { Indented = true };

    public BuildJobRunnerType TrainJobRunnerType = BuildJobRunnerType.ClearML;

If this is used only for unit testing, then it should be an internal property.


src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 181 at r3 (raw file):

            }

            foreach (Row row in AlignPretranslateCorpus(sourceTextCorpora[0], targetTextCorpus))

We should skip over this loop if pretranslations aren't enabled.


src/SIL.Machine.AspNetCore/Services/SmtTransferPostprocessBuildJob.cs line 80 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

How do I load it directly?

You can look at the original SmtTransferBuildJob class to get an idea.


src/SIL.Machine.AspNetCore/Services/SmtTransferPostprocessBuildJob.cs line 54 at r3 (raw file):

        string extractDir = Path.Combine(_engineOptions.CurrentValue.EnginesDir, engineId);
        Directory.CreateDirectory(extractDir);
        ZipFile.ExtractToDirectory(sharedStream, extractDir, overwriteFiles: true);

File system access should be abstracted behind an interface to make it easier to write unit tests. In this case, the ISmtModelFactory interface would probably be appropriate. It is the interface that is currently used to abstract file access to SMT models.


tests/SIL.Machine.AspNetCore.Tests/Services/NmtEngineServiceTests.cs line 198 at r3 (raw file):

        public NmtEngineService Service { get; private set; }
        public IClearMLQueueService ClearMLMonitorService { get; }

You forgot to rename this property.


tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 126 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

It was on windows...

You should be able to figure out how to get the test to work without the delay.


tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 274 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I believe we could not use a memory stream - I could be wrong...

See my comment above.


tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs line 356 at r3 (raw file):

        public IClearMLService ClearMLService { get; }
        public IClearMLQueueService ClearMLMonitorService { get; }

You forgot to rename this property.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs line 361 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be what we want:

    public static IMachineBuilder AddBuildJobService(this IMachineBuilder builder, string? smtTransferEngineDir = null)
    {
        builder.Services.AddScoped<IBuildJobService, BuildJobService>();

        builder.Services.AddScoped<IBuildJobRunner, ClearMLBuildJobRunner>();
        builder.Services.AddScoped<IClearMLBuildJobFactory, NmtClearMLBuildJobFactory>();
        builder.Services.AddScoped<IClearMLBuildJobFactory, SmtTransferClearMLBuildJobFactory>();
        builder.Services.AddSingleton<IClearMLQueueService, ClearMLMonitorService>();
        builder.Services.AddHostedService(p => p.GetRequiredService<ClearMLMonitorService>());

        builder.Services.AddScoped<IBuildJobRunner, HangfireBuildJobRunner>();
        builder.Services.AddScoped<IHangfireBuildJobFactory, NmtHangfireBuildJobFactory>();
        builder.Services.AddScoped<IHangfireBuildJobFactory, SmtTransferHangfireBuildJobFactory>();

        if (smtTransferEngineDir is null)
        {
            var smtTransferEngineOptions = new SmtTransferEngineOptions();
            builder.Configuration?.GetSection(SmtTransferEngineOptions.Key).Bind(smtTransferEngineOptions);
            smtTransferEngineDir = smtTransferEngineOptions.EnginesDir;
        }
        string? driveLetter = Path.GetPathRoot(smtTransferEngineDir)?[..1];
        if (driveLetter is null)
            throw new InvalidOperationException("SMT Engine directory is required");
        // add health check for disk storage capacity
        builder
            .Services.AddHealthChecks()
            .AddDiskStorageHealthCheck(
                x => x.AddDrive(driveLetter, 1_000), // 1GB
                "SMT Engine Storage Capacity",
                HealthStatus.Degraded
            );

        return builder;
    }

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLHealthCheck.cs line 82 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

If you change the type to HashSet<string>, then you don't need to convert to an ImmutableHashSet before returning. HashSet implements the IReadOnlySet interface.

explicit cast...

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 34 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I suggested exposing GetQueueSize instead of the dictionary, because we don't want to expose a mutable dictionary. GetQueueSize is also a simpler API.

Makes sense - that would be a more standard interface.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 66 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It isn't safe to iterate over a collection while mutating it. I think the following would work:

foreach (KeyValuePair<string, ClearMLTask> kvp in tasksPerEngineType)
    tasks.TryAdd(kvp.Key, kvp.Value);

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLHealthCheck.cs line 15 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The Distinct call is unnecessary.

Hashset already makes it distinct...

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ModelCleanupService.cs line 29 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It would probably be good to rename these variables to indicate that these are for NMT models. Maybe validNmtModelFilenames.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 27 of 31 files at r7, 4 of 4 files at r9, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @johnml1135)


src/SIL.Machine.AspNetCore/Models/ClearMLTask.cs line 37 at r9 (raw file):

}

internal sealed class DictionaryStringStringConverter : JsonConverter<IReadOnlyDictionary<string, string>>

This should be moved to the SIL.Machine.AspNetCore.Utils namespace.


src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 164 at r9 (raw file):

                                engine.EngineId,
                                engine.CurrentBuild.BuildId,
                                (int)GetMetric(task, SummaryMetric, TrainCorpusSizeVariant),

We need to test that the train corpus size and confidence are correctly set in the Serval build.


src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 202 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is happening, because you aren't using IAsyncDisposable.DisposeAsync(). The correct way to fix this is to use await using when you create the stream and writer.

Done


src/SIL.Machine.AspNetCore/Services/S3WriteStream.cs line 50 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is dangerous to call async methods from sync methods. We should revert these changes.

Done


src/SIL.Machine.AspNetCore/Services/S3WriteStream.cs line 1 at r9 (raw file):

using CommunityToolkit.HighPerformance;

This should be moved to the global usings.


src/SIL.Machine.AspNetCore/Services/SmtTransferPostprocessBuildJob.cs line 25 at r9 (raw file):

        await using (
            Stream engineStream = await SharedFileService.OpenReadAsync(
                $"builds/{buildId}/model.zip",

We need to figure out where to save the model on the S3 bucket. Right now, I believe that Machine.py is saving the model to the models directory. I don't know if this is the best place to save the model. The models directory is currently being used for downloading models. This is a different use case. Here we simply need to transfer the model from ClearML to the Machine engine. I think it would probably be better to save the model to the builds directory, which is what this code currently expects.


src/SIL.Machine.AspNetCore/Services/ThotSmtModelFactory.cs line 72 at r9 (raw file):

        if (!Directory.Exists(engineDir))
            Directory.CreateDirectory(engineDir);
        ZipFile.ExtractToDirectory(source, engineDir, overwriteFiles: true);

Machine.py is now saving the model as .tar.gz. We need to update this code accordingly.


src/SIL.Machine.AspNetCore/Services/SmtTransferPreprocessBuildJob.cs line 18 at r6 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is hard coded - it should be configurable through either build options (probably not) or environment variables. Right now, with the python SMT so slow, the E2E tests will time out. We need to either speed up the python code or run the SMT in hangfire for E2E testing.

Done

@johnml1135 johnml1135 force-pushed the smt_on_clearml branch 3 times, most recently from 4a80d86 to a1b7210 Compare June 7, 2024 19:35
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r11, 3 of 3 files at r12, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @johnml1135)

* Replace CPU, GPU types with just Hangfire vs ClearML as well as engine type
* Allow each engine type to have it's own queue and docker image
* SMT build defaults on ClearML
* NMT local train removed
* Use .zip for SMT model moving
* Update cleanup script for SMT models
* Download and upload model in factory
@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Models/ClearMLTask.cs line 37 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be moved to the SIL.Machine.AspNetCore.Utils namespace.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs line 164 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We need to test that the train corpus size and confidence are correctly set in the Serval build.

Added to E2E test to verify corpus size and confidence.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/S3WriteStream.cs line 1 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be moved to the global usings.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/SmtTransferPostprocessBuildJob.cs line 25 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We need to figure out where to save the model on the S3 bucket. Right now, I believe that Machine.py is saving the model to the models directory. I don't know if this is the best place to save the model. The models directory is currently being used for downloading models. This is a different use case. Here we simply need to transfer the model from ClearML to the Machine engine. I think it would probably be better to save the model to the builds directory, which is what this code currently expects.

I already moved it to the builds directory for both machine and machine.py. I agree - I want the SMT model to auto-delete the same way the other build files auto-delete at the end of the build job.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ThotSmtModelFactory.cs line 72 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Machine.py is now saving the model as .tar.gz. We need to update this code accordingly.

I changed it back to .zip. Dotnet handles .zip easier than tar.gz and python doesn't care.

Copy link
Collaborator Author

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 46 files at r1, 1 of 27 files at r3, 11 of 31 files at r7.
Reviewable status: 59 of 66 files reviewed, 7 unresolved discussions (waiting on @ddaspit)


src/SIL.Machine.AspNetCore/Services/PreprocessBuildJob.cs line 7 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ah, the things I don't know about C#... how can I count they ways, as long as the count property is public because I am not part of the assembly.

Actually, I do want it to be public -

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ThotSmtModelFactory.cs line 72 at r9 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I changed it back to .zip. Dotnet handles .zip easier than tar.gz and python doesn't care.

Correction - brought back to gzip - just needed to use a new dotnet library to make the code of sensible size.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r13, 4 of 4 files at r14.
Reviewable status: 59 of 67 files reviewed, 3 unresolved discussions (waiting on @johnml1135)


src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r6 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This may break because of database migration. Current plan:

  • bring down machine
  • delete all machine database entries
  • Bring up the new machine

I don't think that deleting the Machine database is an option. It would break all of the existing Serval engines. I believe that this is the only breaking DB change. I actually think that we might be able to get by without this property. It is only used when starting a build job. We should be able to pass in the engine type when we call IBuildJobService.StartBuildJobAsync. Then one place where it is tricky is when we start the postprocess job from the ClearMLMonitorService. If we have some way to differentiate between the engine types from the ClearML task metadata, then it should be doable.


src/SIL.Machine.AspNetCore/Services/ThotSmtModelFactory.cs line 72 at r9 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I changed it back to .zip. Dotnet handles .zip easier than tar.gz and python doesn't care.

You can use the GZipStream and TarFile classes to extract the .tar.gz file. Here is an example that uses TarReader, but it should be easy to adapt to TarFile.

- refactored SMT engine classes
- more consistent model factories
- fixed some dispose issues
- Preserve changes from #205.
- change SMT model save location to be in build - always do it and auto-save when delete.
- Revert to tar.gz
- Remove SMT model cleanup (unneeded)
@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think that deleting the Machine database is an option. It would break all of the existing Serval engines. I believe that this is the only breaking DB change. I actually think that we might be able to get by without this property. It is only used when starting a build job. We should be able to pass in the engine type when we call IBuildJobService.StartBuildJobAsync. Then one place where it is tricky is when we start the postprocess job from the ClearMLMonitorService. If we have some way to differentiate between the engine types from the ClearML task metadata, then it should be doable.

The only concern would be historic jobs that died during the job but because of the hangfire restart issue, are just languishing in an incomplete state. When the hangfire issue is fixed, the jobs will start back up again and will fail because they don't have a translation engine type. That is why I am considering just killing the jobs that never completed.
In terms of differentiating, I don't know if we can reliably. We could update the database manually by stripping the engine type from Serval and putting into the mongo DB, but that would be an update script, or a database conversion. Do we want to go down that route? Again, I recommend just deleting all jobs from Machine.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/ThotSmtModelFactory.cs line 72 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can use the GZipStream and TarFile classes to extract the .tar.gz file. Here is an example that uses TarReader, but it should be easy to adapt to TarFile.

done (by you).

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 8 files at r15, 3 of 3 files at r16, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)


src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r6 (raw file):

Previously, johnml1135 (John Lambert) wrote…

The only concern would be historic jobs that died during the job but because of the hangfire restart issue, are just languishing in an incomplete state. When the hangfire issue is fixed, the jobs will start back up again and will fail because they don't have a translation engine type. That is why I am considering just killing the jobs that never completed.
In terms of differentiating, I don't know if we can reliably. We could update the database manually by stripping the engine type from Serval and putting into the mongo DB, but that would be an update script, or a database conversion. Do we want to go down that route? Again, I recommend just deleting all jobs from Machine.

We should meet to come up with a plan.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should meet to come up with a plan.

  1. Write python script to delete "currentBuild" from every translation engine and remove the machine_jobs hangfire database.
  2. Test on Internal QA: using 1.4.5, (1) run a set of E2E tests. (2) Spin down machine. (3) Run Python script (6) spin up 1.5.0 (7) rebuild an existing SMT and NMT engine. (8) run all E2E tests again.
  3. For production, (1) make sure there are no running builds (2) spin down Serval (3) run script (4) Spin up Serval 1.5

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r6 (raw file):

Previously, johnml1135 (John Lambert) wrote…
  1. Write python script to delete "currentBuild" from every translation engine and remove the machine_jobs hangfire database.
  2. Test on Internal QA: using 1.4.5, (1) run a set of E2E tests. (2) Spin down machine. (3) Run Python script (6) spin up 1.5.0 (7) rebuild an existing SMT and NMT engine. (8) run all E2E tests again.
  3. For production, (1) make sure there are no running builds (2) spin down Serval (3) run script (4) Spin up Serval 1.5

sillsdev/serval#407

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to revert the last commit, then I will approve.

Reviewed all commit messages.
Reviewable status: 63 of 67 files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135
Copy link
Collaborator Author

Reverted.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r18, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 johnml1135 merged commit 8c39e3f into master Jun 11, 2024
4 checks passed
@johnml1135 johnml1135 deleted the smt_on_clearml branch June 11, 2024 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants