-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #464 - add lock timeout #473
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
src/Machine/src/Serval.Machine.Shared/Configuration/ServiceOptions.cs
line 8 at r1 (raw file):
public string ServiceId { get; set; } = "machine_api"; public TimeSpan ReadWriteLockTimeout { get; set; } = TimeSpan.FromSeconds(55);
I don't think we can use the same timeout for all locks. Some locks are acquired to perform a short task and others are acquired to perform a long task. We will probably need to go through every use of the distributed lock and provide an appropriate lifetime.
Previously, ddaspit (Damien Daspit) wrote…
We are expecting them to not need the timeout at all - we had "infinite" before. What this is really only doing is (hopefully) not letting the whole server be brought down by a lock not being released and giving us breadcrumbs for figuring out which lock was the offender. Responsiveness is not the top thing on the list for me, but rather simplicity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we haven't done profiling on how long it takes to perform many of these operations, I would feel more comfortable starting with more conservative timeouts/lifetimes. Also, have we verified what happens when a request timeout occurs? Does the cancellation token get triggered? What status code is returned?
Reviewed 1 of 3 files at r2, 8 of 14 files at r3, 24 of 24 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @johnml1135)
src/Machine/src/Serval.Machine.Shared/Serval.Machine.Shared.csproj
line 49 at r5 (raw file):
<ProjectReference Include="..\..\..\DataAccess\src\SIL.DataAccess\SIL.DataAccess.csproj" /> <ProjectReference Include="..\..\..\Serval\src\Serval.Grpc\Serval.Grpc.csproj" /> <ProjectReference Include="..\..\..\Serval\src\Serval.Shared\Serval.Shared.csproj" />
We should not be referencing the Serval.Shared
library in the Machine engine.
src/Machine/src/Serval.Machine.Shared/Services/DistributedReaderWriterLock.cs
line 41 at r5 (raw file):
{ string lockId = _idGenerator.GenerateId(); StackTrace stackTrace = new StackTrace();
I think you accidentally left this debugging code in.
src/Machine/src/Serval.Machine.Shared/Services/SmtTransferBuildJob.cs
line 117 at r5 (raw file):
await using ( await @lock.WriterLockAsync( lifetime: _timeoutOptions.LongProcessLockLifetime,
Is this the only place where a longer lifetime is needed? Have you checked every place where a lock is acquired?
src/Serval/src/Serval.Shared/Configuration/TimeoutOptions.cs
line 3 at r4 (raw file):
namespace Serval.Shared.Configuration; public class TimeoutOptions
This is mixing Serval and Machine timeouts. The Machine timeouts should be specified in the Machine server.
src/Serval/src/Serval.Shared/Configuration/TimeoutOptions.cs
line 5 at r4 (raw file):
public class TimeoutOptions { public const string Key = "Api";
We should change this to Timeout
.
src/Serval/src/Serval.Shared/Configuration/TimeoutOptions.cs
line 7 at r4 (raw file):
public const string Key = "Api"; public TimeSpan DefaultHttpRequestTimeout { get; set; } = TimeSpan.FromSeconds(10);
I think we should use the same request timeout for everything. The main purpose of the request timeout is to avoid unnecessarily continuing to process a request if Cloudflare kills the connection due to a timeout. Cloudflare's timeout seems to be 100 seconds, so we should just set all of our request timeouts to something around that, maybe 90 seconds.
src/Serval/src/Serval.Shared/Configuration/TimeoutOptions.cs
line 10 at r4 (raw file):
public TimeSpan LongHttpRequestTimeout { get; set; } = TimeSpan.FromSeconds(58); // must be less than 60 seconds Cloudflare timeout public TimeSpan LongPollTimeout { get; set; } = TimeSpan.FromSeconds(40); // must be less than LongHttpRequestTimeout public TimeSpan DefaultLockTimeout { get; set; } = TimeSpan.FromSeconds(8); // must be less than DefaultHttpRequestTimeout
Do we need a lock timeout if we have a request timeout?
src/Serval/src/Serval.Shared/Configuration/TimeoutOptions.cs
line 11 at r4 (raw file):
public TimeSpan LongPollTimeout { get; set; } = TimeSpan.FromSeconds(40); // must be less than LongHttpRequestTimeout public TimeSpan DefaultLockTimeout { get; set; } = TimeSpan.FromSeconds(8); // must be less than DefaultHttpRequestTimeout public TimeSpan DefaultLockLifetime { get; set; } = TimeSpan.FromSeconds(6); // must be less than DefaultLockTimeout
We should be more conservative with the default lifetime. Maybe 60 seconds.
src/Serval/src/Serval.Shared/Configuration/TimeoutOptions.cs
line 12 at r4 (raw file):
public TimeSpan DefaultLockTimeout { get; set; } = TimeSpan.FromSeconds(8); // must be less than DefaultHttpRequestTimeout public TimeSpan DefaultLockLifetime { get; set; } = TimeSpan.FromSeconds(6); // must be less than DefaultLockTimeout public TimeSpan LongHttpLockLifetime { get; set; } = TimeSpan.FromSeconds(56); // must be less than HttpLongRequestTimeout
I don't think we need this.
src/Serval/src/Serval.Shared/Configuration/TimeoutOptions.cs
line 13 at r5 (raw file):
public TimeSpan DefaultLockLifetime { get; set; } = TimeSpan.FromSeconds(6); // must be less than DefaultLockTimeout public TimeSpan LongHttpLockLifetime { get; set; } = TimeSpan.FromSeconds(56); // must be less than HttpLongRequestTimeout public TimeSpan LongProcessLockLifetime { get; set; } = TimeSpan.FromSeconds(120);
I would either remove this option or make the option name more specific to where it is used. I don't think it is useful to have a general "long" lifetime. We should tailor the lifetime to the particular task.
src/Serval/src/Serval.Shared/Utils/TaskEx.cs
line 22 at r4 (raw file):
} public static async Task Timeout(
This should return a boolean indicating if the operation timed out, so that the calling code can decide what to do about it.
Previously, ddaspit (Damien Daspit) wrote…
Should the Timeout routine move somewhere else then? |
Previously, ddaspit (Damien Daspit) wrote…
Removed |
Previously, ddaspit (Damien Daspit) wrote…
I have briefly and did some testing. I also changed the Timeout for the long polling. |
Previously, ddaspit (Damien Daspit) wrote…
It is 60 (I am confident of that). I am fine setting all HTTP request timeouts to 58 seconds - it minimizes the impact. |
Previously, johnml1135 (John Lambert) wrote…
Now there is only one policy at 58 seconds. |
Previously, ddaspit (Damien Daspit) wrote…
we shouldn't actually need it - it may be helpful to know whether the task failed because it's own lifetime expired or a lock was not released. It may not need that logic though - now that ExpiresAt is mandatory. |
Previously, ddaspit (Damien Daspit) wrote…
Or, as you said above, we could remove it. |
Previously, ddaspit (Damien Daspit) wrote…
This is used for post training an SMT model. I can rename and move it to a new location. |
Previously, ddaspit (Damien Daspit) wrote…
Ok - should it be consistent with both Timeouts? |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Partial fix uploaded. We can discuss more tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 12 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @johnml1135)
src/Machine/src/Serval.Machine.Shared/Serval.Machine.Shared.csproj
line 49 at r5 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Should the Timeout routine move somewhere else then?
If we remove the timeout from the lock, then we don't need this at all.
src/Serval/src/Serval.Shared/Configuration/TimeoutOptions.cs
line 10 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
we shouldn't actually need it - it may be helpful to know whether the task failed because it's own lifetime expired or a lock was not released. It may not need that logic though - now that ExpiresAt is mandatory.
Yeah, I feel like the request timeout will handle this. It would be good to bail out of a task inside of a lock if it lasts longer than the configured lifetime. I am a bit worried about the lifetime for a lock being set to short and another thread causing data corruption or race conditions. We would need to redesign the locking call a bit.
Previously, ddaspit (Damien Daspit) wrote…
removed |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. Increased the standard lifetime to 58 seconds. |
72f753e
to
8704dac
Compare
Why are we deleting the job here? I am assuming that it is to correct for incomplete cleanup (the engine was deleted, but the jobs were not). |
So, you did like the Timeout as well? Throwing the action into the lock makes sense. |
You combined lifetime and Timeout into one field? |
Previously, johnml1135 (John Lambert) wrote…
Or, keep trying forever, but when you have it, only hold onto it for the specified time... |
In nearly all of these, there is only one of two database operations being done, already with a transaction. Therefore, the lock is really not relevant. |
Why no longer Async? |
This should be a global using. |
Global using |
if the cancellation token is thrown, the delay task will return - and think that the task timed out. The "RanToCompletion" status appears more reliable. |
Nice catch on the interceptor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 14 files at r3, 16 of 27 files at r7, 1 of 1 files at r8, 1 of 1 files at r10, 37 of 37 files at r11, all commit messages.
Reviewable status: all files reviewed, 17 unresolved discussions (waiting on @ddaspit)
This is still SaveAsync - you need to commit your machine code :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 24 files at r4, 1 of 4 files at r5, 4 of 12 files at r6, 13 of 13 files at r12, all commit messages.
Dismissed @ddaspit from a discussion.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but it isn't clear without further investigation.
Reviewed 27 of 27 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 32 of 37 files at r11, 13 of 13 files at r12, 2 of 2 files at r13, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @johnml1135)
src/Machine/src/Serval.Machine.Shared/Serval.Machine.Shared.csproj
line 49 at r5 (raw file):
Previously, johnml1135 (John Lambert) wrote…
removed
Done
src/Machine/src/Serval.Machine.Shared/Services/BuildJobService.cs
line 109 at r11 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why are we deleting the job here? I am assuming that it is to correct for incomplete cleanup (the engine was deleted, but the jobs were not).
We are performing the state check and the update as a single atomic operation, since we are no longer locking. In order to do this, we must first create the job, so that we can get the job id. If the engine is in the wrong state, i.e. there is already a build running, then we need to clean up the job that we just created before returning.
src/Machine/src/Serval.Machine.Shared/Services/DistributedReaderWriterLock.cs
line 85 at r11 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Or, keep trying forever, but when you have it, only hold onto it for the specified time...
This isn't the timeout for acquiring a lock. We don't need that, because the request timeout will handle it. This will bail out of the lock once the lifetime has expired. This buys us two things:
- If the task can no longer hold on to the lock, it should stop processing as soon as possible to avoid race conditions or data corruption.
- We can log what method actually had the lock when it expired using the
TimeoutInterceptor
.
src/Machine/src/Serval.Machine.Shared/Services/ITruecaserFactory.cs
line 6 at r11 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why no longer Async?
All of these async methods were masking the fact that the underlying SMT model calls are not actually async, so it makes it seem like they are cancelable when they actually aren't. I changed all of these SMT calls to synchronous calls, so that it is clear that they can't be canceled.
src/Machine/src/Serval.Machine.Shared/Services/SmtTransferEngineState.cs
line 1 at r11 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This should be a global using.
This was intentional. This using has name conflicts in other classes. By only including it here, we avoid the name conflicts.
src/Machine/src/Serval.Machine.Shared/Services/SmtTransferEngineStateService.cs
line 1 at r11 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Global using
See my other comment on this using.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/TaskEx.cs
line 18 at r11 (raw file):
Previously, johnml1135 (John Lambert) wrote…
if the cancellation token is thrown, the delay task will return - and think that the task timed out. The "RanToCompletion" status appears more reliable.
If the cancellation token is triggered, then TaskCanceledException
will be thrown, so this check is never called and therefore doesn't matter. Checking delayTask.Status
can result in a race condition. If delay task completes right after the main task completes and before the if
check, it will be incorrectly treated as a timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r13, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
a20f581
to
ffe2da5
Compare
* Add HTTP timeout * Make adjustable through options * Will need to delete all locks from MongoDB - otherwise will endlessly loop for startup * Fix some ci-e2e issues Only use locking when accessing SMT model Fix unit tests Update to latest version of Machine Fix bug where wrong id is used when starting a build Remove reference to Serval.Shared in Serval.Machine.Shared
ffe2da5
to
ecde716
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r14, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
Fix #464
This change is