
[ODS-6319] Modify retry policy for deadlocks to wrap just the NHibernate persistence operations #1013

Merged: 4 commits merged into main on Apr 26, 2024

Conversation

gmcelhanon (Contributor)

No description provided.

axelmarquezh (Contributor) left a comment:


With these changes, we will no longer execute the whole pipeline when a deadlock occurs, meaning that some scenarios will produce a different result.

For example, 2 clients try to insert a resource with the same id; for whatever reason, one client gets deadlocked, so it retries. Before this change, the pipeline would trigger an update; after this change, it will fail with a duplicate key violation because it retries the insert.

Another example is when a client tries to update a resource that another client deleted. Before this change, it would return a 404 on the retry; after this change, it will return a 204.

I'm not sure how often deadlocks happen in the field, and what scenarios cause them, so your change might be beneficial for most of the scenarios that actually happen.

Thoughts?

…plementations.

# Conflicts:
#	Application/EdFi.Ods.Api/EdFi.Ods.Api.csproj
#	Application/EdFi.Ods.Common/EdFi.Ods.Common.csproj
#	Application/EdFi.Ods.Common/Infrastructure/Repositories/CreateEntity.cs
#	Application/EdFi.Ods.Common/Infrastructure/Repositories/UpdateEntity.cs
gmcelhanon (Contributor, Author) commented on Apr 25, 2024:

> With these changes, we will no longer execute the whole pipeline when a deadlock occurs, meaning that some scenarios will produce a different result.
>
> For example, 2 clients try to insert a resource with the same id; for whatever reason, one client gets deadlocked, so it retries. Before this change, the pipeline would trigger an update; after this change, it will fail with a duplicate key violation because it retries the insert.
>
> Another example is when a client tries to update a resource that another client deleted. Before this change, it would return a 404 on the retry; after this change, it will return a 204.
>
> I'm not sure how often deadlocks happen in the field, and what scenarios cause them, so your change might be beneficial for most of the scenarios that actually happen.
>
> Thoughts?

I think I would characterize what you are describing here as race conditions rather than database deadlocks; the scenario this code is explicitly handling is the latter. Deadlocks occur when two transactions are each waiting on the other to release locks on resources, the database engine detects the cycle, and it arbitrarily kills one of the transactions, which surfaces as a very specific type of exception.

I think the scenarios you're describing above would most likely surface as regular SQL exceptions rather than deadlocks (e.g. an INSERT failing due to an attempt to insert a duplicate key, or NHibernate detecting that an UPDATE failed because no records were modified).
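For reference, here is a minimal sketch of how the two failure modes can be told apart in code. This is illustrative only, not the ODS API's actual exception handling: SQL Server reports error 1205 for the deadlock victim and 2601/2627 for duplicate keys, and NHibernate typically wraps the underlying SqlException, so the inner exception also has to be checked.

```csharp
// Illustrative only: distinguishing a deadlock-victim error from an ordinary
// duplicate-key violation on SQL Server.
using System;
using Microsoft.Data.SqlClient;

static class SqlExceptionClassifier
{
    // Unwraps a single layer of wrapping (e.g. NHibernate's ADO exception) to reach the SqlException.
    private static SqlException? AsSqlException(Exception ex)
        => ex as SqlException ?? ex.InnerException as SqlException;

    // Error 1205: this transaction was chosen as the deadlock victim.
    public static bool IsDeadlockVictim(Exception ex)
        => AsSqlException(ex)?.Number == 1205;

    // Errors 2601/2627: duplicate key in a unique index / unique or primary key constraint violation.
    public static bool IsDuplicateKeyViolation(Exception ex)
        => AsSqlException(ex)?.Number is 2601 or 2627;
}
```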

As for why we're seeing deadlocks (with TEA's implementation), there's still more work to be done to analyze that. There's a separate ticket to drop the indexes from the authorization views we have in SQL Server because this has been known to cause problems: https://dba.stackexchange.com/questions/151049/how-to-resolve-sql-server-deadlocks-involving-concurrent-inserts/151442#151442

Since I believe we're dealing with scenarios where the deadlocks are happening because of locks on related resources, and not the scenarios you listed above, it is better to just retry the current database operation than to go back through the whole "upsert" process: re-running all the queries to load the resource's data back into memory, performing the separate authorization round-trip query again, and finally proceeding with (probably) the exact same persistence operation. In all likelihood, that just represents a bunch of redundant work that increases the load on the server at a time when we know it is struggling with the volume of transactions being executed.
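To illustrate the shape of the approach (a hypothetical sketch with made-up names, not the actual code in this PR), a Polly-style retry policy that wraps only the NHibernate persistence operation, and only fires on deadlock-victim errors, could look roughly like this:

```csharp
// Hypothetical sketch: retry only the NHibernate persistence operation when the
// database reports a deadlock, instead of re-running the whole "upsert" pipeline.
using System;
using Microsoft.Data.SqlClient;
using NHibernate;
using Polly;

public class DeadlockRetryingWriter<TEntity>
{
    private readonly ISessionFactory _sessionFactory;

    public DeadlockRetryingWriter(ISessionFactory sessionFactory)
        => _sessionFactory = sessionFactory;

    public void Save(TEntity entity)
    {
        // Retry a few times with a short, growing delay, but only when this
        // transaction was chosen as the deadlock victim (SQL Server error 1205).
        var retryPolicy = Policy
            .Handle<Exception>(IsDeadlockVictim)
            .WaitAndRetry(
                retryCount: 3,
                sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(100 * attempt));

        retryPolicy.Execute(() =>
        {
            // Each attempt uses a fresh session and transaction so the retried
            // INSERT/UPDATE runs against a clean unit of work.
            using var session = _sessionFactory.OpenSession();
            using var tx = session.BeginTransaction();
            session.Save(entity);
            tx.Commit();
        });
    }

    private static bool IsDeadlockVictim(Exception ex)
        => (ex as SqlException ?? ex.InnerException as SqlException)?.Number == 1205;
}
```

The key point is that only the innermost persistence call sits inside the retry; the mapping, validation, and authorization steps higher up the pipeline are not repeated on a deadlock retry.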

Does that answer satisfy your concerns?

axelmarquezh merged commit 13765ad into main on Apr 26, 2024
16 checks passed
axelmarquezh deleted the ODS-6319 branch on April 26, 2024 at 00:07