fix(clustering/sync): avoiding long delay caused by race condition #13896

StarlightIbuki · 2024-11-20T08:55:55Z

Summary

The sync is retried until the local version matches the latest known version.

Checklist

The Pull Request has tests
A changelog file has been added to CHANGELOG/unreleased/kong, or the skip-changelog label has been added to the PR if a changelog is unnecessary.
The Pull Request has backports to all the versions it needs to cover
There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix KAG-5857

chronolaw · 2024-11-20T11:23:07Z

kong/clustering/services/sync/rpc.lua

    return nil, err
  end

  return true
 end


+function sync_once(premature, retry_count)


It has the same name as _M:sync_once. Should we choose another one?

I think it should be the new sync_handler()

No. sync_handler is a name for another function

chronolaw · 2024-11-20T11:26:08Z

Could we write a test case to verify this fix?


StarlightIbuki · 2024-11-21T07:44:45Z

Could we write a test case to verify this fix?

Not easily. We cannot be sure that the race conditions happen when we test. We could observe whether it relieves the flaky tests.

Fix KAG-5857

chronolaw · 2024-11-21T08:18:01Z

kong/clustering/services/sync/rpc.lua

@@ -164,6 +167,8 @@ function _M:init_dp(manager)

    local lmdb_ver = tonumber(declarative.get_current_hash()) or 0
    if lmdb_ver < version then
+      -- set lastest version to shm
+      kong_shm:set(CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY, version)


We only run sync in one worker, is it necessary to store it in shared memory?

This is not true. @dndx Could you confirm?

We run incremental sync inside all workers; it's just that only one worker can sync at a time.
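To illustrate the point in this thread, here is a minimal sketch of why a value that every worker needs to read goes into shared memory. It is not the PR's code: `fake_shm`, `LATEST_VERSION_KEY`, `new_worker`, `on_notify` and `needs_sync` are all hypothetical names, and the shared dict is replaced by a plain table so the example runs in stock Lua. The per-worker `local_version` field is also a simplification of the version lookup shown in the diff.

```lua
-- Stand-in for a shared dict that all workers can read and write.
-- In a real worker this would be a shared-memory zone, not a Lua table.
local store = {}
local fake_shm = {
  set = function(_, key, value) store[key] = value; return true end,
  get = function(_, key) return store[key] end,
}

local LATEST_VERSION_KEY = "latest_version" -- hypothetical key name

-- Hypothetical worker record; local_version stands for the version
-- the worker currently has applied.
local function new_worker(id)
  return { id = id, local_version = 0 }
end

-- Called by whichever worker receives the "new version" notification.
local function on_notify(worker, version)
  if worker.local_version < version then
    fake_shm:set(LATEST_VERSION_KEY, version)
  end
end

-- Any worker, not only the notified one, can decide whether it is behind.
local function needs_sync(worker)
  local latest = fake_shm:get(LATEST_VERSION_KEY) or 0
  return worker.local_version < latest
end

local w1, w2 = new_worker(1), new_worker(2)
on_notify(w1, 7)                    -- worker 1 hears about version 7
local behind_before = needs_sync(w2) -- worker 2 sees it through the shared store
w2.local_version = 7                -- pretend worker 2 finished syncing
local behind_after = needs_sync(w2)
print(behind_before, behind_after)
```

Because the notification may arrive in one worker while another worker holds the sync mutex, a worker-local variable would not be visible to the worker that actually performs the sync.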

chronolaw · 2024-11-21T08:20:03Z

kong/clustering/services/sync/rpc.lua

-
-    return true
-  end)
+  local res, err = concurrency.with_worker_mutex(SYNC_MUTEX_OPTS, do_sync)


Not sure, could we use concurrency.with_coroutine_mutex?

We designed sync.v2 to work without a privileged worker (and without worker no. 0)

#13896 (comment)

chronolaw · 2024-11-21T08:22:45Z

kong/clustering/services/sync/rpc.lua

    return nil, err
  end

  return true
 end


+function sync_once_impl(premature, retry_count)


Could we use a simple loop? For example:

    for i = 1, 5 do
      sync_handler()
      if updated then
        break
      end
      ngx.sleep(0)
    end

No. Recreating the timer prevents a long-lived timer from causing a resource leak.
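The difference between the two approaches in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `timer_at` and `run_timers` are a toy stand-in for a real timer facility, `MAX_RETRY`, `state` and `do_sync` are hypothetical, and only the `(premature, retry_count)` signature is taken from the diff. Each attempt runs in its own short-lived callback and schedules a fresh one if the version still does not match, instead of looping inside a single long-running timer.

```lua
local MAX_RETRY = 5 -- assumed limit, not taken from the PR

-- Toy timer facility: callbacks are queued and executed in FIFO order.
local timers = {}

local function timer_at(delay, fn, arg)
  -- delay is ignored in this toy scheduler
  timers[#timers + 1] = { fn = fn, arg = arg }
  return true
end

local function run_timers()
  local i = 1
  while timers[i] do
    local t = timers[i]
    t.fn(false, t.arg) -- premature = false
    i = i + 1
  end
  timers = {}
end

-- Hypothetical state: the local version lags behind the latest one.
local state = { local_version = 0, latest_version = 3, attempts = 0 }

-- Pretend each sync attempt applies exactly one delta.
local function do_sync()
  state.attempts = state.attempts + 1
  state.local_version = state.local_version + 1
  return true
end

local function sync_once_impl(premature, retry_count)
  if premature then
    return
  end

  do_sync()

  if state.local_version >= state.latest_version then
    return -- caught up, no further timer needed
  end

  retry_count = (retry_count or 0) + 1
  if retry_count > MAX_RETRY then
    return -- give up; a later periodic sync would catch up
  end

  -- This callback ends here; the retry runs in a newly created timer.
  return timer_at(0, sync_once_impl, retry_count)
end

timer_at(0, sync_once_impl, 0)
run_timers()
print(state.local_version, state.attempts)
```

With the loop variant, one timer stays alive for all retries; with this variant, every callback returns promptly and whatever it holds can be released before the next attempt starts.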

pull-request-size bot added the size/M label Nov 20, 2024

github-actions bot added the core/clustering and cherry-pick kong-ee labels Nov 20, 2024

github-actions bot assigned StarlightIbuki Nov 20, 2024

StarlightIbuki requested review from dndx, chronolaw and chobits and removed request for dndx November 20, 2024 08:56

StarlightIbuki added the skip-changelog label Nov 20, 2024

chronolaw changed the title ~~fix(sync): avoiding long delay caused by race condition~~ fix(clustering/sync): avoiding long delay caused by race condition Nov 20, 2024

chronolaw reviewed Nov 20, 2024


fix(sync): avoiding long delay caused by race condition

c9f0ab7

Fix KAG-5857

StarlightIbuki force-pushed the fix/sync-race branch from 6ac9004 to c9f0ab7 Compare November 21, 2024 07:45

chronolaw reviewed Nov 21, 2024

