Leader crashing in multi-node multi-threaded executions #5609
Comments
Might this be related to #5511 (comment)?
It looks like we have growing snapshots that eventually get so large they cause an allocation failure. With a little more logging to track this, here's a representative passing virtual run:
(About the same size of snapshot every time) vs failing SGX:
(Snapshots get larger over time, and eventually we get an alloc failure between copying a snapshot and the host acking it?)
Snapshot growth is due to growth of the
(Note that we're I believe this can happen if the submission rate is higher than the signing rate: if we submit at 15 tx/s and sign every 10 tx, but it takes us 1s to sign those 10 tx, then we get an ever-growing queue of things to sign. We wanted to sign every 10, but by the time we try to, there are 15 tx we need to sign, and more next time, and more the time after that. So the
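To make the arithmetic above concrete, here is a minimal sketch of that feedback loop in Python. It is not CCF code: the function name and parameters are hypothetical, and it assumes that signing cost scales linearly with the number of pending transactions (1s per 10 tx) and that submissions keep arriving at a steady rate while a signature is in progress.

```python
def simulate_signing_backlog(submit_rate, sign_interval, sign_time, rounds):
    """Return the number of tx waiting to be signed at the start of each round.

    submit_rate   -- tx submitted per second (15 in the example above)
    sign_interval -- intended number of tx per signature (10 in the example)
    sign_time     -- seconds needed to sign sign_interval tx (1.0 in the example)
    rounds        -- number of consecutive signing rounds to simulate

    Assumption: signing time grows linearly with the batch size.
    """
    batch = float(sign_interval)  # the first signature covers the intended batch
    history = []
    for _ in range(rounds):
        history.append(batch)
        # Time spent signing the current batch.
        elapsed = batch * sign_time / sign_interval
        # Everything submitted during that time is waiting for the next signature.
        batch = submit_rate * elapsed
    return history


if __name__ == "__main__":
    for i, pending in enumerate(simulate_signing_backlog(15, 10, 1.0, 6)):
        print(f"round {i}: {pending:.2f} tx to sign")
```

Under these assumptions each batch is 1.5x the previous one (10, 15, 22.5, ...), so the backlog grows geometrically whenever the submission rate exceeds the signing throughput. If the submission rate is at or below 10 tx/s, the batch size stays flat or shrinks instead.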
If that's the case, then it's been spotted before in #3871.
Fixed by #5692
Whilst benchmarking against the latest main (a49343a), I get the following error in each test run:
This was using basicperf.py to test a 3 or 5 node service with 10 worker threads each and 6 write clients connected to the primary. Interestingly, I do not get the error when running with 1 node, with 0 worker threads, with just one write client, or with read-only clients, but I do see it every time for 3 or 5 nodes with 6 write clients and 10 worker threads.