
Leader crashing in multi-node multi-threaded executions #5609

Closed
heidihoward opened this issue Sep 1, 2023 · 5 comments

heidihoward commented Sep 1, 2023

Whilst benchmarking against the latest main (a49343a), I get the following error in each test run:

...
10:55:53.271 | ERROR    | infra.network:log_errors:126 - /home/azureuser/heidi/CCF/build-sgx/ws-2023-09-01_10-45-11-w1-n3-i0/pi_basic_mt_sgx_cft^_common/172.23.0.13_pi_basic_mt_sgx_cft^_0_out:225: 2023-09-01T10:55:44.824797Z        100 [fail ] ../src/host/main.cpp:700             | Exception in ccf::run: Failed to call in enclave_run: OE_ENCLAVE_ABORTING
10:55:53.271 | ERROR    | infra.network:log_errors:126 - /home/azureuser/heidi/CCF/build-sgx/ws-2023-09-01_10-45-11-w1-n3-i0/pi_basic_mt_sgx_cft^_common/172.23.0.13_pi_basic_mt_sgx_cft^_0_out:226: 2023-09-01T10:55:44.825560Z -0.006 9   [fail ] ../src/kv/committable_tx.h:249       | Error during serialisation
10:55:53.271 | ERROR    | infra.network:log_errors:126 - /home/azureuser/heidi/CCF/build-sgx/ws-2023-09-01_10-45-11-w1-n3-i0/pi_basic_mt_sgx_cft^_common/172.23.0.13_pi_basic_mt_sgx_cft^_0_out:227: 2023-09-01T10:55:44.825624Z -0.001 9   [fatal] ../src/node/rpc/frontend.h:645       | Failed to serialise
10:55:53.271 | ERROR    | infra.network:log_errors:126 - /home/azureuser/heidi/CCF/build-sgx/ws-2023-09-01_10-45-11-w1-n3-i0/pi_basic_mt_sgx_cft^_common/172.23.0.13_pi_basic_mt_sgx_cft^_0_out:229: 2023-09-01T10:55:45.391753Z        100 [fail ] ../src/ds/messaging.h:170            | Exception while processing message <AdminMessage::fatal_error_msg:1968599046> of size 24
10:55:53.271 | INFO     | infra.network:log_errors:129 - 4 errors found, printing end of output for context:
10:55:53.271 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:41.657350Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_113260_158427.committed [1246836 bytes]
10:55:53.271 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:42.415336Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_134341_185992.committed [1402116 bytes]
10:55:53.272 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:43.334606Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_159311_216733.committed [1678689 bytes]
10:55:53.272 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:44.023709Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_179339_240905.committed [1824844 bytes]
10:55:53.272 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:44.813152Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_201901_268837.committed [1907832 bytes]
10:55:53.272 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:44.824797Z        100 [fail ] ../src/host/main.cpp:700             | Exception in ccf::run: Failed to call in enclave_run: OE_ENCLAVE_ABORTING
10:55:53.272 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:44.825560Z -0.006 9   [fail ] ../src/kv/committable_tx.h:249       | Error during serialisation
10:55:53.272 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:44.825624Z -0.001 9   [fatal] ../src/node/rpc/frontend.h:645       | Failed to serialise
10:55:53.272 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:45.203088Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_225105_286166.committed [2009537 bytes]
10:55:53.272 | INFO     | infra.network:log_errors:133 - 2023-09-01T10:55:45.391753Z        100 [fail ] ../src/ds/messaging.h:170            | Exception while processing message <AdminMessage::fatal_error_msg:1968599046> of size 24
10:55:53.272 | ERROR    | infra.network:log_errors:146 - Contents of /home/azureuser/heidi/CCF/build-sgx/ws-2023-09-01_10-45-11-w1-n3-i0/pi_basic_mt_sgx_cft^_common/172.23.0.13_pi_basic_mt_sgx_cft^_0_err:
cxxrt::bad_alloc
libc++abi: terminating due to uncaught exception of type std::exception: std::exception

This was using basicperf.py to test a 3- or 5-node service with 10 worker threads each and 6 write clients connected to the primary. Interestingly, I do not get the error when running with 1 node, 0 worker threads, a single write client, or read-only clients, but I do see it every time with 3/5 nodes, 6 write clients, and 10 worker threads.

@heidihoward
Member Author

Might this be related to #5511 (comment)?

@heidihoward heidihoward changed the title Error during serialisation Leader crashing in multi-node multi-threaded executions - Error during serialisation Sep 1, 2023
@eddyashton
Member

It looks like we have growing snapshots that eventually get so large they cause an allocation failure. With a little more logging to track this, here's a representative passing virtual run:

2023-09-01T15:49:51.709554Z -0.003 2   [info ] ../src/node/snapshotter.h:183        | Request to allocate snapshot [73733 bytes] for seqno 17, with evidence seqno 18: <sha256 3d8b923f7571acc4392a52ea7abfb817f7a83ef1826495f2f8c7f299531b0495>, ws digest: <sha256 922e4e5228858a4b51b8b53ead77f9430ba60ce62d2e650dcc89513ca4f6854a>
2023-09-01T15:49:51.709564Z -0.001 0   [info ] ../src/node/snapshotter.h:280        | Allocated a 73733 byte snapshot buffer - attempting to write
2023-09-01T15:49:51.709568Z -0.001 0   [info ] ../src/node/snapshotter.h:328        | Successfully copied snapshot at seqno 17 to host memory [73733 bytes]
2023-09-01T15:49:51.892282Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_17_18.committed [75108 bytes]
...
2023-09-01T15:49:56.428772Z -0.001 1   [info ] ../src/node/snapshotter.h:183        | Request to allocate snapshot [313889 bytes] for seqno 203738, with evidence seqno 211685: <sha256 be0c2ce669ff54fe6690781c04968fe9e0e22dcf00ae211f17460f0d54ccbaf8>, ws digest: <sha256 79ce07e631f0896b8b32929ab030055340c9ef52d2bc29ec540cf9778fb3b304>
2023-09-01T15:49:56.429097Z -0.001 0   [info ] ../src/node/snapshotter.h:280        | Allocated a 313889 byte snapshot buffer - attempting to write
2023-09-01T15:49:56.429132Z -0.001 0   [info ] ../src/node/snapshotter.h:328        | Successfully copied snapshot at seqno 203738 to host memory [313889 bytes]
2023-09-01T15:49:56.469794Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_203738_211685.committed [316414 bytes]
...
2023-09-01T15:49:57.823764Z -0.001 8   [info ] ../src/node/snapshotter.h:183        | Request to allocate snapshot [314049 bytes] for seqno 365075, with evidence seqno 370815: <sha256 201c4e77c9bde87205514c01b7a43fb97f85a1528fe1a8167ed21765c9b0c164>, ws digest: <sha256 079ef77ae0d15226d374f4ce9d0700f7827cb227e7a045d8e59e311aa7c6a6cf>
2023-09-01T15:49:57.824385Z -0.002 0   [info ] ../src/node/snapshotter.h:280        | Allocated a 314049 byte snapshot buffer - attempting to write
2023-09-01T15:49:57.824416Z -0.002 0   [info ] ../src/node/snapshotter.h:328        | Successfully copied snapshot at seqno 365075 to host memory [314049 bytes]
2023-09-01T15:49:57.829375Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_365075_370815.committed [316338 bytes]
...
2023-09-01T15:49:59.912031Z -0.002 8   [info ] ../src/node/snapshotter.h:183        | Request to allocate snapshot [314017 bytes] for seqno 588619, with evidence seqno 589109: <sha256 f4db39c416ac1558cb2ee489f4dc499a2ba4026fb171004fa1fad28c03a65de7>, ws digest: <sha256 e0a757c24ae8f8ddc16d3737927ab4f06fabda22ddcf66a1d607eba5087dbbea>
2023-09-01T15:49:59.912238Z -0.002 0   [info ] ../src/node/snapshotter.h:280        | Allocated a 314017 byte snapshot buffer - attempting to write
2023-09-01T15:49:59.912257Z -0.002 0   [info ] ../src/node/snapshotter.h:328        | Successfully copied snapshot at seqno 588619 to host memory [314017 bytes]
2023-09-01T15:49:59.941111Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_588619_589109.committed [316618 bytes]

(Roughly the same snapshot size every time.)

vs failing SGX:

2023-09-01T15:53:49.937177Z -0.003 2   [info ] ../src/node/snapshotter.h:183        | Request to allocate snapshot [144277 bytes] for seqno 19, with evidence seqno 20: <sha256 33a50be57292a46ce7c00d195d1df3973bbd3bf735d28168c49ec539bf49faa9>, ws digest: <sha256 d1fdede8dc0553ee2dfae3d650e9a5434caa44881202728cab6cb385c9d34517>
2023-09-01T15:53:49.937200Z -0.001 0   [info ] ../src/node/snapshotter.h:280        | Allocated a 144277 byte snapshot buffer - attempting to write
2023-09-01T15:53:49.938266Z -0.002 0   [info ] ../src/node/snapshotter.h:328        | Successfully copied snapshot at seqno 19 to host memory [144277 bytes]
2023-09-01T15:53:50.135726Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_19_20.committed [145648 bytes]
...
2023-09-01T15:53:57.457285Z -0.006 10  [info ] ../src/node/snapshotter.h:183        | Request to allocate snapshot [1756081 bytes] for seqno 174740, with evidence seqno 237098: <sha256 c83a899c77d1a9ef5a40e1ea7d692e759a36bcea11a06c5326817a635d25e9f3>, ws digest: <sha256 2d84eb3fc8b5c423521b6abb3c4854fb11e2174376da45f3416503ae64d83203>
2023-09-01T15:53:57.461369Z -0.010 0   [info ] ../src/node/snapshotter.h:280        | Allocated a 1756081 byte snapshot buffer - attempting to write
2023-09-01T15:53:57.461610Z -0.010 0   [info ] ../src/node/snapshotter.h:328        | Successfully copied snapshot at seqno 174740 to host memory [1756081 bytes]
2023-09-01T15:53:57.729807Z        100 [info ] ../src/host/snapshots.h:280          | New snapshot file written to snapshot_131888_185101.committed [1587248 bytes]
...
2023-09-01T15:53:59.286355Z -0.003 3   [info ] ../src/node/snapshotter.h:183        | Request to allocate snapshot [2236401 bytes] for seqno 247102, with evidence seqno 311572: <sha256 da1d57f316e676d5f37a4e8b55a620f70cd5d192b3db0d65fc0ed286298ba24e>, ws digest: <sha256 8104678def969361720769730ede2a01a8ef356ee79b9420eec769b777259aed>
2023-09-01T15:53:59.286820Z -0.001 0   [info ] ../src/node/snapshotter.h:280        | Allocated a 2236401 byte snapshot buffer - attempting to write
2023-09-01T15:53:59.286930Z -0.001 0   [info ] ../src/node/snapshotter.h:328        | Successfully copied snapshot at seqno 247102 to host memory [2236401 bytes]
2023-09-01T15:53:59+0000.626571Z [(H)ERROR] tid(0x7f36e4ffa700) | Unhandled in-enclave exception. To get more information, configure the enclave with CapturePFGPExceptions=1 and enable the in-enclave logging. [/source/openenclave/host/sgx/linux/vdso.c:oe_vdso_user_handler:169]
2023-09-01T15:53:59+0000.626588Z [(H)ERROR] tid(0x7f36e4ffa700) | :OE_FAILURE [/source/openenclave/host/sgx/linux/vdso.c:oe_vdso_enter:247]
2023-09-01T15:53:59+0000.626591Z [(H)ERROR] tid(0x7f36e4ffa700) | :OE_FAILURE [/source/openenclave/host/sgx/calls.c:_do_eenter:201]
2023-09-01T15:53:59+0000.626592Z [(H)ERROR] tid(0x7f36e4ffa700) | :OE_FAILURE [/source/openenclave/host/sgx/calls.c:oe_ecall:631]
2023-09-01T15:53:59+0000.626594Z [(H)ERROR] tid(0x7f36e4ffa700) | :OE_FAILURE [/source/openenclave/host/calls.c:_call_enclave_function_impl:55]
2023-09-01T15:53:59.626699Z        100 [fail ] ../src/host/main.cpp:700             | Exception in ccf::run: Failed to call in enclave_run: OE_FAILURE

(Snapshots get larger over time, and eventually we get an alloc failure between copying a snapshot and the host acking it?)
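For anyone digging into this, a quick way to track the growth is to pull the snapshot sizes straight out of a node's out file. A minimal sketch (a hypothetical helper, not part of the test infra; it just matches the "New snapshot file written to ... [N bytes]" lines shown above):

```python
import re
import sys

# Matches host log lines like:
#   ... | New snapshot file written to snapshot_17_18.committed [75108 bytes]
SNAPSHOT_RE = re.compile(r"New snapshot file written to (\S+\.committed) \[(\d+) bytes\]")

def snapshot_sizes(log_path):
    """Yield (snapshot filename, size in bytes) for each committed snapshot in a node out file."""
    with open(log_path) as f:
        for line in f:
            m = SNAPSHOT_RE.search(line)
            if m:
                yield m.group(1), int(m.group(2))

if __name__ == "__main__":
    # Usage: python3 snapshot_sizes.py <node out file>
    for name, size in snapshot_sizes(sys.argv[1]):
        print(f"{name}\t{size}")
```

In the logs above, the virtual run stays at roughly 316 kB per snapshot while the SGX run climbs past 1.9 MB.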

@eddyashton
Member

Snapshot growth is due to growth of the internal.tree table that's part of signature transactions:

$ ls -alh build.virtual/workspace/pi_basic_mt_virtual_cft^_0/0.snapshots/snapshot_588619_589109.committed
-rw-rw-r-- 1 azureuser azureuser 310K Sep  1 15:49 'build.virtual/workspace/pi_basic_mt_virtual_cft^_0/0.snapshots/snapshot_588619_589109.committed'
$ python3 ~/.local/bin/read_ledger.py -s build.virtual/workspace/pi_basic_mt_virtual_cft^_0/0.snapshots/snapshot_588619_589109.committed -t ".*internal.tree.*" | wc -c
460170

$ ls -alh build.sgx/workspace/pi_basic_mt_sgx_cft^_0/0.snapshots/snapshot_526026_575190.committed
-rw-rw-r-- 1 azureuser azureuser 2.1M Sep  1 16:05 'build.sgx/workspace/pi_basic_mt_sgx_cft^_0/0.snapshots/snapshot_526026_575190.committed'
$ python3 ~/.local/bin/read_ledger.py -s build.sgx/workspace/pi_basic_mt_sgx_cft^_0/0.snapshots/snapshot_526026_575190.committed -t ".*internal.tree.*" | wc -c
5429438

(Note that we're wc-ing read_ledger's dump of the table, which is a Python bytes repr, so it contains a lot of \xNN escape sequences and isn't a count of the actual bytes in the serialised table, but it scales similarly.)

I believe this can happen if the submission rate is higher than the signing rate: if we submit at 15 txs/s and sign every 10 txs, but it takes us 1s to sign those 10 txs, then we get an ever-growing queue of things to sign. We wanted to sign every 10, but by the time we try there are 15 txs to sign, more the next time, and more the time after that, so the internal.tree table grows. We don't have any good signal for when this overload happens, and I don't know what's changed to make the existing perf test produce overload recently (apart from, probably, OpenSSL 3).
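A toy simulation of that feedback loop, using the illustrative numbers from the paragraph above (nothing CCF-specific; the `submit_rate`/`sign_throughput` model is an assumption about how the backlog compounds):

```python
# Toy model of the overload described above (not CCF code): transactions
# arrive at `submit_rate` tx/s and signing proceeds at `sign_throughput`
# tx/s (10 txs per 1s signature in the example). Each signature has to
# cover everything that arrived while the previous one was being produced.

def simulate(submit_rate=15.0, sign_throughput=10.0, rounds=8):
    pending = 10.0  # we intended to sign every 10 txs
    for i in range(1, rounds + 1):
        sign_time = pending / sign_throughput   # time to sign the current backlog
        print(f"round {i}: signing {pending:.0f} txs takes {sign_time:.2f}s")
        pending = submit_rate * sign_time       # txs that arrived in the meantime

simulate()
# With submit_rate > sign_throughput the backlog grows by a factor of
# submit_rate / sign_throughput (1.5x here) every round, so the tree over
# unsigned entries (the internal.tree table) keeps growing too.
```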

@heidihoward heidihoward changed the title Leader crashing in multi-node multi-threaded executions - Error during serialisation Leader crashing in multi-node multi-threaded executions Sep 1, 2023
@achamayou
Member

If that's the case, then it's been spotted before in #3871.

@achamayou achamayou added the bug label Sep 8, 2023
@achamayou achamayou added this to the 4.x milestone Sep 8, 2023
@achamayou
Member

Fixed by #5692
