Use write-lock in (*TxPriorityQueue).ReapMax funcs #209

sigv · 2024-03-11T21:00:09Z

Describe your changes and provide context

ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim

Transactions returned are not removed from the mempool transaction store or indexes.

However, they use a priority queue to accomplish the claim

Transaction are retrieved in priority order.

This is accomplished by popping all items out of the whole heap, and then pushing then back in sequentially. A copy of the heap cannot be obtained otherwise. Both of the mentioned functions use a read-lock (RLock) when doing this. This results in a potential scenario where multiple executions of the ReapMax can be started in parallel, and both would be popping items out of the priority queue.

In practice, this can be abused by executing the unconfirmed_txs RPC call repeatedly. Based on our observations, running it multiple times per millisecond results in multiple threads picking it up at the same time. Such a scenario can be obtained via the WebSocket interface, and spamming unconfirmed_txs calls there. The behavior that happens is a Panic in WSJSONRPC handler when a queue item unexpectedly disappears for mempool.(*TxPriorityQueue).Swap.
(runtime error: index out of range [0] with length 0)

This can additionally lead to a CONSENSUS FAILURE!!! if the race condition occurs for internal/consensus.(*State).finalizeCommit when it tries to do mempool.(*TxPriorityQueue).RemoveTx, but the ReapMax has already removed all elements from the underlying heap. (runtime error: index out of range [-1])

This commit switches the lock type to a write-lock (Lock) to ensure no parallel modifications take place. This commit additionally updates the tests to allow parallel execution of the func calls in testing, as to prevent regressions (in case someone wants to downgrade the locks without considering the implications from the underlying heap usage).

Testing performed to validate your change

As mentioned, we have modified the tests to allow parallelization via go test -race flag.

We have run a full node with the change for an extended period, and are not able to reproduce the issue as described above.

ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim > Transactions returned are not removed from the mempool transaction > store or indexes. However, they use a priority queue to accomplish the claim > Transaction are retrieved in priority order. This is accomplished by popping all items out of the whole heap, and then pushing then back in sequentially. A copy of the heap cannot be obtained otherwise. Both of the mentioned functions use a read-lock (RLock) when doing this. This results in a potential scenario where multiple executions of the ReapMax can be started in parallel, and both would be popping items out of the priority queue. In practice, this can be abused by executing the `unconfirmed_txs` RPC call repeatedly. Based on our observations, running it multiple times per millisecond results in multiple threads picking it up at the same time. Such a scenario can be obtained via the WebSocket interface, and spamming `unconfirmed_txs` calls there. The behavior that happens is a `Panic in WSJSONRPC handler` when a queue item unexpectedly disappears for `mempool.(*TxPriorityQueue).Swap`. (`runtime error: index out of range [0] with length 0`) This can additionally lead to a `CONSENSUS FAILURE!!!` if the race condition occurs for `internal/consensus.(*State).finalizeCommit` when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but the ReapMax has already removed all elements from the underlying heap. (`runtime error: index out of range [-1]`) This commit switches the lock type to a write-lock (Lock) to ensure no parallel modifications take place. This commit additionally updates the tests to allow parallel execution of the func calls in testing, as to prevent regressions (in case someone wants to downgrade the locks without considering the implications from the underlying heap usage).

codecov-commenter · 2024-03-12T02:08:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.06%. Comparing base (8061a47) to head (6787020).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #209      +/-   ##
==========================================
- Coverage   58.10%   58.06%   -0.04%     
==========================================
  Files         249      249              
  Lines       33915    33915              
==========================================
- Hits        19705    19693      -12     
- Misses      12648    12657       +9     
- Partials     1562     1565       +3

Files	Coverage Δ
internal/mempool/mempool.go	`70.05% <100.00%> (ø)`

... and 19 files with indirect coverage changes

This reverts commit 3cc1293.

ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim > Transactions returned are not removed from the mempool transaction > store or indexes. However, they use a priority queue to accomplish the claim > Transaction are retrieved in priority order. This is accomplished by popping all items out of the whole heap, and then pushing then back in sequentially. A copy of the heap cannot be obtained otherwise. Both of the mentioned functions use a read-lock (RLock) when doing this. This results in a potential scenario where multiple executions of the ReapMax can be started in parallel, and both would be popping items out of the priority queue. In practice, this can be abused by executing the `unconfirmed_txs` RPC call repeatedly. Based on our observations, running it multiple times per millisecond results in multiple threads picking it up at the same time. Such a scenario can be obtained via the WebSocket interface, and spamming `unconfirmed_txs` calls there. The behavior that happens is a `Panic in WSJSONRPC handler` when a queue item unexpectedly disappears for `mempool.(*TxPriorityQueue).Swap`. (`runtime error: index out of range [0] with length 0`) This can additionally lead to a `CONSENSUS FAILURE!!!` if the race condition occurs for `internal/consensus.(*State).finalizeCommit` when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but the ReapMax has already removed all elements from the underlying heap. (`runtime error: index out of range [-1]`) This commit switches the lock type to a write-lock (Lock) to ensure no parallel modifications take place. This commit additionally updates the tests to allow parallel execution of the func calls in testing, as to prevent regressions (in case someone wants to downgrade the locks without considering the implications from the underlying heap usage).

* add heapIndex with safety check * cleanup * comment out for perf test * add back perf improvement * fix nil test * Use write-lock in (*TxPriorityQueue).ReapMax funcs (#209) ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim > Transactions returned are not removed from the mempool transaction > store or indexes. However, they use a priority queue to accomplish the claim > Transaction are retrieved in priority order. This is accomplished by popping all items out of the whole heap, and then pushing then back in sequentially. A copy of the heap cannot be obtained otherwise. Both of the mentioned functions use a read-lock (RLock) when doing this. This results in a potential scenario where multiple executions of the ReapMax can be started in parallel, and both would be popping items out of the priority queue. In practice, this can be abused by executing the `unconfirmed_txs` RPC call repeatedly. Based on our observations, running it multiple times per millisecond results in multiple threads picking it up at the same time. Such a scenario can be obtained via the WebSocket interface, and spamming `unconfirmed_txs` calls there. The behavior that happens is a `Panic in WSJSONRPC handler` when a queue item unexpectedly disappears for `mempool.(*TxPriorityQueue).Swap`. (`runtime error: index out of range [0] with length 0`) This can additionally lead to a `CONSENSUS FAILURE!!!` if the race condition occurs for `internal/consensus.(*State).finalizeCommit` when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but the ReapMax has already removed all elements from the underlying heap. (`runtime error: index out of range [-1]`) This commit switches the lock type to a write-lock (Lock) to ensure no parallel modifications take place. This commit additionally updates the tests to allow parallel execution of the func calls in testing, as to prevent regressions (in case someone wants to downgrade the locks without considering the implications from the underlying heap usage). --------- Co-authored-by: Valters Jansons <[email protected]>

* reformat logs to use simple concatenation with separators (#207) * Use write-lock in (*TxPriorityQueue).ReapMax funcs (#209) ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim > Transactions returned are not removed from the mempool transaction > store or indexes. However, they use a priority queue to accomplish the claim > Transaction are retrieved in priority order. This is accomplished by popping all items out of the whole heap, and then pushing then back in sequentially. A copy of the heap cannot be obtained otherwise. Both of the mentioned functions use a read-lock (RLock) when doing this. This results in a potential scenario where multiple executions of the ReapMax can be started in parallel, and both would be popping items out of the priority queue. In practice, this can be abused by executing the `unconfirmed_txs` RPC call repeatedly. Based on our observations, running it multiple times per millisecond results in multiple threads picking it up at the same time. Such a scenario can be obtained via the WebSocket interface, and spamming `unconfirmed_txs` calls there. The behavior that happens is a `Panic in WSJSONRPC handler` when a queue item unexpectedly disappears for `mempool.(*TxPriorityQueue).Swap`. (`runtime error: index out of range [0] with length 0`) This can additionally lead to a `CONSENSUS FAILURE!!!` if the race condition occurs for `internal/consensus.(*State).finalizeCommit` when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but the ReapMax has already removed all elements from the underlying heap. (`runtime error: index out of range [-1]`) This commit switches the lock type to a write-lock (Lock) to ensure no parallel modifications take place. This commit additionally updates the tests to allow parallel execution of the func calls in testing, as to prevent regressions (in case someone wants to downgrade the locks without considering the implications from the underlying heap usage). * Fix root dir for tendermint reindex command (#210) * Replay events during restart to avoid tx missing (#211) --------- Co-authored-by: Denys S <[email protected]> Co-authored-by: Valters Jansons <[email protected]> Co-authored-by: Yiming Zang <[email protected]>

* add heapIndex with safety check * cleanup * comment out for perf test * add back perf improvement * fix nil test * Use write-lock in (*TxPriorityQueue).ReapMax funcs (#209) ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim > Transactions returned are not removed from the mempool transaction > store or indexes. However, they use a priority queue to accomplish the claim > Transaction are retrieved in priority order. This is accomplished by popping all items out of the whole heap, and then pushing then back in sequentially. A copy of the heap cannot be obtained otherwise. Both of the mentioned functions use a read-lock (RLock) when doing this. This results in a potential scenario where multiple executions of the ReapMax can be started in parallel, and both would be popping items out of the priority queue. In practice, this can be abused by executing the `unconfirmed_txs` RPC call repeatedly. Based on our observations, running it multiple times per millisecond results in multiple threads picking it up at the same time. Such a scenario can be obtained via the WebSocket interface, and spamming `unconfirmed_txs` calls there. The behavior that happens is a `Panic in WSJSONRPC handler` when a queue item unexpectedly disappears for `mempool.(*TxPriorityQueue).Swap`. (`runtime error: index out of range [0] with length 0`) This can additionally lead to a `CONSENSUS FAILURE!!!` if the race condition occurs for `internal/consensus.(*State).finalizeCommit` when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but the ReapMax has already removed all elements from the underlying heap. (`runtime error: index out of range [-1]`) This commit switches the lock type to a write-lock (Lock) to ensure no parallel modifications take place. This commit additionally updates the tests to allow parallel execution of the func calls in testing, as to prevent regressions (in case someone wants to downgrade the locks without considering the implications from the underlying heap usage). --------- Co-authored-by: Valters Jansons <[email protected]>

* Make ReadMaxTxs atomic (#166) * Support pending transaction in mempool (#169) * fix unconfirmed tx to consider pending txs (#172) * fix pending pop (#173) * add TTL for pending txs (#174) * [EVM] Fix evm pending nonce (#179) * Perf: Increase buffer size for pubsub server to boost performance (#167) * Increase buffer size for pubsub server * Add more timeout for test failure * Add more timeout * Fix test split scripts * Fix test split * Fix unit test * Unit test * Unit test * [P2P] Optimize block pool requester retry and peer pick up logic (#170) * P2P Improvements: Fix block sync reactor and block pool retry logic * Revert "Add event data to result event (#165)" (#176) This reverts commit 72bb29c. * Fix block sync auto restart not working as expected (#175) * Fix edge case for blocksync (#178) * fix evm pending nonce * fix test * deflake a test * de-flake test * Revert "merge main" This reverts commit 58b9424, reversing changes made to 02d1478. * consider keep-in-cache logic when removing from cache * undo test tweaks --------- Co-authored-by: Yiming Zang <[email protected]> Co-authored-by: Jeremy Wei <[email protected]> * Fix bug when popping pending TXs (#188) * Add mempool metrics for number of pending tx and expired txs (#189) * Add metrics for mempool pending transaction size * Add expired tx count metrics * [EVM] Allow multiple txs from same account in a block (#190) * add mempool prioritization with evm nonce * fix priority stability * index fixes * replace with binary search insert * impl binary search * fix removeTx to push next queued evm tx (#191) * fix expire metric (#193) * [EVM] Fix duplicate evm txs from priority queue (#195) * debug duplicate evm tx * add more logs * add some \ns * more logs * fix swap check * add-lockable-reap-by-gas * add invariant checks * fix invariant parenthesis * fix log * remove invalid invariant * fix nonce ordering pain * handle ordering of insert * fix remove * cleanup * fix imports * cleanup * avoid getTransactionByHash(hash) panic due to index * use Key() to compare instead of pointer * [EVM] prevent duplicate txs from getting inserted (#196) * prevent duplicates in mempool * use timestamp in priority queue * [EVM] Add logging for expiration (#198) * add logging for expired txs * cleanup * [EVM] Avoid returning nil transactions on ForEach (#197) * remove heapIndex to avoid nil scenario * avoid returning nil in loop (mimic Peek) * call callback from mempool (#200) * separate limit for pending tx (#202) * Add EVM txs eviction logic (#204) * Fix debug log (#205) * EVM transaction replacement (#206) (#208) * Add heapIndex with safety check (#213) * add heapIndex with safety check * cleanup * comment out for perf test * add back perf improvement * fix nil test * Use write-lock in (*TxPriorityQueue).ReapMax funcs (#209) ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim > Transactions returned are not removed from the mempool transaction > store or indexes. However, they use a priority queue to accomplish the claim > Transaction are retrieved in priority order. This is accomplished by popping all items out of the whole heap, and then pushing then back in sequentially. A copy of the heap cannot be obtained otherwise. Both of the mentioned functions use a read-lock (RLock) when doing this. This results in a potential scenario where multiple executions of the ReapMax can be started in parallel, and both would be popping items out of the priority queue. In practice, this can be abused by executing the `unconfirmed_txs` RPC call repeatedly. Based on our observations, running it multiple times per millisecond results in multiple threads picking it up at the same time. Such a scenario can be obtained via the WebSocket interface, and spamming `unconfirmed_txs` calls there. The behavior that happens is a `Panic in WSJSONRPC handler` when a queue item unexpectedly disappears for `mempool.(*TxPriorityQueue).Swap`. (`runtime error: index out of range [0] with length 0`) This can additionally lead to a `CONSENSUS FAILURE!!!` if the race condition occurs for `internal/consensus.(*State).finalizeCommit` when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but the ReapMax has already removed all elements from the underlying heap. (`runtime error: index out of range [-1]`) This commit switches the lock type to a write-lock (Lock) to ensure no parallel modifications take place. This commit additionally updates the tests to allow parallel execution of the func calls in testing, as to prevent regressions (in case someone wants to downgrade the locks without considering the implications from the underlying heap usage). --------- Co-authored-by: Valters Jansons <[email protected]> * Pending Txs Update Condition (#214) * Add metrics for mempool size changes (#220) * [EVM] Adjust locking for replacement (#224) * Remove tx from cache when canAddPendingTx fails (#230) * add tx hash to evm info proto (#231) --------- Co-authored-by: codchen <[email protected]> Co-authored-by: Steven Landers <[email protected]> Co-authored-by: Yiming Zang <[email protected]> Co-authored-by: Jeremy Wei <[email protected]> Co-authored-by: Valters Jansons <[email protected]> Co-authored-by: Kartik Bhat <[email protected]>

* add heapIndex with safety check * cleanup * comment out for perf test * add back perf improvement * fix nil test * Use write-lock in (*TxPriorityQueue).ReapMax funcs (#209) ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim > Transactions returned are not removed from the mempool transaction > store or indexes. However, they use a priority queue to accomplish the claim > Transaction are retrieved in priority order. This is accomplished by popping all items out of the whole heap, and then pushing then back in sequentially. A copy of the heap cannot be obtained otherwise. Both of the mentioned functions use a read-lock (RLock) when doing this. This results in a potential scenario where multiple executions of the ReapMax can be started in parallel, and both would be popping items out of the priority queue. In practice, this can be abused by executing the `unconfirmed_txs` RPC call repeatedly. Based on our observations, running it multiple times per millisecond results in multiple threads picking it up at the same time. Such a scenario can be obtained via the WebSocket interface, and spamming `unconfirmed_txs` calls there. The behavior that happens is a `Panic in WSJSONRPC handler` when a queue item unexpectedly disappears for `mempool.(*TxPriorityQueue).Swap`. (`runtime error: index out of range [0] with length 0`) This can additionally lead to a `CONSENSUS FAILURE!!!` if the race condition occurs for `internal/consensus.(*State).finalizeCommit` when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but the ReapMax has already removed all elements from the underlying heap. (`runtime error: index out of range [-1]`) This commit switches the lock type to a write-lock (Lock) to ensure no parallel modifications take place. This commit additionally updates the tests to allow parallel execution of the func calls in testing, as to prevent regressions (in case someone wants to downgrade the locks without considering the implications from the underlying heap usage). --------- Co-authored-by: Valters Jansons <[email protected]>

sigv force-pushed the lock-prioq-reapmax branch from b161df0 to 6787020 Compare March 11, 2024 21:00

philipsu522 approved these changes Mar 12, 2024

View reviewed changes

yzang2019 merged commit 3cc1293 into sei-protocol:main Mar 12, 2024
22 checks passed

sigv deleted the lock-prioq-reapmax branch March 12, 2024 02:12

philipsu522 added a commit that referenced this pull request Mar 20, 2024

Revert "Use write-lock in (*TxPriorityQueue).ReapMax funcs (#209)"

dac62f1

This reverts commit 3cc1293.

stevenlanders mentioned this pull request Mar 21, 2024

Add heapIndex with safety check #213

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use write-lock in (*TxPriorityQueue).ReapMax funcs #209

Use write-lock in (*TxPriorityQueue).ReapMax funcs #209

sigv commented Mar 11, 2024

codecov-commenter commented Mar 12, 2024

Use write-lock in (*TxPriorityQueue).ReapMax funcs #209

Use write-lock in (*TxPriorityQueue).ReapMax funcs #209

Conversation

sigv commented Mar 11, 2024

Describe your changes and provide context

Testing performed to validate your change

codecov-commenter commented Mar 12, 2024

Codecov Report