-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flakes #11 #7893
Fix flakes #11 #7893
Conversation
Uhm, I'm not a fan of disabling force-flaky in CI: we know we have flaky tests, despite all of our efforts to fix them up. Having flaky tests means that developers working on unrelated things get hit with random CI failures, which is confusing and can lead to lengthy, not productive attempts to debug, when it isn't their fault, ultimately leading to a frustrating experience for everyone involved. I have started traking flakiness at http://35.239.136.52:3170/run, and was hoping to close the loop by having issues being created for tests that are exceedingly flaky, and assign them to the author that owns most of its lines in Anyway, please reconsider removing |
This should help us fix the flakes. Dealing with them was a major headache at the end of the last release (and they covered up real bugs!). Signed-off-by: Rusty Russell <[email protected]>
The lifetime is bound to the response, but memleak gets upset if it sees it. Use the standard take() pattern insteas. ``` lightningd-5 2024-11-28T05:02:27.503Z **BROKEN** lightningd: MEMLEAK: 0x55b58f329338 lightningd-5 2024-11-28T05:02:27.503Z **BROKEN** lightningd: label=common/utils.c:150:char[] lightningd-5 2024-11-28T05:02:27.504Z **BROKEN** lightningd: alloc: lightningd-5 2024-11-28T05:02:27.585Z **BROKEN** lightningd: ccan/ccan/tal/tal.c:488 (tal_alloc_) lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd: ccan/ccan/tal/tal.c:517 (tal_alloc_arr_) lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd: ccan/ccan/tal/tal.c:861 (tal_dup_) lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd: common/utils.c:150 (utf8_str) lightningd-5 2024-11-28T05:02:27.589Z **BROKEN** lightningd: lightningd/datastore.c:23 (json_add_datastore) lightningd-5 2024-11-28T05:02:27.589Z **BROKEN** lightningd: lightningd/datastore.c:205 (json_datastore) lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd: lightningd/jsonrpc.c:808 (command_exec) lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd: lightningd/jsonrpc.c:954 (rpc_command_hook_final) lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd: lightningd/plugin_hook.c:196 (plugin_hook_call_next) lightningd-5 2024-11-28T05:02:27.591Z **BROKEN** lightningd: lightningd/plugin_hook.c:183 (plugin_hook_callback) lightningd-5 2024-11-28T05:02:27.593Z **BROKEN** lightningd: lightningd/plugin.c:663 (plugin_response_handle) lightningd-5 2024-11-28T05:02:27.593Z **BROKEN** lightningd: lightningd/plugin.c:775 (plugin_read_json_one) ``` Signed-off-by: Rusty Russell <[email protected]>
It's possible that listchannels doesn't show the channel yet: ``` l1.rpc.connect(l2.info['id'], 'localhost', l2.port) l2.rpc.connect(l3.info['id'], 'localhost', l3.port) scid1, _ = l1.fundchannel(l2, 10**6) scid2, _ = l2.fundchannel(l3, 10**6) mine_funding_to_announce(bitcoind, [l1, l2, l3]) > l1_initial_cupdate_timestamp = only_one(l1.rpc.listchannels(source=l1.info['id'])['channels'])['last_update'] tests/test_gossip.py:43: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ arr = [] def only_one(arr): """Many JSON RPC calls return an array; often we only expect a single entry """ > assert len(arr) == 1 E AssertionError ``` Signed-off-by: Rusty Russell <[email protected]>
We can get the reply_short_channel_ids_end in the messages when we make a query: ``` 2024-11-29T07:39:28.8550652Z time_fast = time.time() - start_fast 2024-11-29T07:39:28.8551067Z assert time_fast < 2 2024-11-29T07:39:28.8551487Z out3 = [m for m in out3 if not m.startswith(b'0109')] 2024-11-29T07:39:28.8552158Z > assert set(out1) == set(out3) ... 2024-11-29T07:39:28.8675516Z E Extra items in the right set: 2024-11-29T07:39:28.8675887Z E b'010606226e46111a0b59caaf126043eb5bbf28c34f3a5e332a1fc7b2b73cf188910f01' ``` Signed-off-by: Rusty Russell <[email protected]>
Unfortunately, experience has shown this to be wrong. The failures build up, and get covered up by flaky. They creep in. Everyone hits retry. By release time, it becomes chronic. No CI passes. But the cause is now far removed: many of the flakes are caused by month-old commits. I spend days going through the CI failures, and finding real bugs. Often I'm not the code author, so it takes even more work. This is what happens in practice, and I'm sick of it. Instead, let's try failing early, and we can hopefully get people to fix their tests. Instead of ignoring CI like we do today :( |
652e655
to
a5c0c43
Compare
Keep a proper cache of all possible ones. I think this may be the timeout problem: according to the logs, channeld_fakenet stops responding and thus HTLCs eventually time out. ``` ``` 2024-12-16T23:16:16.4874420Z lightningd-1 2024-12-16T22:45:14.068Z UNUSUAL 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Adding HTLC 18446744073709551615 too slow: killing connection ``` Signed-off-by: Rusty Russell <[email protected]>
0ca513f
to
92168db
Compare
Make sure balancing payment is fully cleared before trying to get a routeL ``` def test_penalty_htlc_tx_fulfill(node_factory, bitcoind, chainparams, anchors): # now we send one 'sticky' htlc: l4->l1 amt = 10**8 // 2 sticky_inv = l1.rpc.invoice(amt, '2', 'sticky') > route = l4.rpc.getroute(l1.info['id'], amt, 1)['route'] tests/test_closing.py:1232: > raise RpcError(method, payload, resp['error']) E pyln.client.lightning.RpcError: RPC call failed: method: getroute, payload: {'id': '0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518', 'amount_msat': 50000000, 'riskfactor': 1, 'cltv': 9}, error: {'code': 205, 'message': 'Could not find a route'} ```
Just because we've seen the block doesn't mean onchaind has finished starting up. ``` _____________________________ test_restorefrompeer _____________________________ [gw0] linux -- Python 3.10.15 /home/runner/.cache/pypoetry/virtualenvs/cln-meta-project-AqJ9wMix-py3.10/bin/python node_factory = <pyln.testing.utils.NodeFactory object at 0x7fb8f3887f70> bitcoind = <pyln.testing.utils.BitcoinD object at 0x7fb8f3886f50> @unittest.skipIf(os.getenv('TEST_DB_PROVIDER', 'sqlite3') != 'sqlite3', "deletes database, which is assumed sqlite3") def test_restorefrompeer(node_factory, bitcoind): """ Test restorefrompeer """ l1, l2 = node_factory.get_nodes(2, [{'broken_log': 'ERROR: Unknown commitment #.*, recovering our funds!', 'experimental-peer-storage': None, 'may_reconnect': True, 'allow_bad_gossip': True}, {'experimental-peer-storage': None, 'may_reconnect': True}]) l1.rpc.connect(l2.info['id'], 'localhost', l2.port) c12, _ = l1.fundchannel(l2, 10**5) assert l1.daemon.is_in_log('Peer storage sent!') assert l2.daemon.is_in_log('Peer storage sent!') l1.stop() os.unlink(os.path.join(l1.daemon.lightning_dir, TEST_NETWORK, "lightningd.sqlite3")) l1.start() assert l1.daemon.is_in_log('Server started with public key') # If this happens fast enough, connect fails with "disconnected # during connection" try: l1.rpc.connect(l2.info['id'], 'localhost', l2.port) except RpcError as err: assert "disconnected during connection" in err.error['message'] l1.daemon.wait_for_log('peer_in WIRE_YOUR_PEER_STORAGE') assert l1.rpc.restorefrompeer()['stubs'][0] == _['channel_id'] l1.daemon.wait_for_log('peer_out WIRE_ERROR') l2.daemon.wait_for_log('State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL') bitcoind.generate_block(5, wait_for_mempool=1) sync_blockheight(bitcoind, [l1, l2]) l1.daemon.wait_for_log(r'All outputs resolved.*') wait_for(lambda: l1.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN") # Check if funds are recovered. assert l1.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN" > assert l2.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN" E AssertionError: assert 'FUNDING_SPEND_SEEN' == 'ONCHAIN' E - ONCHAIN E + FUNDING_SPEND_SEEN tests/test_misc.py:3044: AssertionError ``` Signed-off-by: Rusty Russell <[email protected]>
Changelog-None: just tests.