
Fix flakes #11 #7893

Merged: 7 commits into ElementsProject:master on Dec 20, 2024

Conversation

rustyrussell (Contributor)

Changelog-None: just tests.

rustyrussell added this to the v25.02 milestone on Dec 2, 2024
cdecker (Member) commented Dec 6, 2024

Uhm, I'm not a fan of disabling force-flaky in CI: we know we have flaky tests, despite all of our efforts to fix them up. Having flaky tests means that developers working on unrelated things get hit with random CI failures, which is confusing and can lead to lengthy, unproductive attempts to debug when it isn't their fault, ultimately leading to a frustrating experience for everyone involved.

I have started tracking flakiness at http://35.239.136.52:3170/run, and was hoping to close the loop by having issues created for tests that are exceedingly flaky and assigning them to the author who owns most of the test's lines in git-blame. But alas, it's one of those ambitious side projects that never got to where it was intended to go.

Anyway, please reconsider removing --force-flaky from CI as I think the knock-on effects of an elevated flaky test failure rate are quite damaging.

This should help us fix the flakes.  Dealing with them was a major
headache at the end of the last release (and they covered up real bugs!).

Signed-off-by: Rusty Russell <[email protected]>
The lifetime is bound to the response, but memleak gets upset if it
sees it.  Use the standard take() pattern instead.

```
lightningd-5 2024-11-28T05:02:27.503Z **BROKEN** lightningd: MEMLEAK: 0x55b58f329338
lightningd-5 2024-11-28T05:02:27.503Z **BROKEN** lightningd:   label=common/utils.c:150:char[]
lightningd-5 2024-11-28T05:02:27.504Z **BROKEN** lightningd:   alloc:
lightningd-5 2024-11-28T05:02:27.585Z **BROKEN** lightningd:     ccan/ccan/tal/tal.c:488 (tal_alloc_)
lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd:     ccan/ccan/tal/tal.c:517 (tal_alloc_arr_)
lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd:     ccan/ccan/tal/tal.c:861 (tal_dup_)
lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd:     common/utils.c:150 (utf8_str)
lightningd-5 2024-11-28T05:02:27.589Z **BROKEN** lightningd:     lightningd/datastore.c:23 (json_add_datastore)
lightningd-5 2024-11-28T05:02:27.589Z **BROKEN** lightningd:     lightningd/datastore.c:205 (json_datastore)
lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd:     lightningd/jsonrpc.c:808 (command_exec)
lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd:     lightningd/jsonrpc.c:954 (rpc_command_hook_final)
lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd:     lightningd/plugin_hook.c:196 (plugin_hook_call_next)
lightningd-5 2024-11-28T05:02:27.591Z **BROKEN** lightningd:     lightningd/plugin_hook.c:183 (plugin_hook_callback)
lightningd-5 2024-11-28T05:02:27.593Z **BROKEN** lightningd:     lightningd/plugin.c:663 (plugin_response_handle)
lightningd-5 2024-11-28T05:02:27.593Z **BROKEN** lightningd:     lightningd/plugin.c:775 (plugin_read_json_one)
```

Signed-off-by: Rusty Russell <[email protected]>
It's possible that listchannels doesn't show the channel yet:

```
    
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
        l2.rpc.connect(l3.info['id'], 'localhost', l3.port)
    
        scid1, _ = l1.fundchannel(l2, 10**6)
        scid2, _ = l2.fundchannel(l3, 10**6)
    
        mine_funding_to_announce(bitcoind, [l1, l2, l3])
>       l1_initial_cupdate_timestamp = only_one(l1.rpc.listchannels(source=l1.info['id'])['channels'])['last_update']

tests/test_gossip.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = []

    def only_one(arr):
        """Many JSON RPC calls return an array; often we only expect a single entry
        """
>       assert len(arr) == 1
E       AssertionError
```
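
A minimal sketch of the kind of wait that avoids this race, assuming the standard pyln.testing helpers `wait_for` and `only_one` (the actual fix in the commit may differ):

```
# Sketch (assumes pyln.testing helpers): wait until the channel shows up
# in listchannels before reading its last_update, instead of assuming it
# is already there.
from pyln.testing.utils import only_one, wait_for

wait_for(lambda: len(l1.rpc.listchannels(source=l1.info['id'])['channels']) == 1)
l1_initial_cupdate_timestamp = only_one(
    l1.rpc.listchannels(source=l1.info['id'])['channels'])['last_update']
```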

Signed-off-by: Rusty Russell <[email protected]>
We can get the reply_short_channel_ids_end in the messages when
we make a query:

```
2024-11-29T07:39:28.8550652Z         time_fast = time.time() - start_fast
2024-11-29T07:39:28.8551067Z         assert time_fast < 2
2024-11-29T07:39:28.8551487Z         out3 = [m for m in out3 if not m.startswith(b'0109')]
2024-11-29T07:39:28.8552158Z >       assert set(out1) == set(out3)
...
2024-11-29T07:39:28.8675516Z E         Extra items in the right set:
2024-11-29T07:39:28.8675887Z E         b'010606226e46111a0b59caaf126043eb5bbf28c34f3a5e332a1fc7b2b73cf188910f01'
```
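
The extra `b'0106...'` is a reply_short_channel_ids_end (type 262, hex 0x0106), just as 0109 is gossip_timestamp_filter. One plausible test-side fix, sketched here against the `out1`/`out3` lists from the test (the exact filtering in the commit may differ):

```
# Sketch: strip timing-dependent query replies from both captures before
# comparing, not just the gossip_timestamp_filter (0109) messages.
def strip_timing_dependent(msgs):
    return {m for m in msgs
            if not m.startswith(b'0109') and not m.startswith(b'0106')}

assert strip_timing_dependent(out1) == strip_timing_dependent(out3)
```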

Signed-off-by: Rusty Russell <[email protected]>
rustyrussell (Contributor, Author) commented Dec 18, 2024

> Uhm, I'm not a fan of disabling force-flaky in CI: we know we have flaky tests, despite all of our efforts to fix them up. Having flaky tests means that developers working on unrelated things get hit with random CI failures, which is confusing and can lead to lengthy, unproductive attempts to debug when it isn't their fault, ultimately leading to a frustrating experience for everyone involved.

Unfortunately, experience has shown this to be wrong. The failures build up, and get covered up by flaky. They creep in. Everyone hits retry.

By release time, it becomes chronic. No CI passes. But the cause is now far removed: many of the flakes are caused by month-old commits. I spend days going through the CI failures, and finding real bugs. Often I'm not the code author, so it takes even more work.

This is what happens in practice, and I'm sick of it. Instead, let's try failing early, and we can hopefully get people to fix their tests.

Instead of ignoring CI like we do today :(

Keep a proper cache of all possible ones.  I think this may be the
timeout problem: according to the logs, channeld_fakenet stops responding
and thus HTLCs eventually time out.

```
2024-12-16T23:16:16.4874420Z lightningd-1 2024-12-16T22:45:14.068Z UNUSUAL 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Adding HTLC 18446744073709551615 too slow: killing connection
```

Signed-off-by: Rusty Russell <[email protected]>
Make sure the balancing payment is fully cleared before trying to get a route:

```
    def test_penalty_htlc_tx_fulfill(node_factory, bitcoind, chainparams, anchors):

        # now we send one 'sticky' htlc: l4->l1
        amt = 10**8 // 2
        sticky_inv = l1.rpc.invoice(amt, '2', 'sticky')
>       route = l4.rpc.getroute(l1.info['id'], amt, 1)['route']

tests/test_closing.py:1232:

>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: getroute, payload: {'id': '0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518', 'amount_msat': 50000000, 'riskfactor': 1, 'cltv': 9}, error: {'code': 205, 'message': 'Could not find a route'}
```
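
One way to express "fully cleared" in the test, assuming the pyln.testing `wait_for_htlcs()` node helper and that these are the nodes involved (both assumptions, not the literal commit):

```
# Sketch (assumption: l1..l4 are the nodes in this test): make sure no
# HTLCs from the balancing payment are still in flight before asking for
# a route, so channel liquidity is back where the test expects it.
for node in (l1, l2, l3, l4):
    node.wait_for_htlcs()

route = l4.rpc.getroute(l1.info['id'], amt, 1)['route']
```
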
Just because we've seen the block doesn't mean onchaind has finished
starting up.

```
 _____________________________ test_restorefrompeer _____________________________
[gw0] linux -- Python 3.10.15 /home/runner/.cache/pypoetry/virtualenvs/cln-meta-project-AqJ9wMix-py3.10/bin/python

node_factory = <pyln.testing.utils.NodeFactory object at 0x7fb8f3887f70>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7fb8f3886f50>

    @unittest.skipIf(os.getenv('TEST_DB_PROVIDER', 'sqlite3') != 'sqlite3', "deletes database, which is assumed sqlite3")
    def test_restorefrompeer(node_factory, bitcoind):
        """
        Test restorefrompeer
        """
        l1, l2 = node_factory.get_nodes(2, [{'broken_log': 'ERROR: Unknown commitment #.*, recovering our funds!',
                                             'experimental-peer-storage': None,
                                             'may_reconnect': True,
                                             'allow_bad_gossip': True},
                                            {'experimental-peer-storage': None,
                                             'may_reconnect': True}])
    
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
    
        c12, _ = l1.fundchannel(l2, 10**5)
        assert l1.daemon.is_in_log('Peer storage sent!')
        assert l2.daemon.is_in_log('Peer storage sent!')
    
        l1.stop()
        os.unlink(os.path.join(l1.daemon.lightning_dir, TEST_NETWORK, "lightningd.sqlite3"))
    
        l1.start()
        assert l1.daemon.is_in_log('Server started with public key')
    
        # If this happens fast enough, connect fails with "disconnected
        # during connection"
        try:
            l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
        except RpcError as err:
            assert "disconnected during connection" in err.error['message']
    
        l1.daemon.wait_for_log('peer_in WIRE_YOUR_PEER_STORAGE')
    
        assert l1.rpc.restorefrompeer()['stubs'][0] == _['channel_id']
    
        l1.daemon.wait_for_log('peer_out WIRE_ERROR')
        l2.daemon.wait_for_log('State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL')
    
        bitcoind.generate_block(5, wait_for_mempool=1)
        sync_blockheight(bitcoind, [l1, l2])
    
        l1.daemon.wait_for_log(r'All outputs resolved.*')
        wait_for(lambda: l1.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN")
    
        # Check if funds are recovered.
        assert l1.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN"
>       assert l2.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN"
E       AssertionError: assert 'FUNDING_SPEND_SEEN' == 'ONCHAIN'
E         - ONCHAIN
E         + FUNDING_SPEND_SEEN

tests/test_misc.py:3044: AssertionError
```
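
A sketch of the wait this implies, reusing the `wait_for` helper the test already uses (exact placement is an assumption):

```
# Sketch: poll until onchaind on both sides has processed the spend,
# rather than asserting the final state the moment the block is seen.
wait_for(lambda: l1.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN")
wait_for(lambda: l2.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN")
```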

Signed-off-by: Rusty Russell <[email protected]>
rustyrussell merged commit 98679aa into ElementsProject:master on Dec 20, 2024
39 checks passed