
Fix flakes #11 #7893

Merged: 7 commits into ElementsProject:master on Dec 20, 2024

Conversation

rustyrussell (Contributor)

Changelog-None: just tests.

rustyrussell added this to the v25.02 milestone on Dec 2, 2024
cdecker (Member) commented Dec 6, 2024

Uhm, I'm not a fan of disabling force-flaky in CI: we know we have flaky tests, despite all of our efforts to fix them up. Having flaky tests means that developers working on unrelated things get hit with random CI failures, which is confusing and can lead to lengthy, unproductive attempts to debug when it isn't their fault, ultimately leading to a frustrating experience for everyone involved.

I have started tracking flakiness at http://35.239.136.52:3170/run, and was hoping to close the loop by having issues created for tests that are exceedingly flaky and assigning them to the author who owns most of the test's lines in git-blame. But alas, it's one of those ambitious side projects that never got to where it was intended to go.

Anyway, please reconsider removing --force-flaky from CI as I think the knock-on effects of an elevated flaky test failure rate are quite damaging.

This should help us fix the flakes.  Dealing with them was a major
headache at the end of the last release (and they covered up real bugs!).

Signed-off-by: Rusty Russell <[email protected]>
The lifetime is bound to the response, but memleak gets upset if it
sees it.  Use the standard take() pattern instead.

```
lightningd-5 2024-11-28T05:02:27.503Z **BROKEN** lightningd: MEMLEAK: 0x55b58f329338
lightningd-5 2024-11-28T05:02:27.503Z **BROKEN** lightningd:   label=common/utils.c:150:char[]
lightningd-5 2024-11-28T05:02:27.504Z **BROKEN** lightningd:   alloc:
lightningd-5 2024-11-28T05:02:27.585Z **BROKEN** lightningd:     ccan/ccan/tal/tal.c:488 (tal_alloc_)
lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd:     ccan/ccan/tal/tal.c:517 (tal_alloc_arr_)
lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd:     ccan/ccan/tal/tal.c:861 (tal_dup_)
lightningd-5 2024-11-28T05:02:27.586Z **BROKEN** lightningd:     common/utils.c:150 (utf8_str)
lightningd-5 2024-11-28T05:02:27.589Z **BROKEN** lightningd:     lightningd/datastore.c:23 (json_add_datastore)
lightningd-5 2024-11-28T05:02:27.589Z **BROKEN** lightningd:     lightningd/datastore.c:205 (json_datastore)
lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd:     lightningd/jsonrpc.c:808 (command_exec)
lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd:     lightningd/jsonrpc.c:954 (rpc_command_hook_final)
lightningd-5 2024-11-28T05:02:27.590Z **BROKEN** lightningd:     lightningd/plugin_hook.c:196 (plugin_hook_call_next)
lightningd-5 2024-11-28T05:02:27.591Z **BROKEN** lightningd:     lightningd/plugin_hook.c:183 (plugin_hook_callback)
lightningd-5 2024-11-28T05:02:27.593Z **BROKEN** lightningd:     lightningd/plugin.c:663 (plugin_response_handle)
lightningd-5 2024-11-28T05:02:27.593Z **BROKEN** lightningd:     lightningd/plugin.c:775 (plugin_read_json_one)
```

Signed-off-by: Rusty Russell <[email protected]>
It's possible that listchannels doesn't show the channel yet:

```
    
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
        l2.rpc.connect(l3.info['id'], 'localhost', l3.port)
    
        scid1, _ = l1.fundchannel(l2, 10**6)
        scid2, _ = l2.fundchannel(l3, 10**6)
    
        mine_funding_to_announce(bitcoind, [l1, l2, l3])
>       l1_initial_cupdate_timestamp = only_one(l1.rpc.listchannels(source=l1.info['id'])['channels'])['last_update']

tests/test_gossip.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = []

    def only_one(arr):
        """Many JSON RPC calls return an array; often we only expect a single entry
        """
>       assert len(arr) == 1
E       AssertionError
```
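
A minimal sketch of the kind of wait that avoids this race, assuming the standard pyln.testing helpers `wait_for` and `only_one` (the actual fix in the commit may differ):

```
# Sketch (assumes pyln.testing helpers): wait until the channel shows up
# in listchannels before reading its last_update, instead of assuming it
# is already there.
from pyln.testing.utils import only_one, wait_for

wait_for(lambda: len(l1.rpc.listchannels(source=l1.info['id'])['channels']) == 1)
l1_initial_cupdate_timestamp = only_one(
    l1.rpc.listchannels(source=l1.info['id'])['channels'])['last_update']
```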

Signed-off-by: Rusty Russell <[email protected]>
We can get the reply_short_channel_ids_end in the messages when
we make a query:

```
2024-11-29T07:39:28.8550652Z         time_fast = time.time() - start_fast
2024-11-29T07:39:28.8551067Z         assert time_fast < 2
2024-11-29T07:39:28.8551487Z         out3 = [m for m in out3 if not m.startswith(b'0109')]
2024-11-29T07:39:28.8552158Z >       assert set(out1) == set(out3)
...
2024-11-29T07:39:28.8675516Z E         Extra items in the right set:
2024-11-29T07:39:28.8675887Z E         b'010606226e46111a0b59caaf126043eb5bbf28c34f3a5e332a1fc7b2b73cf188910f01'
```
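
The extra `b'0106...'` is a reply_short_channel_ids_end (type 262, hex 0x0106), just as 0109 is gossip_timestamp_filter. One plausible test-side fix, sketched here against the `out1`/`out3` lists from the test (the exact filtering in the commit may differ):

```
# Sketch: strip timing-dependent query replies from both captures before
# comparing, not just the gossip_timestamp_filter (0109) messages.
def strip_timing_dependent(msgs):
    return {m for m in msgs
            if not m.startswith(b'0109') and not m.startswith(b'0106')}

assert strip_timing_dependent(out1) == strip_timing_dependent(out3)
```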

Signed-off-by: Rusty Russell <[email protected]>
rustyrussell (Contributor, Author) commented Dec 18, 2024

> Uhm, I'm not a fan of disabling force-flaky in CI: we know we have flaky tests, despite all of our efforts to fix them up. Having flaky tests means that developers working on unrelated things get hit with random CI failures, which is confusing and can lead to lengthy, unproductive attempts to debug when it isn't their fault, ultimately leading to a frustrating experience for everyone involved.

Unfortunately, experience has shown this to be wrong. The failures build up, and get covered up by flaky. They creep in. Everyone hits retry.

By release time, it becomes chronic. No CI passes. But the cause is now far removed: many of the flakes are caused by month-old commits. I spend days going through the CI failures, and finding real bugs. Often I'm not the code author, so it takes even more work.

This is what happens in practice, and I'm sick of it. Instead, let's try failing early, and we can hopefully get people to fix their tests.

Instead of ignoring CI like we do today :(

Keep a proper cache of all possible ones.  I think this may be the
timeout problem: according to the logs, channeld_fakenet stops responding
and thus HTLCs eventually time out.

```
2024-12-16T23:16:16.4874420Z lightningd-1 2024-12-16T22:45:14.068Z UNUSUAL 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Adding HTLC 18446744073709551615 too slow: killing connection
```

Signed-off-by: Rusty Russell <[email protected]>
Make sure the balancing payment is fully cleared before trying to get a route:

```
    def test_penalty_htlc_tx_fulfill(node_factory, bitcoind, chainparams, anchors):

        # now we send one 'sticky' htlc: l4->l1
        amt = 10**8 // 2
        sticky_inv = l1.rpc.invoice(amt, '2', 'sticky')
>       route = l4.rpc.getroute(l1.info['id'], amt, 1)['route']

tests/test_closing.py:1232:

>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: getroute, payload: {'id': '0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518', 'amount_msat': 50000000, 'riskfactor': 1, 'cltv': 9}, error: {'code': 205, 'message': 'Could not find a route'}
```
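
One way to express "fully cleared" in the test, assuming the pyln.testing `wait_for_htlcs()` node helper and that these are the nodes involved (both assumptions, not the literal commit):

```
# Sketch (assumption: l1..l4 are the nodes in this test): make sure no
# HTLCs from the balancing payment are still in flight before asking for
# a route, so channel liquidity is back where the test expects it.
for node in (l1, l2, l3, l4):
    node.wait_for_htlcs()

route = l4.rpc.getroute(l1.info['id'], amt, 1)['route']
```
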
Just because we've seen the block doesn't mean onchaind has finished
starting up.

```
 _____________________________ test_restorefrompeer _____________________________
[gw0] linux -- Python 3.10.15 /home/runner/.cache/pypoetry/virtualenvs/cln-meta-project-AqJ9wMix-py3.10/bin/python

node_factory = <pyln.testing.utils.NodeFactory object at 0x7fb8f3887f70>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7fb8f3886f50>

    @unittest.skipIf(os.getenv('TEST_DB_PROVIDER', 'sqlite3') != 'sqlite3', "deletes database, which is assumed sqlite3")
    def test_restorefrompeer(node_factory, bitcoind):
        """
        Test restorefrompeer
        """
        l1, l2 = node_factory.get_nodes(2, [{'broken_log': 'ERROR: Unknown commitment #.*, recovering our funds!',
                                             'experimental-peer-storage': None,
                                             'may_reconnect': True,
                                             'allow_bad_gossip': True},
                                            {'experimental-peer-storage': None,
                                             'may_reconnect': True}])
    
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
    
        c12, _ = l1.fundchannel(l2, 10**5)
        assert l1.daemon.is_in_log('Peer storage sent!')
        assert l2.daemon.is_in_log('Peer storage sent!')
    
        l1.stop()
        os.unlink(os.path.join(l1.daemon.lightning_dir, TEST_NETWORK, "lightningd.sqlite3"))
    
        l1.start()
        assert l1.daemon.is_in_log('Server started with public key')
    
        # If this happens fast enough, connect fails with "disconnected
        # during connection"
        try:
            l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
        except RpcError as err:
            assert "disconnected during connection" in err.error['message']
    
        l1.daemon.wait_for_log('peer_in WIRE_YOUR_PEER_STORAGE')
    
        assert l1.rpc.restorefrompeer()['stubs'][0] == _['channel_id']
    
        l1.daemon.wait_for_log('peer_out WIRE_ERROR')
        l2.daemon.wait_for_log('State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL')
    
        bitcoind.generate_block(5, wait_for_mempool=1)
        sync_blockheight(bitcoind, [l1, l2])
    
        l1.daemon.wait_for_log(r'All outputs resolved.*')
        wait_for(lambda: l1.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN")
    
        # Check if funds are recovered.
        assert l1.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN"
>       assert l2.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN"
E       AssertionError: assert 'FUNDING_SPEND_SEEN' == 'ONCHAIN'
E         - ONCHAIN
E         + FUNDING_SPEND_SEEN

tests/test_misc.py:3044: AssertionError
```
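
A sketch of the wait this implies, reusing the `wait_for` helper the test already uses (exact placement is an assumption):

```
# Sketch: poll until onchaind on both sides has processed the spend,
# rather than asserting the final state the moment the block is seen.
wait_for(lambda: l1.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN")
wait_for(lambda: l2.rpc.listfunds()["channels"][0]["state"] == "ONCHAIN")
```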

Signed-off-by: Rusty Russell <[email protected]>
rustyrussell merged commit 98679aa into ElementsProject:master on Dec 20, 2024
39 checks passed