Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boltz: Validate initial claim destination address #246

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

s373nZ
Copy link

@s373nZ s373nZ commented Oct 21, 2024

Tries to prevent the following error in the logs:

Unhandled exception in concurrent task!
lightningd[181525]: Bitcoin::UnknownAddrType

In my case, the database table BoltzServiceFactory_rsub has four entries, all of which contain empty strings for the destinationAddress field. These blank fields are predictably un-parsable as Bitcoin addresses.

Without the validation, the empty addresses get passed in to building an initial claim tx here and an error is thrown here.

I have next to 0 experience with clboss code and am no C++ expert by any means. Submitting this as a PoC / draft PR for a quick glance and if it's on the right track, I'll attempt to finish the address parsing validation.

@s373nZ
Copy link
Author

s373nZ commented Oct 25, 2024

After a bit more research on the code and my DB state, I see there is a more robust system around clboss creating the swaps. Particularly, I've been trying to understand more about the SwapManager. To note, and possibly related to this issue are the facts that:

  • SELECT * FROM SwapManager; return one row, with state=2 (enum AwaitingResult), but has no value initialized for the address column.
  • SELECT * FROM SwapManager_addrcache; returns a total count of 95 rows, all except one have no values for the address field.

So... perhaps there is also an issue with address generation prior to the validation crash, as well.

@s373nZ
Copy link
Author

s373nZ commented Oct 25, 2024

Changed the query to exclude loading empty addresses from the address cache. An additional idea might be to run a quick SQL query to clean up the address cache on plugin startup by deleting records with empty address fields.

Need to revisit whether the initial validation changes are still appropriate.

@s373nZ s373nZ force-pushed the validate-boltz-init-claim-destination-address branch from 1cf7746 to 58f96fc Compare October 25, 2024 13:14
@s373nZ
Copy link
Author

s373nZ commented Oct 25, 2024

Added an additional sanity check to not insert blank addresses from failed swaps into the address cache.

@s373nZ s373nZ force-pushed the validate-boltz-init-claim-destination-address branch from fba0bc3 to b267ef8 Compare October 25, 2024 13:41
@s373nZ
Copy link
Author

s373nZ commented Oct 25, 2024

Also added an SQL query run when the SwapManager is initialized which would remove any erroneously created records in the address cache which contain blank addresses.

Haven't tested this PR yet, as I only have a prod node right now, but taking this PR out of draft to solicit initial feedback.

@s373nZ s373nZ marked this pull request as ready for review October 25, 2024 13:45
@ksedgwic
Copy link
Collaborator

Chris Guida recommends checking whether CLBOSS needs any P2TR support.

@chrisguida
Copy link
Contributor

@ksedgwic I don't think that's the cause here as clboss is calling newaddr without params (which returns bc1q by default) and explicitly selects the bech32 key from the result.

But yes, generally it it probably good for us to make sure clboss is cool with taproot addresses.

@s373nZ
Copy link
Author

s373nZ commented Oct 26, 2024

I just tested the newaddr command on my node and it's returning:

lightning-cli newaddr
{
   "bech32": "bc1q<redacted>"
}

So... I think this is caused mainly by corruption of empty addresses getting into the SwapManager_addrcache cache table, possibly due to a prior upgrade issue of CLN or Bitcoin. Once an empty address gets in erroneously, it can get loaded from cache on the next swap attempts. When that swap attempt fails due to the empty address, it gets put back into the cache due to the missing validation, because the failure is assuming an issue with the Boltz workflow. So -- vicious cycle and a growing cache table of empty addresses.

I believe this PR has merit as-is and plan to try and test run this on my own node eventually, but will wait a few days for others to inspect their nodes' cache tables. Or maybe I will spin up a testnet Raspiblitz as @chrisguida suggested, if I can find the time.

@s373nZ s373nZ force-pushed the validate-boltz-init-claim-destination-address branch from b267ef8 to daa6469 Compare November 4, 2024 20:25
@s373nZ
Copy link
Author

s373nZ commented Nov 24, 2024

I've been running this PR branch on my mainnet node for the past three weeks. I can confirm that the database migration did indeed fix the empty rows in the cache table successfully. Additionally, while it took a few weeks for a swap to go through due to increasing fees, the node performed its first automated swap recently. From my testing, it looks good. Interested to know whether others have checked their swap address cache tables for this condition, or what other sort of testing might need to be done to consider this worthy for inclusion.

@s373nZ
Copy link
Author

s373nZ commented Dec 10, 2024

@ksedgwic In an effort to understand the root cause of this issue more completely, and hopefully assist in your review, here is an outline of code locations for consideration:

  1. The SwapManager table get inserted with a new row which doesn't have the address field populated in on_swap_request() here.
  2. Near the end of on_swap_request(), clboss uses kicks of a concurrent operation to process_needs_address() here.
  3. process_needs_address() is also kicked off periodically, here, raising a question for me about which event is generally responsible for mapping an address to the swap reliably.
  4. In fail_swap(), the SwapManager is inserting an address from the SwapManager table into the SwapManager_addrcache table here without checking to make sure the address exists (prior to the PR changes). We might guess failed swaps which are trying to conserve / reuse an address have a high likelihood of contributing to the root cause.
  5. The address assignment for swaps is handled in loop_needs_address(). Prior to this PR, it would just load the first value from SwapManager_addrcache, which could be empty, and assign it, causing an issue and subsequent swap failure. These changes ensure that doesn't happen anymore.

The implementation is using concurrency and the event driven model, which is cool, but makes it challenging for me to trace the issue mentally. It feels like there is a race condition. If two swaps kick off at the exact same time, and one fails prior to getting an address assigned (by the concurrent process), it populates the address cache table with an empty row, and this empty value gets reused in subsequent swap address assignments, hanging up future swaps. Not 100% sure this is the whole story, but hopefully on the right track and helps point you in the right direction to understand.

I was thinking to try and add some additional logging and a stacktrace when we detect an empty address is inserted into the cache table, but for now, I'm convinced that preventing that from happening is a comprehensive fix.

cc @chrisguida

@s373nZ s373nZ force-pushed the validate-boltz-init-claim-destination-address branch from daa6469 to 48a6ec2 Compare December 10, 2024 13:07
@s373nZ
Copy link
Author

s373nZ commented Dec 10, 2024

Rebased against master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants