[bug]: Payment Flow: Don't abort payment while HTLC attempts are still in flight #8975
I think the question is: should we ever abort the payment lifecycle even when there are in-flight HTLCs? The answer is yes, but only if we get a non-MPP-related error, such as a DB-related error or other lower-level critical errors. The previous assumption was that we should only rely on this func to decide whether we want to abort the loop, lnd/routing/payment_lifecycle.go Lines 233 to 237 in ab96de3
which works most of the time given the setup that,
but it can break if,
The root cause is that we have two different views of the payment in two goroutines,
The simple fix is to always read and write to the DB in the same goroutine, which should happen in lnd/routing/payment_lifecycle.go Lines 509 to 511 in ab96de3
A better fix is to refactor deeper to make the DB actions clearer, which I think can be accomplished in the upcoming native payment SQL PRs.
The more we rely on ephemeral state, the less fault tolerant it becomes. Suppose a power outage happens: all the results will be lost, unless we can reconstruct them, say, if
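To make the "simple fix" above concrete, here is a minimal Go sketch of the idea, using hypothetical names (`attemptResult`, `paymentDB`, `resolvePayment` are illustrations, not lnd's actual code): result collectors only deliver outcomes over a channel, and a single goroutine performs every read and write of the payment record, so there is only one view of the payment.

```go
package paymentsketch

// attemptResult is what a collector goroutine reports back for one HTLC
// attempt. All names here are hypothetical.
type attemptResult struct {
	attemptID uint64
	err       error
}

// paymentDB is a stand-in for the payment control tower / DB layer.
type paymentDB interface {
	FetchInflightCount(hash [32]byte) (int, error)
	SettleAttempt(hash [32]byte, attemptID uint64) error
	FailAttempt(hash [32]byte, attemptID uint64, reason error) error
}

// resolvePayment is the single goroutine that owns all DB access for this
// payment. Collector goroutines only send on resultChan; they never touch
// the DB, so the state read here always reflects the outcomes applied here.
func resolvePayment(db paymentDB, hash [32]byte,
	resultChan <-chan attemptResult, quit <-chan struct{}) error {

	for {
		// Read the payment state in the same goroutine that also
		// persists attempt outcomes below.
		inflight, err := db.FetchInflightCount(hash)
		if err != nil {
			return err
		}
		if inflight == 0 {
			return nil
		}

		select {
		case res := <-resultChan:
			// Persist the outcome here so the next read already sees it.
			if res.err != nil {
				if err := db.FailAttempt(hash, res.attemptID, res.err); err != nil {
					return err
				}
				continue
			}
			if err := db.SettleAttempt(hash, res.attemptID); err != nil {
				return err
			}

		case <-quit:
			return nil
		}
	}
}
```

Whether this maps cleanly onto the current resumePayment loop is exactly the open question here; the sketch only shows single-goroutine ownership of the reads and writes.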
Hmm, I would rather make sure we resolve all INFLIGHT HTLCs even if we have a lower-level critical error, so that we mark this Payment and all its HTLCs as somehow unrecoverable, so that at least the
Do you mean those steps must all happen in sequential order for the payment lifecycle to break, or does every point represent a case where the payment lifecycle fails? I guess it's the former, but I don't think Point 2 is necessary for it to break?
Depending on when the Payment SQL refactor is planned, I would prefer going with a simple fix first and then proceeding. Your recommendation of writing to the DB in

While looking through the code I am wondering why we mutate the state of the payment in a fetchPayment method: Line 274 in ab96de3
Lines 314 to 319 in ab96de3
I think we should not change the state in a fetch method, or we should call this
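To illustrate the separation being suggested, a small hedged sketch (hypothetical names, not lnd's actual API) in which the fetch is a pure read and the state transition is an explicitly named call at the call site:

```go
package paymentsketch

// Payment and FailureReason are simplified stand-ins.
type Payment struct{}

type FailureReason int

const FailureReasonTimeout FailureReason = iota

// paymentStore separates the pure read from the explicit state transition.
type paymentStore interface {
	// FetchPayment only reads; it never transitions the payment state.
	FetchPayment(hash [32]byte) (*Payment, error)

	// FailPayment is the explicit transition that would otherwise be
	// hidden inside the fetch path.
	FailPayment(hash [32]byte, reason FailureReason) error
}

// checkTimeout shows the call site: detect the condition after a plain
// fetch, then invoke the mutating method by name.
func checkTimeout(store paymentStore, hash [32]byte,
	deadlineExceeded bool) error {

	if _, err := store.FetchPayment(hash); err != nil {
		return err
	}
	if deadlineExceeded {
		return store.FailPayment(hash, FailureReasonTimeout)
	}
	return nil
}
```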
This keeps happening for us on a regular basis on mainnet. There is no way to prevent it apart from not using MPP, and even that is just an educated guess.
Following this issue at somewhat of a distance at the moment, but very curious to see the native payment SQL PRs and any continued development on @yyforyongyu's comment:
Coming from the perspective of running the ChannelRouter remotely from the Switch and allowing communication via RPC, I have found myself getting the feeling that the Switch store might need some upgrades. In such a scenario, you have the ChannelRouter driving state transitions for the status of a locally initiated payment (e.g. non-forward) using state maintained by a remote Switch. The existing networkResultStore may not be quite sufficient for this.
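Purely as an illustration of what such "upgrades" might look like in that remote setup, a hypothetical interface sketch (none of these names exist in lnd, and the actual networkResultStore API may differ significantly):

```go
package switchsketch

import "context"

// AttemptOutcome is a hypothetical, RPC-friendly description of how one
// HTLC attempt resolved.
type AttemptOutcome struct {
	AttemptID uint64
	Settled   bool
	Preimage  [32]byte
	Err       error
}

// AttemptResultStore sketches what a remote ChannelRouter might want from
// the Switch: re-subscribable result streams plus a way to catch up on
// outcomes missed while the RPC connection was down.
type AttemptResultStore interface {
	// SubscribeResult streams the final result for an attempt ID and can
	// be re-established after a reconnect.
	SubscribeResult(ctx context.Context, attemptID uint64) (<-chan AttemptOutcome, error)

	// ResultsSince returns outcomes recorded after the given attempt ID,
	// so a reconnecting router can reconcile its view.
	ResultsSince(ctx context.Context, sinceAttemptID uint64) ([]AttemptOutcome, error)
}
```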
BTW, we still have several cases per week; we can't wait to have this fix in. If you need anything from us, don't be shy!
Seems like we are affected by this problem as well. We have seen some payments, using MPP, get stuck as in-flight. Is there any other reasonable workaround, since restarting the nodes can be painful?
Also experiencing this issue |
Remarked in #8174
Analysed some of the behavior here:
#8174 (comment)
What I think might be worth considering to change (from my understanding of the codebase):
When sending a payment, LND will by default attempt an MPP payment with up to 16 shards per payment. We launch all of them asynchronously; however, when one of the HTLC attempts fails during the process for specific reasons, e.g. requesting a route fails or we cannot register a payment attempt:
lnd/routing/payment_lifecycle.go
Lines 258 to 280 in 7e1e054
we abort the payment lifecycle and don't wait for the INFLIGHT HTLC attempts to be resolved; we will only reconsider them during a restart.
One of the examples described in #8174 (comment)
is that as soon as the MPP times out at the receiver side while we are in the process of sending another HTLC, we will fail the payment lifecycle without resolving all the INFLIGHT HTLCs.
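For illustration, a hedged sketch of the behavior the issue title asks for, using simplified hypothetical types (`lifecycle`, `attemptResult` are not lnd's real types): on a terminal error, stop launching new shards but keep draining the results of attempts that are already in flight before failing the payment.

```go
package paymentsketch

// attemptResult is the outcome a collector goroutine reports for one shard.
type attemptResult struct {
	attemptID uint64
	err       error
}

// lifecycle is a simplified stand-in for the payment lifecycle state: how
// many HTLC attempts are unresolved and the channel on which collectors
// deliver their outcomes.
type lifecycle struct {
	inflight   int
	resultChan chan attemptResult
}

// finishOnError stops launching new shards but drains every in-flight
// attempt before the payment itself is failed, so no HTLC is left behind
// in an unresolved state.
func (l *lifecycle) finishOnError(terminalErr error) error {
	for l.inflight > 0 {
		res := <-l.resultChan
		if res.err != nil {
			// In real code: persist the attempt failure.
		} else {
			// In real code: persist the settle; the payment could even
			// still succeed if the remaining shards complete it.
		}
		l.inflight--
	}
	return terminalErr
}
```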
The current behavior has downsides:
On the one hand, the payment will always be in-flight, and service providers using the API might not be aware that this payment has no chance of resolving successfully because, for example, the MPPTimeout has already been reached.
Moreover, when subscribed to the TrackPayment API, the payment stays INFLIGHT/PENDING forever.

Possible solutions, interested in feedback here:
I think it makes sense to store the HTLC attempts in memory during the execution of the payment so that the payment lifecycle is always aware of how many HTLCs have been sent. We already store them in the DB, but I think it might be worthwhile to also have them easily accessible in the lifecycle struct. (Currently we only have one golang channel which listens to the potential results of all the result collectors.)
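As a rough sketch of that in-memory tracking idea (hypothetical names; lnd's real HTLCAttempt type looks different), the lifecycle could keep its own map of launched attempts next to the DB records:

```go
package paymentsketch

import "sync"

// HTLCAttempt is a minimal stand-in for one launched shard.
type HTLCAttempt struct {
	ID       uint64
	Resolved bool
}

// paymentLifecycle tracks launched attempts in memory so the loop always
// knows how many are unresolved without reading another goroutine's DB view.
type paymentLifecycle struct {
	mu       sync.Mutex
	attempts map[uint64]*HTLCAttempt
}

// registerAttempt records a newly launched shard in memory, mirroring the
// DB write that already happens today.
func (p *paymentLifecycle) registerAttempt(id uint64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.attempts == nil {
		p.attempts = make(map[uint64]*HTLCAttempt)
	}
	p.attempts[id] = &HTLCAttempt{ID: id}
}

// resolveAttempt marks a shard as resolved when its collector reports back.
func (p *paymentLifecycle) resolveAttempt(id uint64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if a, ok := p.attempts[id]; ok {
		a.Resolved = true
	}
}

// numInflight tells the lifecycle loop whether it is safe to exit.
func (p *paymentLifecycle) numInflight() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	n := 0
	for _, a := range p.attempts {
		if !a.Resolved {
			n++
		}
	}
	return n
}
```

The existing result channel would keep delivering outcomes; the map just gives the lifecycle loop a cheap, always-current answer to "are any attempts still unresolved?".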