Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]: Payment Flow: Don't abort payment until HTLC attempts are still in flight #8975

Open
ziggie1984 opened this issue Aug 5, 2024 · 8 comments · May be fixed by #9150
Open

[bug]: Payment Flow: Don't abort payment until HTLC attempts are still in flight #8975

ziggie1984 opened this issue Aug 5, 2024 · 8 comments · May be fixed by #9150
Assignees
Labels
bug Unintended code behaviour payments Related to invoices/payments
Milestone

Comments

@ziggie1984
Copy link
Collaborator

ziggie1984 commented Aug 5, 2024

Remarked in #8174

Analysed some of the behavior here:

#8174 (comment)

What I think might be taken into consideration to change (from my point of understanding the codebase here)

When sending a payment LND will by default attempt MPP payment up to 16 shards per payment. We launch all of the asynchronously however when during the process one of the HTLC attempts fails because of specific reasons e.g. requesting a route fails or we cannot register a payment attempt:

rt, err := p.requestRoute(ps)
if err != nil {
return exitWithErr(err)
}
// We may not be able to find a route for current attempt. In
// that case, we continue the loop and move straight to the
// next iteration in case there are results for inflight HTLCs
// that still need to be collected.
if rt == nil {
log.Errorf("No route found for payment %v",
p.identifier)
continue lifecycle
}
log.Tracef("Found route: %s", spew.Sdump(rt.Hops))
// We found a route to try, create a new HTLC attempt to try.
attempt, err := p.registerAttempt(rt, ps.RemainingAmt)
if err != nil {
return exitWithErr(err)
}

we abort the payment cycle and don't wait for INFLIGHT HTLC attempts to be resolved, we will only reconsider them during a restart.

One of the examples described in #8174 (comment)
is that as soon as the MPP times-out at the receiver side, but we are in the process of sending another HTLC, we will fail the payment lifecycle without resolving all the INFLIGHT htlcs.

This has downsides:

  1. On the one hand the payment will always be inflight and service providers using the API might not be aware that this payment has no chance of resolving successfully because the MPPTimeout has already been reached for example.

  2. Moreover when described to the Trackpayment API the payment stays INFLIGHT/PENDING forever.

Possible solutions, interested in feedback here:

I think it makes sense to store the HTLC Attempts in Memory during the excecution of the payment so that the payment lifecycle can always be aware how many HTLCs have been sent ? I store them in the DB but I think it might be worthwhile to also have the easy accessible in life cycle struct maybe ? (Currently we only have 1 golang channel which listens to potential results of all the result collectors)

@ziggie1984 ziggie1984 added bug Unintended code behaviour intermediate Issues suitable for developers moderately familiar with the codebase and LN payments Related to invoices/payments needs triage and removed needs triage intermediate Issues suitable for developers moderately familiar with the codebase and LN labels Aug 5, 2024
@yyforyongyu
Copy link
Member

I think the question is, should we ever abort the payment lifecycle even when there are inflight HTLCs? The answer is yes, but only if we get a non-MPP-related error, such as a DB-related error, or other lower-level critical errors.

The previous assumption is, we should only rely on this func to decide if we want to abort the loop,

// Now decide the next step of the current lifecycle.
step, err := p.decideNextStep(payment)
if err != nil {
return exitWithErr(err)
}

which works most of the time given the setup that,

  1. the invoice has a timeout of 60s
  2. finding a route via RequestRoute is fast (in million seconds)
  3. creating and sending the HTLC attempt is fast.

but it can break if,

  1. one or more HTLCs are in flight.
  2. the sender node shuts down before the invoice times out and restarts after the invoice times out.
  3. we call decideNextStep before the payment is marked as failed, and proceed to register another attempt.
  4. we call RegisterAttempt, now it fails as we cannot register attempts on a failed payment.

The root cause is we have two different views of the payment in two goroutines,

  • G1: in resumePayment, we always fetch the latest payment state for each iteration, and make decisions based on this state.
  • G2: in collectResultAsync, we will alter the payment state based on the collected result from htlcswitch.

The simple fix is to always read and write to the DB in the same goroutine, which should happen in resumePayment. Since the only slow action is to collect results from htlcswitch via,

resultChan, err := p.router.cfg.Payer.GetAttemptResult(
attempt.AttemptID, p.identifier, errorDecryptor,
)

A better fix is to refactor deeper to make the DB actions clearer, which I think can be accomplished in the upcoming native payment SQL PRs.

I think it makes sense to store the HTLC Attempts in Memory during the excecution of the payment so that the payment lifecycle can always be aware how many HTLCs have been sent ?

The more we rely on ephemeral states the less fault tolerant it becomes. Suppose a power outage happens all the results will be lost - unless we can reconstruct them, say, if htlcswitch has stored it on disk.

@ziggie1984
Copy link
Collaborator Author

ziggie1984 commented Aug 6, 2024

I think the question is, should we ever abort the payment lifecycle even when there are inflight HTLCs? The answer is yes, but only if we get a non-MPP-related error, such as a DB-related error, or other lower-level critical errors.

Hmm I tend to rather make sure we resolve all INFLIGHT HTLCs even if we have a lower-level critical error so that we make sure we mark this Payment and all its HTLCs as somehow unrecoverable, so that at least the trackpayment cmd signals the exit of the payment correctly.

but it can break if,

one or more HTLCs are in flight.
the sender node shuts down before the invoice times out and restarts after the invoice times out.
we call decideNextStep before the payment is marked as failed, and proceed to register another attempt.
we call RegisterAttempt, now it fails as we cannot register attempts on a failed payment.

Do you mean those steps all must happen in sequential order for the payment lifecycle to break or does every point represents a case where the payment lifecycle fails ? I guess it's the former but I don't think Point 2 is necessary for it to break ?

The simple fix is to always read and write to the DB in the same goroutine, which should happen in resumePayment. Since the only slow action is to collect results from htlcswitch via,

Depending on when the Payment Sql refactor is planned for I would prefer going with a simple fix first and then proceeding. Your recommendation of writing to the db in resumepayment is definitely a good way to solve it and its seems not super complicated.

While looking through the code I am wondering why we mutate the state of the payment in a fetchPayment method:

func fetchPayment(bucket kvdb.RBucket) (*MPPayment, error) {

lnd/channeldb/payments.go

Lines 314 to 319 in ab96de3

// Set its state and status.
if err := payment.setState(); err != nil {
return nil, err
}
return payment, nil

I think we should not change the state in a fetch method or we should call this fetch method update wdyt ?

@michael1011
Copy link
Contributor

This keeps happening for us on a regular basis on mainnet. There is no way to prevent it apart from not using MPP and even that is just an educated guess.

@yyforyongyu yyforyongyu added this to the v0.19.0 milestone Aug 27, 2024
@yyforyongyu yyforyongyu linked a pull request Oct 2, 2024 that will close this issue
2 tasks
@calvinrzachman
Copy link
Contributor

Following this issue at somewhat of a distance at the moment, but very curious to see the native payment SQL PRs and any continued development on @yyforyongyu's comment:

unless we can reconstruct them, say, if htlcswitch has stored it on disk.

Coming from the perspective of running the ChannelRouter remotely from the Switch and allowing communication via RPC, I have found myself getting the feeling that Switch store might need some upgrades. In such a scenario, you have the ChannelRouter driving state transitions about the status of a locally initiated payment (eg: non-forward) using state maintained by a remote Switch. The existing networkResultStore may not be quite sufficient for this.

@kilrau
Copy link
Contributor

kilrau commented Nov 21, 2024

This keeps happening for us on a regular basis on mainnet. There is no way to prevent it apart from not using MPP and even that is just an educated guess.

BTW we still have several cases per week, we can't wait to have this fix in. If you need anything from us, don't be shy!

@yyforyongyu
Copy link
Member

#9150 should fix this issue.

@kilrau maybe you could provide some relevant debug logs? Just to make sure #9150 covers it.

@nordbjorn
Copy link

Seems like we are affected by this problem as well. We have seen some payments, using MPP, get stuck as InFlight on our nodes. The pattern so far is that there are failed HTLCs, and one or more HTLC which is stuck as InFlight, even after it expires. Restarting the nodes resolved these payments to a Failed state.

Is there any other reasonable workaround, since restarting the nodes can be painful?

@lnproxy
Copy link

lnproxy commented Jan 16, 2025

Also experiencing this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unintended code behaviour payments Related to invoices/payments
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants