
Failing backup transfer with core 1.142.12 #5993

Closed
link2xt opened this issue Sep 24, 2024 · 10 comments · Fixed by #6027


link2xt commented Sep 24, 2024

User reports failure to transfer
a profile from Delta Chat Desktop 1.46.8 (core 1.142.12, iroh-net 0.22.0)
to an Android phone.

Both devices are connected to an Android hotspot
open on the phone.

Android is stuck on 14%, Desktop shows no progress bar at all and remains on the QR code.

Desktop is running from the command line
and it showed this error printed by iroh to stderr:

stderr: 2024-09-24T08:40:39.163347Z ERROR iroh_quinn_udp::imp: got transmit error, halting segmentation offload

The Desktop core log shows a lot of SMTP-related messages because it has multiple accounts.
The only lines relevant to the backup transfer of the first account are these:

2024-09-24T08:40:39.084Z core/event INFO "" 1 "src/imex/transfer.rs:191: Received valid backup authentication token."
2024-09-24T08:41:03.120Z core/event WARNING "" 1 "src/imex/transfer.rs:240: Error while handling backup connection: Failed to write backup into QUIC stream: connection lost: timed out."
link2xt added the bug label on Sep 24, 2024

link2xt commented Sep 24, 2024

The stderr error comes from here (permalink to some version of quinn; I did not check which one was actually used):
https://github.com/n0-computer/quinn/blob/1accbe97ec5d6e6d788debb5f45c08bdde6c3e5f/quinn-udp/src/unix.rs#L261


link2xt commented Sep 24, 2024

Essentially this is sendmmsg failing with a timeout. quinn has since been changed to simply use sendmsg and not buffer multiple packets in memory: quinn-rs/quinn#1729

It would be interesting to know how large the sendmmsg timeout is, as essentially this is the sendmmsg syscall returning ETIMEDOUT.

In the latest version sendmsg is here, and it can also fail with a timeout:
https://github.com/n0-computer/quinn/blob/765bff63786aaeedc17ee382ab11983abb3f475b/quinn-udp/src/unix.rs#L270


link2xt commented Sep 24, 2024

On Linux the timeout for sendmsg and sendmmsg should be controlled by the SO_SNDTIMEO socket option, but I don't see any code that sets it, and according to socket(7) the error should not be ETIMEDOUT even in that case.


link2xt commented Sep 24, 2024

Maybe it is not ETIMEDOUT, but this error:
https://github.com/n0-computer/quinn/blob/765bff63786aaeedc17ee382ab11983abb3f475b/quinn-proto/src/connection/mod.rs#L3618-L3624

sendmmsg failed at 40:39, but that was probably just EAGAIN or something similar.
The connection failed at 41:03, so it is a different event and the sendmmsg failure is not related.


link2xt commented Sep 24, 2024

This does not look like quinn's default 30-second max_idle_timeout: we received the token at 40:39, so the timeout should not have expired before 41:09, but we already got an error at 41:03. Or maybe receiving something does not renew the idle timeout?


link2xt commented Sep 26, 2024

Could also be some buffering issue with async-tar when the database size is large. Maybe async-tar gets stuck.
The previous DCBACKUP protocol did not use tar.

link2xt self-assigned this on Oct 4, 2024

link2xt commented Oct 4, 2024

I can reproduce this with my largest account: transferring from Desktop to Android fails in a similar way. Going to debug this now.


link2xt commented Oct 4, 2024

Transferring to the same device, into deltachat-repl, using the receive-backup ... command, worked. This makes it harder to reproduce, as I now have to use a phone to scan.


link2xt commented Oct 4, 2024

Sending from deltachat-repl to the phone fails:

> send-backup
2024-10-04T19:48:26.159051Z  INFO deltachat_repl: src/sql.rs:834: Start housekeeping...
2024-10-04T19:48:26.378066Z  INFO deltachat_repl: src/sql.rs:873: 12089 files in use.
2024-10-04T19:48:26.381472Z  INFO deltachat_repl: src/sql.rs:925: Housekeeping: Keeping new unreferenced file #1: "welcome-image.jpg".
2024-10-04T19:48:26.946491Z  INFO deltachat_repl: src/sql.rs:815: Housekeeping done.

2024-10-04T19:48:34.438383Z  INFO deltachat_repl: Received IMEX_PROGRESS(300 ‰)
QR code: DCBACKUP2:
2024-10-04T19:49:38.897350Z  INFO deltachat_repl: src/imex/transfer.rs:179: Received valid backup authentication token.
2024-10-04T19:49:38.962714Z ERROR iroh_quinn_udp::imp: got transmit error, halting segmentation offload

So now I can recompile repl and try again.


link2xt commented Oct 4, 2024

I have a database that is 621 MB in size. The transfer appeared to be stuck at 12% on the receiver side, but then it got to 13% and a progress bar appeared on the sender as well:

38.2s [i]core/event: 1 src/imex/transfer.rs:179: Received valid backup authentication token.
stderr: 2024-10-04T20:29:59.425232Z ERROR iroh_quinn_udp::imp: got transmit error, halting segmentation offload
[75431:1004/203022.831606:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 3 times!
195.8s [D]core/event: ImexProgress 1 { kind: 'ImexProgress', progress: 11 }
196.1s [D]core/event: ImexProgress 1 { kind: 'ImexProgress', progress: 12 }
196.3s [D]core/event: ImexProgress 1 { kind: 'ImexProgress', progress: 13 }
196.7s [D]core/event: ImexProgress 1 { kind: 'ImexProgress', progress: 14 }

So it just took a long time (2.6 minutes) to transfer the 600 MB database, and the progress bar on the sender only appears once the database has been transferred. EDIT: made a fix in #6027

Another annoyance is that cancelling the backup transfer apparently did not work:

209.9s [D]core/event: ImexProgress 1 { kind: 'ImexProgress', progress: 41 }
210.2s [D]core/event: ImexProgress 1 { kind: 'ImexProgress', progress: 42 }
210.3s [i]core/event: 1 src/context.rs:692: Signaling the ongoing process to stop ASAP.
210.4s [i]core/event: 1 src/context.rs:698: No ongoing process to stop.
210.6s [D]core/event: ImexProgress 1 { kind: 'ImexProgress', progress: 43 }

So all in all backup transfer works, but you have to be really patient: with a database several gigabytes in size you may have to wait 10-20 minutes without any feedback. Maybe start tcpdump and watch the packets flowing :)

I will at least look into starting the progress bar on the sender sooner, rather than only after the whole database has been transferred.
