
Large pushes fail because they take too long #11

Open
myieye opened this issue May 6, 2024 · 3 comments

@myieye (Collaborator) commented May 6, 2024

We recently had a push of almost 2GB (surprisingly it only took about 3 min for the client to upload it).
Here's an overview of when things broke:
[screenshot attached in the original issue]

It failed on the last chunk, because that's when the server actually starts doing the heavy lifting: trying to apply the commit.

And here's what I think/know happened:

  • Chorus sends the last chunk
    • The resumable server detects that it's the last chunk
    • It calls unbundle
    • Which calls hg incoming, which takes ~3.5m (it gets logged to /var/cache/hgresume/<transaction-ID>.bundle.incoming.async_run)
  • Because the request takes so long, the PHP script presumably gets torn down while it's waiting for the hg incoming command to finish (which does finish, because it runs in its own process)
  • Because the PHP script gets torn down, no work actually happens: it never gets to running the hg unbundle command and creating a lockfile for that command (a lockfile is created for hg incoming, but that's a separate file)
  • Because the lockfile doesn't get created, when Chorus retries the push-bundle, the server throws an Exception (see the sketch after this list)
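
In pseudo-PHP, the check that blows up looks roughly like this (names and paths are hypothetical, not the actual hgresume source):

```php
<?php
// Sketch of the failure mode: on the retried push-bundle, the server checks for
// the lockfile that hg unbundle would have created. The torn-down script never
// got that far, so the check throws instead of resuming the work.

function assertUnbundleInProgress(string $transactionId): void {
    // hg incoming writes its own, separate lockfile; its existence doesn't count here.
    $unbundleLock = "/var/cache/hgresume/$transactionId.unbundle.lock";

    if (!file_exists($unbundleLock)) {
        // This is the exception Chorus sees: as far as the server can tell,
        // nothing is running, even though hg incoming finished successfully.
        throw new Exception("Expected lockfile missing for transaction $transactionId");
    }
}
```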

Do we want to allow big pushes like this? I think so! So how:

  • It's fine if Chorus times out as long as the job actually happens and we return a more meaningful response. The code tries to return a 200 but fails, because the lockfile it's expecting doesn't exist. The exception is good, because a missing lockfile means nothing is happening.

So we either need to:

  1. Move more stuff into an external command that doesn't get torn down 🙁
  2. Prevent the PHP script from getting torn down (e.g. move large pushes to a Lexbox Job and make sure we turn off everything that might kill a long PHP script; a minimal sketch of the PHP side follows this list)
  3. Make the retries smarter and have them pick up where the last one died
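
For reference, on the PHP side option 2 would at minimum mean something like the following (just a guess at the relevant settings; the web server and Cloudflare timeouts would also have to be raised separately):

```php
<?php
// Hypothetical settings for option 2: stop PHP itself from tearing the script down.

ignore_user_abort(true); // keep running even if the client or proxy drops the connection
set_time_limit(0);       // remove PHP's max_execution_time for this request
```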

I think 3 sounds like the best bet. Something like:

  • Replace the exception-throwing isComplete check with something that anticipates this scenario (sketched after this list):
    • If there's no lockfile, retry the unbundle
    • In the unbundle, detect if hg incoming already ran and if so:
      • Do the necessary validation
      • Then start hg unbundle
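
A minimal sketch of what that retry could look like, using the async-run log path mentioned above but otherwise hypothetical names (this is not the existing hgresume code):

```php
<?php
// Hypothetical resume logic for option 3: on a retried push-bundle, don't throw
// just because the unbundle lockfile is missing; first check whether hg incoming
// already finished in its own process, and if so, validate and start hg unbundle.

function resumeOrStartUnbundle(string $transactionId, string $bundlePath, string $repoPath): void {
    $cacheDir     = '/var/cache/hgresume';
    $unbundleLock = "$cacheDir/$transactionId.unbundle.lock";   // assumed name
    $incomingLog  = "$cacheDir/$transactionId.bundle.incoming.async_run";

    if (file_exists($unbundleLock)) {
        // Unbundle is already running; let the normal status check report progress.
        return;
    }

    if (!file_exists($incomingLog)) {
        // Nothing ever started for this transaction, so this really is a failure.
        throw new Exception("No prior work found for transaction $transactionId");
    }

    // hg incoming completed even though the original PHP request was torn down.
    // Do the validation that would normally follow it before moving on.
    $incomingOutput = file_get_contents($incomingLog);
    if (strpos($incomingOutput, 'abort:') !== false) {
        throw new Exception("hg incoming reported a problem for $transactionId");
    }

    // Create the lockfile the retry path expects, then run hg unbundle in the
    // background so it survives if this request is also torn down.
    // (Cleaning up the lockfile when the job finishes is left out of this sketch.)
    touch($unbundleLock);
    $logFile = "$cacheDir/$transactionId.unbundle.log";
    $cmd = sprintf(
        'cd %s && hg unbundle %s > %s 2>&1 &',
        escapeshellarg($repoPath),
        escapeshellarg($bundlePath),
        escapeshellarg($logFile)
    );
    exec($cmd);
}
```
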
@hahn-kev (Collaborator) commented May 6, 2024

I think the hard part about this is that, to do it right, we need to update the client to work properly with async jobs: it should anticipate that the work may still be running and check whether the job has finished after the last push completes. Otherwise the client thinks the job failed and tells the user as much, when it didn't actually fail, and the user is left very confused.
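
For illustration only (there is no such endpoint today, and all names and paths here are made up), the server half of that could be a small status check the client polls after sending its final chunk:

```php
<?php
// Hypothetical job-status endpoint: instead of holding the push-bundle request
// open until unbundle finishes, the client would poll this until the state flips.

$transactionId = $_GET['transId'] ?? '';
if (!preg_match('/^[A-Za-z0-9\-]+$/', $transactionId)) {
    http_response_code(400);
    exit('invalid transaction id');
}

// Assumed convention: a small state file the unbundle job keeps up to date.
$statePath = "/var/cache/hgresume/$transactionId.bundle.state";
$state = file_exists($statePath)
    ? trim(file_get_contents($statePath))   // e.g. RUNNING, COMPLETE, FAILED
    : 'UNKNOWN';                            // no record at all: treat as failed

header('Content-Type: application/json');
echo json_encode(['transId' => $transactionId, 'state' => $state]);
```

The client change is the bigger piece: Chorus would have to treat a timeout on the last chunk as "poll for the result" rather than "the push failed".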

@LHayashi

In most cases, you won't have as many media files to deal with initially, although with sound recording available in The Combine it will likely become more of an issue. Once the repo is set up with all of those media files, the subsequent changes will be much smaller (unless someone has to redo the repo from scratch for some reason) and probably won't cause these issues. Are there other ways of initially populating a repo apart from the S/R process? It might be more efficient to have a human with special privileges do this than to fix this issue.

And this discussion brings up the need for a media server that works with Combine, FLEx, LF, etc. :)

@hahn-kev (Collaborator)

If you can, I'd try to work off the docker branch I made; the commit history is kind of a mess atm.
I would be interested in knowing whether returning something over the wire before the hg incoming command finishes helps: often Cloudflare will kill the request if it thinks the server is dead because it hasn't responded at all, but if we could trickle out some data, that should keep the connection alive. I have no idea how to do that in PHP, so we could do it in dotnet if needed, but it would obviously make more sense to do it here.
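
As a rough guess at what that could look like in PHP (untested; the done-marker path is invented, and a reverse proxy may still buffer the output unless configured not to):

```php
<?php
// Keep the HTTP response alive while a long hg command runs in its own process,
// by periodically emitting a byte and flushing it out toward the proxy.

ignore_user_abort(true);   // don't stop if the client or proxy gives up on us
set_time_limit(0);         // remove PHP's own execution time limit

header('Content-Type: text/plain');
while (ob_get_level() > 0) {
    ob_end_flush();        // drop output buffering so flush() reaches the wire
}

// Hypothetical marker file the background command would create when finished.
$doneMarker = '/var/cache/hgresume/<transaction-ID>.unbundle.done';

while (!file_exists($doneMarker)) {
    echo ' ';              // padding the client would have to tolerate/ignore
    flush();
    sleep(5);
}

echo "OK\n";
flush();
```

The obvious catch is that Chorus would have to tolerate the padding bytes before the real response arrives.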
