
Large pushes fail because they take too long #11

Open
myieye opened this issue May 6, 2024 · 3 comments

@myieye (Collaborator) commented May 6, 2024

We recently had a push of almost 2GB (surprisingly it only took about 3 min for the client to upload it).
Here's an overview of when things broke:
[screenshot attached in the original issue]

It failed on the last chunk, because that's when the server actually starts doing the heavy lifting: trying to apply the commit.

And here's what I think/know happened:

  • Chorus sends the last chunk
    • The resumable server detects that it's the last chunk
    • It calls unbundle
    • Which calls hg incoming, which takes ~3.5m (it gets logged to /var/cache/hgresume/<transaction-ID>.bundle.incoming.async_run)
  • Because the request takes so long, the PHP script presumably gets torn down while it's waiting for the hg incoming command to finish (which does finish, because it runs in its own process)
  • Because the PHP script gets torn down, no work actually happens: it never gets to running the hg unbundle command and creating a lockfile for that command (a lockfile is created for hg incoming, but that's a separate file)
  • Because the lockfile doesn't get created, when Chorus retries the push-bundle, the server throws an Exception (see the sketch after this list)
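
In pseudo-PHP, the check that blows up looks roughly like this (names and paths are hypothetical, not the actual hgresume source):

```php
<?php
// Sketch of the failure mode: on the retried push-bundle, the server checks for
// the lockfile that hg unbundle would have created. The torn-down script never
// got that far, so the check throws instead of resuming the work.

function assertUnbundleInProgress(string $transactionId): void {
    // hg incoming writes its own, separate lockfile; its existence doesn't count here.
    $unbundleLock = "/var/cache/hgresume/$transactionId.unbundle.lock";

    if (!file_exists($unbundleLock)) {
        // This is the exception Chorus sees: as far as the server can tell,
        // nothing is running, even though hg incoming finished successfully.
        throw new Exception("Expected lockfile missing for transaction $transactionId");
    }
}
```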

Do we want to allow big pushes like this? I think so! So how:

  • It's fine if Chorus times out as long as the job actually happens and we return a more meaningful response. The code tries to return a 200 but fails, because the lockfile it's expecting doesn't exist. The exception is good, because a missing lockfile means nothing is happening.

So we either need to:

  1. Move more stuff into an external command that doesn't get torn down 🙁
  2. Prevent the PHP script from getting torn down (e.g. move large pushes to a Lexbox Job and make sure we turn off everything that might kill a long PHP script; a minimal sketch of the PHP side follows this list)
  3. Make the retries smarter and have them pick up where the last one died
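
For reference, on the PHP side option 2 would at minimum mean something like the following (just a guess at the relevant settings; the web server and Cloudflare timeouts would also have to be raised separately):

```php
<?php
// Hypothetical settings for option 2: stop PHP itself from tearing the script down.

ignore_user_abort(true); // keep running even if the client or proxy drops the connection
set_time_limit(0);       // remove PHP's max_execution_time for this request
```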

I think 3 sounds like the best bet. Something like:

  • Replace the exception-throwing isComplete check with something that anticipates this scenario (sketched after this list):
    • If there's no lockfile, retry the unbundle
    • In the unbundle, detect if hg incoming already ran and if so:
      • Do the necessary validation
      • Then start hg unbundle
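
A minimal sketch of what that retry could look like, using the async-run log path mentioned above but otherwise hypothetical names (this is not the existing hgresume code):

```php
<?php
// Hypothetical resume logic for option 3: on a retried push-bundle, don't throw
// just because the unbundle lockfile is missing; first check whether hg incoming
// already finished in its own process, and if so, validate and start hg unbundle.

function resumeOrStartUnbundle(string $transactionId, string $bundlePath, string $repoPath): void {
    $cacheDir     = '/var/cache/hgresume';
    $unbundleLock = "$cacheDir/$transactionId.unbundle.lock";   // assumed name
    $incomingLog  = "$cacheDir/$transactionId.bundle.incoming.async_run";

    if (file_exists($unbundleLock)) {
        // Unbundle is already running; let the normal status check report progress.
        return;
    }

    if (!file_exists($incomingLog)) {
        // Nothing ever started for this transaction, so this really is a failure.
        throw new Exception("No prior work found for transaction $transactionId");
    }

    // hg incoming completed even though the original PHP request was torn down.
    // Do the validation that would normally follow it before moving on.
    $incomingOutput = file_get_contents($incomingLog);
    if (strpos($incomingOutput, 'abort:') !== false) {
        throw new Exception("hg incoming reported a problem for $transactionId");
    }

    // Create the lockfile the retry path expects, then run hg unbundle in the
    // background so it survives if this request is also torn down.
    // (Cleaning up the lockfile when the job finishes is left out of this sketch.)
    touch($unbundleLock);
    $logFile = "$cacheDir/$transactionId.unbundle.log";
    $cmd = sprintf(
        'cd %s && hg unbundle %s > %s 2>&1 &',
        escapeshellarg($repoPath),
        escapeshellarg($bundlePath),
        escapeshellarg($logFile)
    );
    exec($cmd);
}
```
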
@hahn-kev (Collaborator) commented May 6, 2024

I think the hard part about this is that, to do it right, we need to update the client to work properly with async jobs: it should anticipate that the work may still be running and check whether the job has finished after the last push completes. Otherwise the client thinks the job failed and tells the user as much, when it didn't actually fail, and the user is left very confused.
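
For illustration only (there is no such endpoint today, and all names and paths here are made up), the server half of that could be a small status check the client polls after sending its final chunk:

```php
<?php
// Hypothetical job-status endpoint: instead of holding the push-bundle request
// open until unbundle finishes, the client would poll this until the state flips.

$transactionId = $_GET['transId'] ?? '';
if (!preg_match('/^[A-Za-z0-9\-]+$/', $transactionId)) {
    http_response_code(400);
    exit('invalid transaction id');
}

// Assumed convention: a small state file the unbundle job keeps up to date.
$statePath = "/var/cache/hgresume/$transactionId.bundle.state";
$state = file_exists($statePath)
    ? trim(file_get_contents($statePath))   // e.g. RUNNING, COMPLETE, FAILED
    : 'UNKNOWN';                            // no record at all: treat as failed

header('Content-Type: application/json');
echo json_encode(['transId' => $transactionId, 'state' => $state]);
```

The client change is the bigger piece: Chorus would have to treat a timeout on the last chunk as "poll for the result" rather than "the push failed".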

@LHayashi

In most cases, you won't have as many media files to deal with initially, although with sound recording available in The Combine it will likely become more of an issue. Once the repo is set up with all of those media files, the subsequent changes will be much smaller (unless someone has to redo the repo from scratch for some reason) and probably won't cause these issues. Are there other ways of initially populating a repo apart from the S/R process? It might be more efficient to have a human with special privileges do this than to fix this issue.

And this discussion brings up the need for a media server that works with Combine, FLEx, LF, etc. :)

@hahn-kev (Collaborator)

If you can, I'd try to work off the docker branch I made; the commit history is kind of a mess atm.
I would be interested in knowing whether returning something over the wire before the hg incoming command finishes helps: often Cloudflare will kill the request if it thinks the server is dead because it hasn't responded at all, but if we could trickle out some data, that should keep the connection alive. I have no idea how to do that in PHP, so we could do it in dotnet if needed, but it would obviously make more sense to do it here.
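
As a rough guess at what that could look like in PHP (untested; the done-marker path is invented, and a reverse proxy may still buffer the output unless configured not to):

```php
<?php
// Keep the HTTP response alive while a long hg command runs in its own process,
// by periodically emitting a byte and flushing it out toward the proxy.

ignore_user_abort(true);   // don't stop if the client or proxy gives up on us
set_time_limit(0);         // remove PHP's own execution time limit

header('Content-Type: text/plain');
while (ob_get_level() > 0) {
    ob_end_flush();        // drop output buffering so flush() reaches the wire
}

// Hypothetical marker file the background command would create when finished.
$doneMarker = '/var/cache/hgresume/<transaction-ID>.unbundle.done';

while (!file_exists($doneMarker)) {
    echo ' ';              // padding the client would have to tolerate/ignore
    flush();
    sleep(5);
}

echo "OK\n";
flush();
```

The obvious catch is that Chorus would have to tolerate the padding bytes before the real response arrives.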
