Slow CDN mirroring #892
Comments
We see some errors in the channel-clone VMs, we are checking if this is related |
Thanks Stefan! 🙏 Interested to learn what you discover 🙂 |
Another example after John referred me to this issue: 2+ hours after marking some packages as broken in conda-forge/admin-requests@164a427, packages are still not removed from the CDN (it's difficult to do quickfixes to wide-ranging breaks like this one if the turnaround is that long).
|
We see some files generated by anaconda.org's dynamic repodata like |
Interesting, thanks Daniel! 🙏 Is there something in regards to those packages specifically that looks relevant? Or did something unrelated to those packages occur (like a network outage or running out of disk space)? |
Were we able to determine the cause here? |
We were able to find packages that have repodata but no downloadable archive (see missing-packages.txt). We also fixed a bug that is more likely to be the cause: the CDN process had trouble re-downloading a package if the first clone failed. We were not able to find the precise cause. |
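For anyone wanting to reproduce the "repodata but no downloadable archive" check from the comment above, here is a minimal sketch. It is not the tooling Anaconda used. It assumes the usual `repodata.json` layout with `packages` and `packages.conda` mappings keyed by filename; the names `archive_exists` and `missing_archives` and the base URL in the usage note are hypothetical.

```python
import urllib.request
from typing import Any, Callable


def archive_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request for the archive URL answers with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def missing_archives(
    repodata: dict[str, Any],
    base_url: str,
    exists: Callable[[str], bool] = archive_exists,
) -> list[str]:
    """List filenames that appear in repodata but cannot be downloaded."""
    # Assumption: filenames are the keys of "packages" and "packages.conda".
    filenames = list(repodata.get("packages", {}))
    filenames += list(repodata.get("packages.conda", {}))
    prefix = base_url.rstrip("/")
    return sorted(name for name in filenames if not exists(f"{prefix}/{name}"))
```

Usage would look something like `missing_archives(json.load(open("repodata.json")), "<channel subdir URL>")`. The `exists` parameter is injectable so the check can be tested or rate-limited without touching the network.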
Gotcha thanks for the update Daniel! 🙏 Looked at the first one on the list, The second package, Looking at the latter two cases, do not see them in Idk if we can have aborted copy with the conda-forge validation service that might generate these issues, but that seems like one question that comes out of this cc @beckermr (in case I'm missing anything here) |
We download the dynamic anaconda.org repodata.json before creating the CDN version. |
Can you please remind me which URL that lives under? |
The logs show that the CDN clone process downloaded that archive from https://conda-web.anaconda.org/conda-forge/linux-64/clangdev-18.1.2-default_h127d8a8_0.conda, had a bad archive at 2024-03-21T02:29:30 and was able to get a good archive at 2024-03-21T06:12 |
Not that I pretend to understand the cloning mechanism (or the reasons why it might fail), but would it make sense to have a shorter retry loop for failed clones? Like try again immediately after, or after X minutes delay? |
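To make the suggestion above concrete, here is a rough sketch of a short retry loop with a fixed delay. It says nothing about how the actual cloning process is implemented; `retry_with_delay`, its parameters, and the `operation` callable are all hypothetical names for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_delay(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying after a fixed delay when it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # a real clone step would catch narrower errors
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)  # wait before the next attempt
    assert last_error is not None
    raise last_error
```

Passing `delay_seconds=0` gives the "try again immediately" variant; the `sleep` argument is only there so the loop can be exercised without actually waiting.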
It does retry frequently, there may be an intermediate cache issue. |
The CDN appears to be down again. |
Approaching the 500min mark 😬 Should that metric be part of https://anaconda.statuspage.io/? |
We've addressed a disk-full issue. |
Just ran into a network issue: conda.CondaMultiError: ('Connection broken: IncompleteRead(199522674 bytes read, 79146888 more expected)', IncompleteRead(199522674 bytes read, 79146888 more expected))
('Connection broken: IncompleteRead(199522674 bytes read, 79146888 more expected)', IncompleteRead(199522674 bytes read, 79146888 more expected)) Wondering if this is related |
Yes, I think we should start tracking this publicly somehow |
CDN is at 37 minutes. |
Should be resolved. |
Last sync was done almost 10h ago now. |
I confirmed it was not updated for 24 hours. |
|
Thanks Daniel! 🙏 Debug info:
|
Let's try reducing the cache TTL. |
That looks promising
|
It looks like both packages are now available:
This latest delay is most likely due to the issues with the anaconda.org backend (xref: #899). The channel cloning process relies on various calls to .org's API; the .org database was sporadically triggering the OOM (out-of-memory) killer on the backend host. Anaconda's infrastructure team has expanded memory and scaling allocation for the database backend, and that should help stabilize things again. |
Could we write a process that periodically checks repodata-clone.json versus repodata.json to keep track of any delay between packages appearing in the former versus the latter? |
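A minimal sketch of the comparison step such a process might run, assuming both files follow the usual repodata layout with `packages` and `packages.conda` mappings keyed by filename. The function names are hypothetical, and fetching the two files on a schedule is left out.

```python
from typing import Any


def package_filenames(repodata: dict[str, Any]) -> set[str]:
    """Collect every package filename listed in a repodata document."""
    # Assumption: filenames are the keys of "packages" and "packages.conda".
    names = set(repodata.get("packages", {}))
    names |= set(repodata.get("packages.conda", {}))
    return names


def not_yet_propagated(source: dict[str, Any], target: dict[str, Any]) -> list[str]:
    """Return filenames present in `source` but still absent from `target`."""
    return sorted(package_filenames(source) - package_filenames(target))
```

Running this periodically and recording when each filename first drops out of the result would give a per-package delay between the two files.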
Thanks Cheng and Daniel! 🙏 Daniel, think that is a good idea. If there is some way to share log details or maybe graphs on resource usage, that might help as well. We were also wondering if it would make sense to have a GH template for CDN issues ( #912 ). Are there specific pieces of info we should be capturing that would help narrow things down? |
Seeing issues today with https://anaconda.org/conda-forge/hyperion-fortran/files, not available after 2h |
To know whether recent packages can be pulled, one can watch the last update of
|
It looks like it is taking over 1hr for some packages to mirror. For example:
Do we know what might be causing this?