[gcp-dataproc] Dataproc open source component integration tests flakeyness - regional mirroring? #1051
Comments
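Both conda-forge and nvidia channels should be available by CDN via Cloudflare. Am curious why in this case it appears to be going to Anaconda.org directly? |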
It might just be as easy as specifying a mirror in the call to mamba/conda.
https://github.com/cjac/initialization-actions/blob/rapids-20240806/rapids/rapids.sh#L473
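A minimal sketch of what "specifying a mirror in the call" could look like, assuming the mirror serves each channel under a common base URL. The function name `conda_install_args` and the mirror host below are hypothetical and not taken from rapids.sh; `--override-channels` and `-c` are standard conda/mamba options.

```python
from typing import List


def conda_install_args(packages: List[str], channels: List[str], mirror_base: str) -> List[str]:
    """Build a conda install command line that resolves channels from a mirror.

    mirror_base is a hypothetical mirror root such as
    "https://mirror.example.internal/conda"; each channel is assumed to be
    served at <mirror_base>/<channel>.
    """
    # --override-channels keeps conda from also consulting the channels in .condarc.
    args = ["conda", "install", "--yes", "--override-channels"]
    for channel in channels:
        # -c accepts a full channel URL as well as a bare channel name.
        args += ["-c", f"{mirror_base.rstrip('/')}/{channel}"]
    return args + list(packages)
```

The returned list could be passed to `subprocess.run`, or joined into a string for a shell script; swapping `"conda"` for `"mamba"` should not change the channel options.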
On Fri, Oct 25, 2024, 18:21 jakirkham wrote:
Both conda-forge and nvidia channels should be available by CDN via
Cloudflare. Am curious why in this case it appears to be going to
Anaconda.org directly?
|
What I mean is that this should already be happening by default. For example, note the last line in the output below:
$ curl -I https://conda.anaconda.org/conda-forge
HTTP/2 302
date: Sat, 26 Oct 2024 01:49:17 GMT
content-type: text/html; charset=utf-8
location: https://anaconda.org/conda-forge/repo?type=conda&label=main
…
server: cloudflare
The fact that the query above is not getting through suggests there is some other kind of network issue. I'm not sure whether that is somewhere within CI or in some other infrastructure between that build and the CDN (a security protocol, for example). It might be worth trying some simple network diagnostics outside of Conda at this point to isolate issues like this. |
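For running the same check from a script on a cluster node, here is a rough Python equivalent of the `curl -I` call above, using only the standard library. The helper names `head` and `served_by_cloudflare` are made up for this sketch. Like `curl -I`, it does not follow the 302, so the headers come from the first hop.

```python
import urllib.error
import urllib.request
from typing import Dict


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so the 3xx response itself can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def head(url: str, timeout: float = 10.0) -> Dict[str, str]:
    """Send a HEAD request and return lower-cased response headers plus ':status'."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            status, raw = response.status, response.headers
    except urllib.error.HTTPError as err:
        # Redirects and other non-2xx replies land here but still carry headers.
        status, raw = err.code, err.headers
    headers = {key.lower(): value for key, value in raw.items()}
    headers[":status"] = str(status)
    return headers


def served_by_cloudflare(headers: Dict[str, str]) -> bool:
    """True when the 'server' header names Cloudflare, as in the curl output."""
    return headers.get("server", "").strip().lower() == "cloudflare"
```

Something like `served_by_cloudflare(head("https://conda.anaconda.org/conda-forge"))` would then report whether the request reached the CDN. A timeout or connection error raised by `head` would point at the network path rather than at Conda.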
|
Hello folks, it looks like this is becoming a problem. I'm sorry for swamping your service. Let's get a regional conda mirror set up as part of the product I'm producing. Can you please direct me to the best instructions on mirroring the full conda archive? I will work on bringing up a load balancer to direct the traffic to our local mirror and take that load off of your infrastructure. |
Were you able to run the command suggested above ( #1051 (comment) )? It would be good to know if Cloudflare (the CDN provider used for conda-forge) is actually used in your case or not |
oops! Sorry, I think I missed that. |
Oh, sorry! I didn't know you were asking me to run that command from the context of one of the cluster nodes being installed to. Here is that output now.
|
Do these channels make a difference? Are those mirrored as well? -c conda-forge -c nvidia -c rapidsai |
This looks like it might be what I need: |
Sorry for being unclear. Thanks for the info! 🙏 Ok, so you are able to reach the CDN through … Both … Currently … Let's see if someone can help before going down the mirroring route. @jezdez could you please help us look into this? |
Okay. I started down the mirroring route because it might be faster to have a local copy. Let me compare and let you know whether it's too much effort to maintain a mirror for use with my reproduction environment. I've got a couple of files in my example: sync-mirror.sh is run on an instance created using create-conda-mirror.sh. Please pardon the mess; I re-used some code I was using for a different purpose. The docs that I read about mirrors suggested that attaching GPUs to the mirror host might help accelerate things, too, so I used the latest rapids image and attached 4x T4s. |
Wow. It looks like I got cut off.
root@dpgce-conda-mirror-us-west4:~# links https://conda.anaconda.org/defaults/linux-64/repodata.json
|
It looks like I was attempting to mirror portions of the repo that I don't need and that won't help our cache. The current implementation looks promising.
The first attempt resulted in a mirror of ~120GB. I think it may have been the nvidia channel alone: I passed multiple instances of the --upstream-channel argument, and conda-mirror took only the last one. Having learned from this mistake, I have split the previous simple, incorrect single conda-mirror call into concurrent conda-mirror calls, each in its own screen tab. Since this is a long-running process, it's best that it not fail when a terminal is detached. Once all of the tabs have completed, the screen session terminates and returns control to the sync-mirror.sh shell process.
I am about 20 minutes into this latest run. It picked up the mirroring where it had left off, despite the deletion of the previous VM that had been running it. I increased the memory and CPU count so that it can accommodate three concurrent conda-mirror processes. Here's a snapshot of disk usage.
|
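The one-process-per-channel approach described above can be sketched in Python. This is not the sync-mirror.sh from the thread: it uses a thread pool in place of screen tabs, and the target layout `<root>/<channel>` is an assumption chosen to match the file:// paths mentioned later. `--upstream-channel`, `--target-directory` and `--platform` are conda-mirror options; the function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def mirror_command(channel: str, target_root: str, platform: str = "linux-64") -> List[str]:
    """Build one conda-mirror invocation for exactly one upstream channel."""
    return [
        "conda-mirror",
        "--upstream-channel", channel,  # one channel per process: only the last repeat was honoured
        "--target-directory", f"{target_root.rstrip('/')}/{channel}",
        "--platform", platform,
    ]


def mirror_all(channels: Sequence[str], target_root: str, platform: str = "linux-64") -> Dict[str, int]:
    """Run one conda-mirror process per channel concurrently; return exit codes by channel."""

    def run(channel: str) -> int:
        return subprocess.run(mirror_command(channel, target_root, platform)).returncode

    # Each thread only waits on its child process, so threads are sufficient here.
    with ThreadPoolExecutor(max_workers=max(1, len(channels))) as pool:
        return dict(zip(channels, pool.map(run, channels)))
```

Calling `mirror_all(["conda-forge", "rapidsai", "nvidia"], "/var/www/html")` would start three processes, which matches the three concurrent conda-mirror processes the VM was resized for. Unlike a screen session, this does not survive a detached terminal on its own, so it would need to run under something like nohup or a systemd unit.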
This question moved to a different forum |
https://conda.anaconda.org/main/linux-64/repodata.json is the correct repodata URL for Anaconda Distribution |
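Since the first mirror attempt came out at ~120GB, it may help to size a channel from its repodata.json before mirroring it. The sketch below assumes the usual repodata layout, with `packages` and `packages.conda` maps whose records carry a `size` in bytes; the function names are hypothetical.

```python
import json
from typing import Any, Dict


def load_repodata(path: str) -> Dict[str, Any]:
    """Read a previously downloaded repodata.json from disk."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def estimate_mirror_bytes(repodata: Dict[str, Any]) -> int:
    """Sum the 'size' of every package record in one platform's repodata."""
    total = 0
    # .tar.bz2 and .conda artifacts are listed separately; a full mirror holds both.
    for section in ("packages", "packages.conda"):
        for record in repodata.get(section, {}).values():
            total += int(record.get("size", 0))  # records without a size count as 0
    return total
```

Running this over each `<channel>/<platform>/repodata.json` gives a rough upper bound for the disk a full mirror of that platform would need.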
@cjac I'm not aware of any throttling from GCP. The original issue seems to have been a transient connection error. Is this really still happening from GCP? The channels are hosted on Cloudflare CDN. For the other questions, if this relates to commercial support for GCP-related services, this isn't the right repo to raise an issue; please reach out through your Anaconda support channels instead. |
I have not tried to reproduce the issue yet. I'm going to finish building a mirror and use a locally mounted filesystem with the packages on it to provide the conda-forge, rapidsai and nvidia channels. Once the mirror is up, probably by Monday, I will try the build of the rapids image again, this time using file:///var/www/html/«channel» instead of https://conda.anaconda.org/«channel». I can then share the example instructions on how to build and utilize a conda mirror, and close this issue. |
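Before pointing a build at file:///var/www/html/«channel», a quick layout check can catch an incomplete mirror. This sketch assumes the standard channel layout `<channel>/<platform>/repodata.json` and that conda generally expects a `noarch` subdirectory alongside the platform one; the helper names are hypothetical.

```python
from pathlib import Path
from typing import List, Sequence


def missing_repodata(channel_dir: str, platforms: Sequence[str] = ("linux-64", "noarch")) -> List[str]:
    """Return the platform subdirectories that lack a repodata.json."""
    root = Path(channel_dir)
    return [p for p in platforms if not (root / p / "repodata.json").is_file()]


def channel_file_url(channel_dir: str) -> str:
    """Turn a local channel directory into a file:// URL usable with conda's -c."""
    # as_uri() requires an absolute path, so make it absolute first.
    return Path(channel_dir).absolute().as_uri()
```

If `missing_repodata("/var/www/html/conda-forge")` comes back empty, `channel_file_url("/var/www/html/conda-forge")` yields the URL to pass in place of the conda.anaconda.org one.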
The mirror has been built, but it seems conda does an extra write of ~15G to the temp directory, much of which could be skipped when the source is on a file:// path. In any case, the code which I used to build the anaconda mirror can be found here: https://github.com/cjac/dataproc-repro/blob/conda-mirror-20241031/lib/mirror/sync-conda.pl On a 96 core machine, I believe that it could mirror the channels we use in about 8 hours. |
Updating this issue to note that … So, to summarize, the following channels are available via CDN:
|
I've worked around it a bit and plan to cache successful … I'll try running a few builds tomorrow to check. |
Checklist
What is the idea?
Hello folks,
I've been maintaining the github.com/GoogleCloudDataproc/initialization-actions repository for a bit now, and I'm seeing some flakey tests. The tests are installing dask from conda.anaconda.org. Would we be able to avoid this by using a regional GCP mirror of the conda packages? How complex is it to maintain a mirror with CVE updates?
Why is this needed?
Reduce load on the global mirrors and keep the installer's resources local to GCP.
What should happen?
A mirror with CVE updates is created for each GCP region.
Additional Context
Tests were run during work on this pull request.
GoogleCloudDataproc/initialization-actions#1219