Fixes for typo and directory changes in DYAD #40
@hariharan-devarajan it looks like the build was OK, at least in the CI! Apologies for the path changes without updating the notebook; I didn't even open it because I wouldn't have been oriented to work on it. I'll leave you and @ilumsden to work on this, but please ping me if/when it's ready for review.
A few nits preparing for future review:
…-aws add: flux radiuss tutorial 2024
@hariharan-devarajan regarding the In my experience, errors like this usually occur on ARM systems (e.g., newer Macs) due to an architecture mismatch between the host system and the container. If this is the issue, it's easily solved using the
Thanks @ilumsden. I am currently stuck with some deadlines. I would appreciate it if you could test the tutorial. If you run into any issues, I can probably help you out. Let me know whether you can try this out.
Sounds good. That's my top priority this week (unless I need to do more work on a paper, but I suspect my main work for that paper is done).
Problem: The tutorial series this year is no longer RADIUSS, but renamed to HPCIC. Solution: Rename all assets to HPCIC/hpcic, will await merge on confirmation from the hpcic/radiuss team! Signed-off-by: vsoch <[email protected]>
rename: radiuss 2024 to hpcic 2024
@vsoch the DYAD notebook is now fully working. I tested it on an x86 instance from Jetstream cloud. I still have to rebase this PR, but, other than that, this is ready for your review.
96d0903 to 45ec7b7
@hariharan-devarajan since the
What do you mean by edit? Do you mean base it off master? The changes I made were minor. Could we just apply the changes in a new PR?
This commit corrects logic in the PyTorch data loader for DYAD. It also makes various corrections to the text in the DYAD notebook.
The flux-sched image for Ubuntu Jammy has a system install of UCX 1.12.0. However, we want to use UCX 1.13.1 with DYAD. This commit updates LD_LIBRARY_PATH to point to UCX 1.13.1 to prevent runtime issues with DYAD.
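As a rough illustration of the LD_LIBRARY_PATH change described in this commit message, the sketch below builds an environment in which a chosen library directory is searched before the system ones. The helper name `prepend_library_path` and the directory `/opt/ucx-1.13.1/lib` are hypothetical; the actual UCX install location in the image is not stated in this thread.

```python
import os


def prepend_library_path(new_dir, env=None):
    """Return a copy of env with new_dir placed first on LD_LIBRARY_PATH."""
    env = dict(os.environ if env is None else env)
    current = env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries and any earlier copy of new_dir so it appears only once.
    parts = [p for p in current.split(":") if p and p != new_dir]
    env["LD_LIBRARY_PATH"] = ":".join([new_dir] + parts)
    return env


# Hypothetical install prefix for the newer UCX build (an assumption).
UCX_LIB_DIR = "/opt/ucx-1.13.1/lib"

example = prepend_library_path(UCX_LIB_DIR, {"LD_LIBRARY_PATH": "/usr/lib:/usr/local/lib"})
print(example["LD_LIBRARY_PATH"])  # /opt/ucx-1.13.1/lib:/usr/lib:/usr/local/lib
```

The dynamic loader reads LD_LIBRARY_PATH when a process starts, so an environment built this way would need to be passed to a child process (for example via the `env` argument of `subprocess.run`) rather than set inside a process that has already loaded the library.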
In light of the name change of DLIO Profiler to DFTracer, this commit updates the env file created in the DYAD notebook to use the new names for environment variables.
This commit fixes a bug in the DYAD PyTorch data loader that causes 'brokers_per_node' to be referenced before it is set.
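The general class of bug named in this commit message (a value read before every code path has assigned it) can be sketched as follows. This is not the DYAD data loader: the class `LoaderConfig`, its parameters, and the `node_of` method are hypothetical stand-ins used only to show one common way of avoiding the problem, which is to assign a default before any conditional logic.

```python
class LoaderConfig:
    """Hypothetical stand-in; not the actual DYAD PyTorch data loader."""

    def __init__(self, num_brokers, num_nodes=None):
        # Assign a default first so the attribute exists on every path,
        # including the one where the branch below is skipped.
        self.brokers_per_node = 1
        if num_nodes:
            self.brokers_per_node = max(1, num_brokers // num_nodes)

    def node_of(self, broker_rank):
        # Safe to read here: brokers_per_node is always set by __init__.
        return broker_rank // self.brokers_per_node


print(LoaderConfig(8, 2).node_of(5))  # 1
```

Without the default assignment, constructing the object with `num_nodes=None` and then calling `node_of` would raise an `AttributeError`.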
This commit tweaks the DLIO config file to use forking for multiprocessing instead of spawning.
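For readers unfamiliar with the distinction in this commit message, here is a minimal sketch using only Python's standard `multiprocessing` module. It does not show the DLIO config file, whose keys are not given in this thread; the function names `square` and `run_with` are illustrative. With `fork`, worker processes start as copies of the parent and inherit its loaded modules and state; with `spawn`, each worker starts a fresh interpreter and re-imports the main module.

```python
import multiprocessing as mp


def square(x):
    return x * x


def run_with(start_method, values):
    """Map square over values using a pool created with the given start method."""
    ctx = mp.get_context(start_method)  # e.g. "fork" or "spawn"
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # "fork" is only available on Unix-like systems.
    if "fork" in mp.get_all_start_methods():
        print(run_with("fork", [1, 2, 3]))  # [1, 4, 9]
```

The `if __name__ == "__main__":` guard matters mostly for `spawn`, where the child re-imports the script; it is harmless under `fork`.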
This commit changes cpu-affinity to off when running DLIO for training, for consistency.
45ec7b7 to 075be85
We could do that if you'd prefer. The stuff I've added to get everything working is pretty minor too.
Please rebase before review - thank you!
@vsoch @hariharan-devarajan so that we can merge into @vsoch please review that PR.
So, the new structure didn't copy the images in the right place. Additionally, the path changes were not propagated into the notebook.
I fixed all the typos, but the image needs a fresh install of dlio_benchmark, as the architecture does not seem to match: I am getting an `illegal instructions` error. @ilumsden, can you help me fix the image so that I can continue testing? Some things still left to do:
Module 4: the file 04_flux_tutorial_conclusions.ipynb is not found.
Then to run the code,