
cutlass v3.1.0 #9

Merged: 5 commits merged into conda-forge:main on Jul 31, 2023

@regro-cf-autotick-bot (Contributor) commented May 24, 2023

It is very likely that the current package version for this feedstock is out of date.

Checklist before merging this PR:

  • Dependencies have been updated if changed: see upstream
  • Tests have passed
  • Updated license if changed and license_file is packaged

Information about this PR:

  1. Feel free to push to the bot's branch to update this PR if needed.
  2. The bot will almost always only open one PR per version.
  3. The bot will stop issuing PRs if more than 3 version bump PRs generated by the bot are open. If you don't want to package a particular version please close the PR.
  4. If you want these PRs to be merged automatically, make an issue with @conda-forge-admin,please add bot automerge in the title and merge the resulting PR. This command will add our bot automerge feature to your feedstock.
  5. If this PR was opened in error or needs to be updated please add the bot-rerun label to this PR. The bot will close this PR and schedule another one. If you do not have permissions to add this label, you can use the phrase @conda-forge-admin, please rerun bot in a PR comment to have the conda-forge-admin add it for you.

Closes: #6
Closes: #7
Closes: #10
Closes: #11

Dependency Analysis

We couldn't run dependency analysis due to an internal error in the bot. :/ Help is very welcome!

This PR was created by the regro-cf-autotick-bot. The regro-cf-autotick-bot is a service to automatically track the dependency graph, migrate packages, and propose package version updates for conda-forge. Feel free to drop us a line if there are any issues! This PR was generated by https://github.com/regro/cf-scripts/actions/runs/5073530612, please use this URL for debugging.

@conda-forge-webservices (Contributor)

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@h-vetinari (Member)

@conda-forge-admin, please rerender

h-vetinari mentioned this pull request on Jul 31, 2023

@h-vetinari (Member)

Ah:

CMake Warning at CMakeLists.txt:47 (message):
  CUTLASS 3.1.0 requires CUDA 11.4 or higher, and strongly recommends CUDA
  11.8 or higher.

Not sure this is going to work with CUDA 11.

@h-vetinari (Member)

@jakirkham @leofang @ngam @hmaarrfk
This seems to be working in principle, but after 5 hours we're still barely 50% through the compilation. I guess this is blowing up due to the extra architectures compared to the CUDA 11.2 builds:

-- CUDA Compilation Architectures: 70;72;75;80;86;87;89;90;90a

though it could of course also be related to the version bump. Would it make sense to trim this list a bit? (how?)

If that's not possible, would someone have the time or resources to build this locally?
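One way to trim the list, sketched under assumptions: this assumes the recipe forwards its architectures to CUTLASS's `CUTLASS_NVCC_ARCHS` CMake cache variable (the knob the CUTLASS README documents, e.g. `-DCUTLASS_NVCC_ARCHS=80`). The helper names `trim_archs` and `cmake_args` are hypothetical, and dropping 72 and 87 (the Jetson Xavier and Orin parts) is only an example choice for a build aimed at desktop and datacenter GPUs, not a recommendation from this thread.

```python
from typing import Iterable, List


def trim_archs(arch_list: str, drop: Iterable[str]) -> str:
    """Remove unwanted entries from a semicolon-separated CMake arch list."""
    unwanted = set(drop)
    kept = [arch for arch in arch_list.split(";") if arch and arch not in unwanted]
    return ";".join(kept)


def cmake_args(archs: str, source_dir: str = "..") -> List[str]:
    """Build a cmake argv (hypothetical helper).

    Assumes the CUTLASS_NVCC_ARCHS cache variable; as a single list element
    the ';' separators need no shell quoting.
    """
    return ["cmake", source_dir, f"-DCUTLASS_NVCC_ARCHS={archs}"]


if __name__ == "__main__":
    full = "70;72;75;80;86;87;89;90;90a"  # list reported in the build log above
    # Example choice only: drop the Jetson/embedded targets 72 and 87.
    trimmed = trim_archs(full, {"72", "87"})
    print(trimmed)  # 70;75;80;86;89;90;90a
    print(cmake_args(trimmed))
```

Passing the value as one argv element (for example to `subprocess.run`) avoids quoting the `;`, which a shell would otherwise treat as a command separator.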

@ngam commented Jul 31, 2023

I'd say trimming the arches is okay, but I am not sure whether people/developers use this package in more specialized ways. Having said that, this is already a pretty restrictive arch list… Last time I thought about this, I settled on 60, 70, 75, 80, 86, 89, 90, 90a for CUDA 12 (at least that's what I will propose for conda-forge to use with jaxlib, tensorflow, and pytorch).

By the way, the build progress with these CUDA-enabled packages can be misleading (usually not in our favor).

I will look into this soon if no response from more involved developers/maintainers 😺

@hmaarrfk (Contributor)

I was personally thinking of reviving a 1060 GPU of mine as a benchmarking computer. I guess that is a bad idea.

We kinda went through the architecture trimming before, and found that we would have to trim way too much for it to be acceptable.

Maybe we should start building packages on a per GPU generation level?

@h-vetinari (Member)

> Maybe we should start building packages on a per GPU generation level?

Yeah, but AFAIK we have no metadata to select the right generation through virtual packages or similar. We only have the driver version, and that doesn't map 1:1 to architectures.

@hmaarrfk (Contributor)

> we have no metadata

I mean, I feel like as more packages take longer and longer on CI, we have to find another solution. I'm also not a fan of the fat binaries; they are just slow to download...

@h-vetinari (Member)

I agree with you, but how do you imagine that would work? Perhaps conda needs a virtual package for the cuda arch of the machine?

@hmaarrfk (Contributor)

> virtual package for the cuda arch of the machine

Yeah, but you would have to assume that there is only one kind of GPU. I feel like this is a fair assumption; mixing GPUs on one machine seems like a bad idea...
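A per-generation scheme would need something on the client side that maps the machine's GPU to one build. A minimal sketch, assuming a single GPU kind per machine (as suggested above) and hypothetical per-arch build variants; `pick_variant` is an illustrative name, not a conda API. It only models cubin (SASS) compatibility, where code built for sm_XY runs on devices of compute capability X.Z with Z ≥ Y, and treats arch-specific targets such as 90a as exact-match only; PTX JIT fallback is ignored.

```python
from typing import Iterable, Optional, Tuple


def parse_arch(arch: str) -> Tuple[int, int, bool]:
    """Split an arch string like '86' or '90a' into (major, minor, arch_specific)."""
    specific = arch.endswith("a")
    digits = arch[:-1] if specific else arch
    return int(digits[:-1]), int(digits[-1]), specific


def pick_variant(device_cc: Tuple[int, int], available: Iterable[str]) -> Optional[str]:
    """Pick the best hypothetical build variant for a GPU of compute capability device_cc.

    Models SASS compatibility only: sm_XY runs on X.Z when Z >= Y, and
    'a' (arch-specific) targets need an exact match. PTX JIT is ignored.
    """
    major, minor = device_cc
    best, best_key = None, None
    for arch in available:
        a_major, a_minor, specific = parse_arch(arch)
        if a_major != major:
            continue  # cubins never cross major versions
        if specific and a_minor != minor:
            continue  # e.g. 90a only runs on 9.0
        if not specific and a_minor > minor:
            continue  # cannot run code built for a newer minor revision
        key = (a_minor, specific)  # closest minor first, then prefer the 'a' variant
        if best_key is None or key > best_key:
            best, best_key = arch, key
    return best


if __name__ == "__main__":
    variants = ["70", "75", "80", "86", "89", "90", "90a"]
    print(pick_variant((8, 7), variants))  # 86
    print(pick_variant((9, 0), variants))  # 90a
    print(pick_variant((6, 1), variants))  # None
```

Under this model a GTX 1060 (compute capability 6.1) gets no match from the current architecture list, which is the practical problem with reviving older cards.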

@hmaarrfk (Contributor)

In that same vein, I'm not sure where we stand on targeting newer x86-64 CPU architectures. I feel like we target quite old instruction sets and might benefit from a bump there too. Again, this may blow up the build matrix.
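For the CPU side, one common way to name "newer instruction sets" is the x86-64 psABI microarchitecture levels (x86-64-v2, -v3, -v4). A minimal sketch of classifying a machine by level, assuming its feature names are already normalized to the lowercase psABI names below (raw /proc/cpuinfo flags use some different spellings, e.g. SSE3 appears as `pni`, and that mapping is left out). `microarch_level` is a hypothetical helper, not part of conda or of conda/ceps#59.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Features each level adds on top of the previous one (x86-64 psABI levels).
# Names are normalized lowercase psABI names, an assumption of this sketch.
X86_64_LEVELS: List[Tuple[str, FrozenSet[str]]] = [
    ("x86-64-v2", frozenset({"cmpxchg16b", "lahf_sahf", "popcnt", "sse3",
                             "sse4_1", "sse4_2", "ssse3"})),
    ("x86-64-v3", frozenset({"avx", "avx2", "bmi1", "bmi2", "f16c", "fma",
                             "lzcnt", "movbe", "osxsave"})),
    ("x86-64-v4", frozenset({"avx512f", "avx512bw", "avx512cd",
                             "avx512dq", "avx512vl"})),
]


def microarch_level(cpu_features: Iterable[str]) -> str:
    """Return the highest level whose features, and those of all lower levels, are present."""
    have = {feature.lower() for feature in cpu_features}
    level = "x86-64"  # baseline is assumed, not checked
    for name, required in X86_64_LEVELS:
        if not required <= have:
            break  # levels are cumulative, so stop at the first gap
        level = name
    return level
```

Because the levels are cumulative, a CPU that reports the v3 features but lacks a v2 feature is still classified as baseline.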

@ngam commented Jul 31, 2023

Re trimming: yep, I instigated a fight on that front a while back, and we essentially discovered that we would likely need to go down the single-arch route.

I can see us potentially supporting ensembles of these arches. The problem is that we likely need more insight from practitioners for this to be practical and useful. When I was heavily involved in this a year ago, I privately started building for a single CPU (say, EPYC xyz with all its recommended flags) and targeting only A100 GPUs (with all their recommended flags). I soon discovered that the HPC systems I have access to had a stupid setup: compute nodes had A100s, so-called viz nodes had V100s, and some miscellaneous nodes still had K80s. You can see how this can be problematic, at least for one (small-ish) class of use cases. The other thing is that when compiling and training models, we have to be careful about interchangeability. Maybe the new Keras paradigm can help with that, but likely not…

@ngam commented Jul 31, 2023

There might be a way to optimize the compilation across arches (caching and reusing stuff, etc.), but I don't know enough about it.

@ngam commented Jul 31, 2023

Anyway, for this particular one, let's figure out how to set the arches appropriately and have someone build it. It’s only one binary, so not the worst… I guess I really should resubmit my stupid Singularity PR to get this going as we get more packages done for CUDA 12.

@h-vetinari (Member)

> I feel like we target quite old instruction sets and might benefit from a bump there too. Again, this may blow up the build matrix.

Yeah, this has been long overdue but stalled for a long time, though there has been movement recently: conda/ceps#59

@h-vetinari (Member)

> It’s only one binary, so not the worst…

Yeah, with this feedstock it's really just the wait for the compilation (🤞)

@hmaarrfk (Contributor)

log.txt

hmaarrfk merged commit 46eca4f into conda-forge:main on Jul 31, 2023
regro-cf-autotick-bot deleted the 3.1.0_ha6eafc branch on July 31, 2023 at 13:39
@jakirkham (Member)

Thanks all! 🙏

> Perhaps conda needs a virtual package for the cuda arch of the machine?

Would it make sense to convert this into a conda issue for further discussion?

@hmaarrfk (Contributor)

I feel like I don't yet have a concrete enough ask to keep the discussion focused.

@jakirkham (Member)

I think that is OK. There's value in tracking the general need, and we can refine the ask into actionable steps through discussion.

@h-vetinari (Member)

Thanks a lot @hmaarrfk!

@hmaarrfk (Contributor)

So a Threadripper 2950 isn't the best processor, but it still shows about 14 hours of CPU time...

@h-vetinari (Member)

Yeah, it seems that CUTLASS is a big baby...

Development

Successfully merging this pull request may close these issues:
  • cutlass 3.0
  • CUTLASS: Support CUDA 12

5 participants