-
Apologies for the provocative title. I will attempt to define more precisely what is meant by "interesting" parallel programs and what motivates the question. Running pipelines containing nontrivial data structures (including trees) and algorithms (mostly computational geometry) on GPUs has been my primary research focus for the past few years. I would love to see a higher-level language emerge that can express such computations concisely and run them efficiently.

An intermediate question is: will I be able to run a scan (generalized prefix sum) over an arbitrary monoid, as can readily be expressed in CUB? This is necessary but not sufficient for my purposes. I've been following MLIR some, and last I checked there is a vector.scan operation, but it's limited to vanilla arithmetic. One simple example of a non-commutative monoid is doing segmented reductions, in my case for computing bounding boxes. (For more background on segmented reduction, see Section 1.5 of Blelloch, and also an implementation in Futhark.) Beyond that, I'd like to be able to express an efficient implementation of the parentheses matching problem; see Voetter et al. for an implementation of the Bar-On and Vishkin algorithm in Futhark. Note also that Aaron Hsu has explored doing this problem using an APL-like array language, but that implementation is not work-efficient, nor does it scale well to arbitrary nesting depth. (I'm referring to papers in the literature to help motivate that these are general problems worth solving, but for those interested, I can also point to my own work, which is an implementation in portable compute shaders.)

A huge part of doing this well is working around the poor or broken execution model in the underlying GPU to the extent necessary. As one example, Metal Shading Language has a very weak memory model that cannot support message passing between workgroups, and thus cannot implement a single-pass scan such as the decoupled look-back algorithm. My ideal language would allow me to express the problem at a high level, and have the compiler select the least broken implementation strategy for the target hardware.

Another problem is that the classic grid-dispatch execution model basically forces you to allocate large buffers (the size of which may be difficult to predict in advance), and do wait-for-idle between stages. Far better would be to structure the computation as a pipeline where persistent threads of producer kernels fill queues, and those of consumer kernels drain them, as was presaged in the GRAMPS paper. The recent work graphs complication of the compute shader model is a step in this direction as well, but has serious limitations: no joins, fixed-size data only, and no ability to maintain ordering constraints. (The ability to call into an efficient general sort implementation may mitigate that last concern, but I'm not easily finding that within the MLIR universe either.)

I am more than happy to discuss such things in more detail, as well as supply pointers to existing implementation and exploration work. I don't think the "good parallel computer" has been built yet which would be a proper compilation target, though I am very hopeful about the particular strain of AI accelerators built as a grid of general-purpose CPUs. That type of hardware is calling for a suitable higher-level language, and I would love for Mojo to become that, but I can also see the challenges.
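To make the segmented-reduction example above concrete, here is a minimal CPU reference sketch (plain C++, not Mojo, MLIR, or CUB code; all type and function names are illustrative choices, not from any existing library). It computes per-segment bounding boxes as an inclusive scan over an associative but non-commutative monoid, which is exactly the kind of operator a fixed-function arithmetic scan cannot express.

```cpp
// CPU reference sketch of a segmented-reduction monoid for bounding boxes,
// expressed as the operator of an inclusive scan.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

struct BBox {
    float min_x, min_y, max_x, max_y;
};

// Union of two boxes: the underlying (commutative) reduction operator.
BBox merge(const BBox& a, const BBox& b) {
    return {std::min(a.min_x, b.min_x), std::min(a.min_y, b.min_y),
            std::max(a.max_x, b.max_x), std::max(a.max_y, b.max_y)};
}

// Scan element: a bbox plus a "starts a new segment" flag.
struct SegElem {
    bool head;  // true at the first primitive of a segment
    BBox box;
};

// The segmented-scan monoid (cf. Blelloch, Section 1.5). It is associative
// but NOT commutative: the right operand's head flag decides whether the
// accumulated box is kept or discarded.
SegElem seg_op(const SegElem& a, const SegElem& b) {
    return {a.head || b.head, b.head ? b.box : merge(a.box, b.box)};
}

int main() {
    // Two segments: elements {0,1,2} and {3,4}.
    std::vector<SegElem> in = {
        {true,  {0, 0, 1, 1}}, {false, {2, 2, 3, 3}}, {false, {-1, 0, 0, 1}},
        {true,  {5, 5, 6, 6}}, {false, {4, 4, 7, 7}},
    };
    std::vector<SegElem> out(in.size());
    // Inclusive scan with the segmented monoid; the last element of each
    // segment now holds that segment's bounding box.
    std::inclusive_scan(in.begin(), in.end(), out.begin(), seg_op);
    for (const auto& e : out)
        std::printf("head=%d box=(%g,%g)-(%g,%g)\n", e.head,
                    e.box.min_x, e.box.min_y, e.box.max_x, e.box.max_y);
}
```

On a GPU, the point is that this same user-defined operator would be handed to a library scan primitive; the question is whether the language and its standard library let me do that, rather than restricting scan to built-in arithmetic.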
-
Hi @raphlinus, sorry for the delay in responding. TL;DR: Yes, Mojo builds on a fully general runtime (at least on CPUs; certain other accelerators will have other hw limitations of course) and can express fully general patterns of parallelism, including tree and even unstructured parallelism.

The underlying runtime uses a lightweight workqueue-based thread pool, which is optimized for a 1-1 mapping to hardware threads and assumes that work items never block (which makes the execution model much simpler and more efficient than thread pools that try to "autoscale" kernel threads to support blocking tasks). As with most of the Mojo stdlib, the APIs for this are not really built out yet, but we'll do that as other higher-level language features get built in.
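For readers who haven't seen the pattern, here is a minimal sketch of a work-queue thread pool of the general shape described above: one worker per hardware thread, and a shared queue of short tasks that are assumed never to block. This is plain C++ and purely illustrative; it is not Mojo's actual runtime (which would presumably use per-thread work-stealing deques rather than a single locked queue), and every name in it is an assumption.

```cpp
// Illustrative non-blocking work-queue thread pool: fixed worker count,
// tasks run to completion, so one OS thread per core suffices.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkQueuePool {
public:
    explicit WorkQueuePool(unsigned n = std::thread::hardware_concurrency()) {
        if (n == 0) n = 1;
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~WorkQueuePool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    // Work items are assumed to be short and to never block, so the pool
    // never needs to grow beyond one thread per hardware thread.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();  // runs to completion; no blocking inside the task
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> queue_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```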
-
Excellent, a lot of what you say sounds good. A workqueue thread pool seems fundamental to CPU execution of these types of workloads, and having that in the runtime is promising. And if you can express a parallel program at a high level, capturing programmer intent, and map that to CPU execution, that's a good start.

That said, my primary interest is running these programs on GPUs, simply because they offer approximately an order of magnitude improvement in throughput per watt or throughput per dollar (which is basically the same thing) over CPU execution. There are many workloads today you wouldn't even consider running on CPU, and that includes both high-performance graphics rendering and most AI. And there are research languages that target these platforms; I'm thinking of Futhark, co-dfns, and to a lesser extent Circle.

Part of the context is that I'm writing a blog post about computational models that might be a suitable successor to the current mess that is GPU infrastructure. A major focus of that blog post will be "agility," which GPUs are bad at: they generally want to dispatch parallelepiped-shaped grids, with poor handling of sparse or dynamically shaped inputs. One direction is to sacrifice agility and focus entirely on matrix multiply and convolution throughput, which might be a suitable approach for many AI workloads. Another approach is to make the core compute engine even more agile, which is what we're seeing in the "grid of RISC-V vector cores" designs. To me, the idea of expanding the range of practical parallel computations beyond vanilla matrix multiply is exciting, but it blocks on software infrastructure, in particular a language that is good at expressing these kinds of computations.

Since the question is very broad, it might be useful to zero in on some specific questions:
These are hard problems, but if the answers are "yes," then Mojo becomes a lot more exciting to me, and I'd plan on exploring more deeply.
-
Ah, ok. We haven't announced Mojo's support for GPUs yet, but we're working on it and are very excited about the results we see so far :-). Our approach for all devices is generally two-level: 1) expose low-level access to and control over the hardware in whatever "native" way the device's programming model provides, and 2) provide higher-level abstractions, built as Mojo libraries on top of (1), that are more accessible than the low-level access.

For comparison, Triton-lang is an example of a simpler programming model that is popular in some camps, but it eliminates low-level access to the hardware and exposes a subset of the capabilities of a GPU. This is very useful, but we don't believe in taking away low-level access, and we also like things like this to be libraries - not entire compiler and ecosystem stacks.

In any case, I'd love to see your blog post when you have it ready. If you're interested, I put together this talk a few years ago which is related (but my thinking has absolutely evolved since then):

We also recently published this talk about Mojo and its internal design specifically. The slides are available here:

The video will be available when the LLVM folks get it posted.