Guidelines for writing scalable and portable Mojo code #164
-
Would it be reasonable for someone to write Mojo code once for, say, a multi-core CPU and expect the same code to perform well on an accelerator (perhaps with small modifications, but without requiring a rewrite)? This isn't immediately clear from the documentation, so it would really help to know whether there are any specific guidelines for writing portable Mojo code that scales well on different hardware. Is this even a reasonable expectation?

Eventually, would Mojo allow automatic optimization and parallelization of heavily numerical code via, say, polyhedral optimization? Would this be automatic, or would it require writing the code in a specific way? The current matrix multiplication example in the documentation does require writing the code in a specific way. Would that change so users could write it naively and expect performance similar to a specially hand-optimized version, or is the approach used in the documentation the best way to do so for the foreseeable future?
-
Yes, this is exactly the problem that Modular is solving with both their inference engine and Mojo; this blog post is dedicated to your question: https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication. There hasn't been any detail on polyhedral optimizations specifically, but LLVM is being used, so perhaps that's part of how they achieve the fastest matrix multiplication across different hardware types. Did you see `autotune` in the matmul notebook?
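Here's roughly what that looks like (a from-memory sketch in the early playground syntax, so names like `Matrix`, `nelts`, `autotune`, `tile`, and `parallelize` follow that notebook and may have changed since):

```mojo
from autotune import autotune
from algorithm import parallelize, tile

fn matmul_autotuned(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        @parameter
        fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):
            # The SIMD inner kernel for one (tile_x, tile_y) tile goes here.
            ...

        # Ask the compiler to build one variant per candidate tile size,
        # benchmark them, and keep the fastest for this machine.
        alias tile_size = autotune(1, 2, 4, 8, 16, 32)
        tile[calc_tile, nelts * tile_size, nelts * tile_size](A.cols, C.cols)

    # Spread rows across cores.
    parallelize[calc_row](C.rows)
```

The point is that the search over tile sizes happens at compile time, per target, rather than being hard-coded in the source.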
This kind of powerful compile-time metaprogramming could replace a lot of assembly targeting specific hardware. What you're raising will be a big focus for people using Mojo, and it's a large part of my interest in the language, so I expect a lot of content showing how to optimize across various hardware as the language is released and gains popularity.
-
Thanks for your response @mojodojodev. I've read almost every scrap of information on Mojo thus far, so let me elaborate a little here. :)

Firstly, IIRC, the blog post exclusively talked about multi-core CPUs, and yes, the results are very impressive. However, the code used is not the same as the matmul notebook on the playground and in the documentation. Furthermore, the considerations for an accelerator are sometimes quite different. I am aware that MLIR has been used by folks to optimize matmul (and other things) on a GPU, but the story from the Mojo side on this is not clear.

Secondly, if I want to start using Mojo to build foundational pieces for scientific computation, it is important to know the right set of abstractions to use. The example code in the documentation uses some well-known patterns, and Mojo makes it very easy to express code in that fashion, i.e. making SIMD operations a fundamental building block, then vectorization, tiling, auto-tuning (in addition to other optimizations), and finally parallelization across the cores. Making these first-class building blocks at the language level is very powerful indeed. However, it is possible that there are architectures where this may not work as well; I do not know if that is the case. I am also curious whether the Mojo developers believe that these are the right set of abstractions and that this is how we should think about writing high-performance code. Hence my question about the guidelines; a sketch of the pattern I have in mind is at the end of this comment.

Thirdly, in principle (and some have achieved this in practice with MLIR), one could go from a 5-line naive matmul implementation and automatically generate highly optimized matrix multiplication code for multi-core CPUs (or GPUs) using techniques that the Mojo developers clearly know intimately (polyhedral optimizations come to mind, and perhaps there are others). This begs the question: should a developer bother at all with SIMD, vectorization, parallelization, tiling, autotuning, etc. when all of this will be automated away by Mojo? In which case, are those basic building blocks not how one should be thinking of writing code?

Finally, the reason I ask this question early is that I would like to have some broad idea of the general principles and guidelines, so that I do not have to either wait until Mojo is fully released to start work or write code that I have to throw away later on. Thank you.
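To make the second point concrete, this is roughly the shape of code I mean, shown on something simpler than matmul (an elementwise add over a 2-D buffer). This is my own sketch in the early playground syntax, so `DTypePointer`, `simdwidthof`, `vectorize`, and `parallelize` are assumptions from that API and may differ from what ships later:

```mojo
from algorithm import parallelize, vectorize
from sys.info import simdwidthof

alias type = DType.float32
alias nelts = simdwidthof[type]()  # SIMD width of the target machine

fn add_2d(c: DTypePointer[type], a: DTypePointer[type], b: DTypePointer[type],
          rows: Int, cols: Int):
    @parameter
    fn calc_row(r: Int):
        @parameter
        fn add_simd[width: Int](col: Int):
            let i = r * cols + col
            c.simd_store[width](i, a.simd_load[width](i) + b.simd_load[width](i))

        # SIMD within a row ...
        vectorize[nelts, add_simd](cols)

    # ... and threads across rows.
    parallelize[calc_row](rows)
```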
-
Mojo's compiler is not going to be magic. If you write matmul as a triply nested for loop, you will get a triply nested for loop on all hardware (barring LLVM optimizations).
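To spell that out, something like this naive version (using the `Matrix` helper type from the matmul notebook) will be lowered essentially as written on every target:

```mojo
# The "triply nested for loop": correct everywhere, fast nowhere in particular.
fn matmul_naive(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]
```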
Mojo is a gateway to the whole MLIR ecosystem. It is entirely plausible that the matmul implementation for a particular piece of hardware just calls a few MLIR operations.
The general idea is that Mojo's compiler is not going to perform some magic to optimize the code you are generating, but the language provides all the facilities to write that magic in a portable way as just Mojo code. Today, that magic is bundled into a handful of higher-order functions, like `vectorize` and `parallelize`.
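For instance, nothing stops you from writing such a helper yourself. Here is a hand-rolled sketch (not the standard-library implementation, and the parametric-closure syntax shown is the early playground spelling, which may change) of what a portable `vectorize`-style higher-order function can look like in plain Mojo:

```mojo
# Apply `func` in SIMD-width chunks, then finish the remainder one
# element at a time. The SIMD width is just a parameter, so the same
# source adapts to whatever width the target hardware supports.
fn my_vectorize[simd_width: Int, func: fn[width: Int](Int) capturing -> None](size: Int):
    let vector_end = (size // simd_width) * simd_width
    for i in range(0, vector_end, simd_width):
        func[simd_width](i)        # full-width iterations
    for i in range(vector_end, size):
        func[1](i)                 # scalar tail
```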
-
Unfortunately no, we can only say that we're working on accelerators and that this is core to the mission, but we can't talk about it until we're ready to talk about it :)