Guidelines for writing scalable and portable Mojo code #164
-
Would it be reasonable for someone to write Mojo code once for, say, a multi-core CPU and expect the same code to perform well on an accelerator (perhaps with small modifications, but without requiring a rewrite)? This isn't immediately clear from the documentation, so it would really help to know whether there are any specific guidelines for writing portable Mojo code that scales well on different hardware. Is this even a reasonable expectation?

Eventually, would Mojo allow automatic optimization and parallelization of heavily numerical code via, say, polyhedral optimization? Would this be automatic, or would it require writing the code in a specific way? The current matrix multiplication example in the documentation does require writing the code in a specific way. Would that change so users could write it naively and expect performance similar to a specially hand-optimized version, or is the approach used in the documentation the best way to do so for the foreseeable future?
-
Yes, this is exactly the problem that Modular is solving with both their inference engine and Mojo; this blog post is dedicated to your question: https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication. There hasn't been any detail on polyhedral optimizations specifically, but LLVM is being used, so perhaps that's part of how they achieve the fastest matrix multiplication across different hardware types. Did you see `autotune` in the matmul notebook?
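Here's roughly what that looks like (a from-memory sketch in the early playground syntax, so names like `Matrix`, `nelts`, `autotune`, `tile`, and `parallelize` follow that notebook and may have changed since):

```mojo
from autotune import autotune
from algorithm import parallelize, tile

fn matmul_autotuned(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        @parameter
        fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):
            # The SIMD inner kernel for one (tile_x, tile_y) tile goes here.
            ...

        # Ask the compiler to build one variant per candidate tile size,
        # benchmark them, and keep the fastest for this machine.
        alias tile_size = autotune(1, 2, 4, 8, 16, 32)
        tile[calc_tile, nelts * tile_size, nelts * tile_size](A.cols, C.cols)

    # Spread rows across cores.
    parallelize[calc_row](C.rows)
```

The point is that the search over tile sizes happens at compile time, per target, rather than being hard-coded in the source.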
This kind of powerful compile-time metaprogramming could replace a lot of assembly targeting specific hardware. What you're raising will be a big focus for people using Mojo, and it's a large part of my interest in the language, so I expect a lot of content showing how to optimize across various hardware as the language is released and gains popularity.
-
Thanks for your response @mojodojodev. I've read almost every scrap of information on Mojo thus far, so let me elaborate a little here. :)

Firstly, IIRC, the blog post exclusively talked about multi-core CPUs, and yes, the results are very impressive. However, the code used is not the same as the matmul notebook on the playground and in the documentation. Furthermore, the considerations for an accelerator are sometimes quite different. I am aware that MLIR has been used by folks to optimize matmul (and other things) on a GPU, but the story from the Mojo side on this is not clear.

Secondly, if I want to start using Mojo to build foundational pieces for scientific computation, it is important to know the right set of abstractions to use. The example code in the documentation uses some well-known patterns, and Mojo makes it very easy to express code in that fashion, i.e. making SIMD operations a fundamental building block, then vectorization, tiling, auto-tuning (in addition to other optimizations), and finally parallelization across the cores. Making these first-class building blocks at the language level is very powerful indeed. However, it is possible that there are architectures where this may not work as well; I do not know if that is the case. I am also curious whether the Mojo developers believe that these are the right set of abstractions and that this is how we should think about writing high-performance code. Hence my question about the guidelines; a sketch of the pattern I have in mind is at the end of this comment.

Thirdly, in principle (and some have achieved this in practice with MLIR), one could go from a 5-line naive matmul implementation and automatically generate highly optimized matrix multiplication code for multi-core CPUs (or GPUs) using techniques that the Mojo developers clearly know intimately (polyhedral optimizations come to mind, and perhaps there are others). This begs the question: should a developer bother at all with SIMD, vectorization, parallelization, tiling, autotuning, etc. when all of this will be automated away by Mojo? In which case, are those basic building blocks not how one should be thinking of writing code?

Finally, the reason I ask this question early is that I would like to have some broad idea of the general principles and guidelines, so that I do not have to either wait until Mojo is fully released to start work or write code that I have to throw away later on. Thank you.
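To make the second point concrete, this is roughly the shape of code I mean, shown on something simpler than matmul (an elementwise add over a 2-D buffer). This is my own sketch in the early playground syntax, so `DTypePointer`, `simdwidthof`, `vectorize`, and `parallelize` are assumptions from that API and may differ from what ships later:

```mojo
from algorithm import parallelize, vectorize
from sys.info import simdwidthof

alias type = DType.float32
alias nelts = simdwidthof[type]()  # SIMD width of the target machine

fn add_2d(c: DTypePointer[type], a: DTypePointer[type], b: DTypePointer[type],
          rows: Int, cols: Int):
    @parameter
    fn calc_row(r: Int):
        @parameter
        fn add_simd[width: Int](col: Int):
            let i = r * cols + col
            c.simd_store[width](i, a.simd_load[width](i) + b.simd_load[width](i))

        # SIMD within a row ...
        vectorize[nelts, add_simd](cols)

    # ... and threads across rows.
    parallelize[calc_row](rows)
```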
-
Mojo's compiler is not going to be magic. If you write matmul as a triply nested for loop, you will get a triply nested for loop on all hardware (barring LLVM optimizations).
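To spell that out, something like this naive version (using the `Matrix` helper type from the matmul notebook) will be lowered essentially as written on every target:

```mojo
# The "triply nested for loop": correct everywhere, fast nowhere in particular.
fn matmul_naive(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]
```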
Mojo is a gateway to the whole MLIR ecosystem. It is entirely plausible that the matmul implementation for a particular piece of hardware just calls a few MLIR operations.
The general idea is that Mojo's compiler is not going to perform some magic to optimize the code you are generating, but the language provides all the facilities to write that magic in a portable way as just Mojo code. Today, that magic is bundled into a handful of higher-order functions, like `vectorize` and `parallelize`.
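For instance, nothing stops you from writing such a helper yourself. Here is a hand-rolled sketch (not the standard-library implementation, and the parametric-closure syntax shown is the early playground spelling, which may change) of what a portable `vectorize`-style higher-order function can look like in plain Mojo:

```mojo
# Apply `func` in SIMD-width chunks, then finish the remainder one
# element at a time. The SIMD width is just a parameter, so the same
# source adapts to whatever width the target hardware supports.
fn my_vectorize[simd_width: Int, func: fn[width: Int](Int) capturing -> None](size: Int):
    let vector_end = (size // simd_width) * simd_width
    for i in range(0, vector_end, simd_width):
        func[simd_width](i)        # full-width iterations
    for i in range(vector_end, size):
        func[1](i)                 # scalar tail
```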
-
Unfortunately no, we can only say that we're working on accelerators and that this is core to the mission, but we can't talk about it until we're ready to talk about it :)