
WIP: save notes.
Odd Kiva committed Jan 9, 2024
1 parent 083dd15 commit b78f254
Showing 2 changed files with 174 additions and 62 deletions.
4 changes: 4 additions & 0 deletions doc/book/random/python_code_that_suck_less.Rmd
@@ -326,6 +326,10 @@ const auto& x_ijk = x[{i, j, k}];
This is aesthetically better and more meaningful than
```{cpp}
const auto& xijk = ...; // too crammed
// The following is fine too as typing the underscore symbol can be a hassle.
// I am not a snake_case zealot either...
const auto xi = x[i];
```

or the ugly
232 changes: 170 additions & 62 deletions doc/book/random/vector_intrinsics.Rmd
@@ -3,9 +3,11 @@
Sounds boring? I promise you this is going to be way more interesting than you
might think. There are quite a few interesting things to learn.

Story time before we dive into the implementation of the Gaussian blur.

## Story Time

Once I applied for a C++ technical leadership role at some company. I was
considered for the role after a preliminary behavioral screening by the CTO. He
then told me who I would be interviewing with in the next interviews. A few days
later, he called me back for a quick chat. He ended up telling me they would not
@@ -14,56 +16,85 @@ explaining why, I guessed that probably one of their senior engineers
disqualified me because he cared more about my ability to understand and
manipulate CPU vector instructions.

The CTO profusely apologized to me. He said politely that I certainly was
gifted but my C++ code was not up to their standard. From what I guessed, they
probably found that my image processing code was implemented too naively.

This guess was based on the fact that the CTO told me they were doing lots of
optimization involving CPU vector intrinsics in order to process data as fast
as possible. He added that it really was not that difficult and that I could
get up to speed fairly quickly.

I have mostly an applied math background, so it did sound unfair to me. It
made me feel that you are never enough, whatever you achieve. Like you need to
know every single aspect of engineering, from high level to low level. In
hindsight, I would not have been happy with the job anyway. Still frustrating...
In that moment, I was telling myself: what can you do when the door is already
being shut on you? Going by the same logic, if David Lowe showcased his SIFT
research code, he would not qualify for the job either: I learnt from studying
his code.

Right, back to the topic: how can we achieve that today? Nowadays, we have
some answers with Halide. And we can do it very elegantly without coding
directly in assembly.

## What is Halide?

With Halide you can write an image processing filter and optimize the way it is run.

You write sets of arithmetic operations that operate on image buffers, together
with specific parallelism patterns (multicore and vectorization). Then you can
call a C++ method that compiles the optimized image filter into a C++ static
library.

The main beauty with Halide is that you can decouple:

1. the algorithm: the separable convolution, and
2. the scheduling strategy: the parallelism strategy that makes it as fast as,
   if not faster than, the baseline: OpenCV's Gaussian blur.

Halide can check and guarantee that your algorithm remains correct under the
schedule you are implementing.
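To make the decoupling concrete, here is a minimal sketch of my own, assuming
Halide's JIT API and a toy brightening filter (the names and sizes are my
inventions, not part of the Gaussian blur code below):

```{cpp}
#include <Halide.h>

int main()
{
  auto x = Halide::Var{"x"};
  auto y = Halide::Var{"y"};

  // 1. The algorithm: what we compute.
  auto brighter = Halide::Func{"brighter"};
  brighter(x, y) = Halide::cast<uint8_t>(Halide::min(x + y + 50, 255));

  // 2. The schedule: how we compute it. The algorithm above is untouched.
  brighter.parallel(y).vectorize(x, 8);

  // JIT-compile and run on a 640x480 grid.
  Halide::Buffer<uint8_t> output = brighter.realize({640, 480});
}
```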

With Halide, you won't write any nested `for` loops or hand-rolled
multi-threaded processing with bound checking. You can express your ideas at a
higher level.

Halide abstracts these parallelisms for you and supports a wide range of
platforms. You don't have to be an expert in CPU vector intrinsics, but you do
need to know the scheduling strategies to, say, optimize the speed at which
your convolutional operations run. Halide has identified the most common
schedule patterns that are used to optimize image processing code.

You still need to skim the publications and the presentations, and practise.
But in my experience, it is still difficult for the layman or the novice to
identify the schedule patterns that work and those that don't.
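For reference, here is a rough sketch of that scheduling vocabulary, the same
handful of primitives we will use below (`f` and `g` are placeholder
`Halide::Func`s):

```{cpp}
f.compute_root();    // compute f entirely, once, before all its consumers
f.parallel(y);       // distribute the rows of f across the CPU cores
f.vectorize(x, 8);   // process 8 lanes of x with vector instructions
f.tile(x, y, xo, yo, xi, yi, 64, 64);  // split the domain into 64x64 tiles
f.compute_at(g, xi); // compute f lazily, right where g consumes it
```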

## Naive C++ Implementation of the Gaussian Blur

Right, let's rewind back in time and imagine myself more than 15 years ago:
freshly graduated with a Master's degree, looking for an internship. I am a bit
aware of CPU registers but I can't see how they fit in the grand scheme of
things. I've heard of vector instructions but never understood what they were
about. As an avid Internet reader, I have been sold the big lie: don't
prematurely optimize, just let the compiler do its job, and just make things
work.

I did learn in class that the Gaussian filter is separable: don't write the 2D
convolution naively, exploit its separable property.
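To spell out the reasoning: the 2D Gaussian kernel factors into the product of
two 1D kernels, so for a kernel of width $k$, the cost per pixel drops from
$O(k^2)$ to $O(2k)$:

$$
G_\sigma(x, y) = g_\sigma(x)\, g_\sigma(y)
\quad\Longrightarrow\quad
I * G_\sigma = (I *_x g_\sigma) *_y g_\sigma.
$$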

I am proudly exhibiting a naive implementation in C++ in my Sara repo,
something along the lines of the code below. Icing on the cake, I have
discovered multicore parallelism via OpenMP.

```{Rcpp}
auto conv_x(const float *in, const float *kernel, float *out,
            const int w, const int h, const int ksz) -> void
{
  const auto r = ksz / 2;
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (auto y = 0; y < h; ++y)
  {
    for (auto x = 0; x < w; ++x)
@@ -88,11 +119,19 @@ auto conv_x(const float *in, const float *kernel, float *out,
    }
  }
}
```

We execute the same dance for the y-convolution:

```{Rcpp}
auto conv_y(const float *in, const float *kernel, float *out,
            const int w, const int h, const int ksz) -> void
{
  const auto r = ksz / 2;
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (auto y = 0; y < h; ++y)
  {
    for (auto x = 0; x < w; ++x)
@@ -119,27 +158,32 @@ auto conv_y(const float *in, const float *kernel, float *out,
}
```

I diligently write unit tests, validate on synthetic tests, check the boundary
conditions, and try it on a real image with a Gaussian kernel. I am happy that
it works reasonably fast when compiling in Release mode on Visual Studio. Job
done! I am proudly showcasing my code on GitHub. I never complained about its
speed: whenever real-time performance mattered, I would later learn CUDA and
write the filter in CUDA anyway.


## Halide Implementation of the Algorithm

Back to the time where the CTO and his minions tell you that you are not good
enough, without elaborating why or what they were expecting to see.

Fine! Let's see how we could optimize the code...

OK, how? I vaguely understand that you have to write the code with CPU
intrinsics? Should I write it in C-style or in ASM? And how do you do it with
limited bandwidth, after finishing your day job, while wanting to learn?

Use Halide? Yes! It is a very elegant language. It can also compile your code
directly to OpenCL, Vulkan, or Direct3D bitcode. Here is how we could rewrite
the 2D separable convolution.

Let us write the implementation first.

```{cpp}
#include <Halide.h>
auto x = Halide::Var{"x"};
@@ -155,31 +199,94 @@ auto tile = Halide::Var{"tile"};
auto input = Halide::Func{"input"};
auto kernel = Halide::Func{"kernel"};
static constexpr auto sigma = 3.f;
static constexpr auto truncation_factor = 4.f;
static constexpr auto ksz = static_cast<int>(2 * sigma * truncation_factor);
static constexpr auto kr = ksz / 2;
input(x, y) = x + y;  // a dummy synthetic input
kernel(x) = Halide::exp(-x * x / (2.f * sigma * sigma));  // unnormalized Gaussian
auto conv_x = Halide::Func{"conv_x"};
auto conv_y = Halide::Func{"conv_y"};
auto k = Halide::RDom(0, ksz, "k");
// The algorithm
conv_x(x, y) = Halide::sum(input(x + k - kr, y) * kernel(k));
conv_y(x, y) = Halide::sum(conv_x(x, y + k - kr) * kernel(k));
```
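As a sanity check, and assuming Halide's JIT API, we could realize the
algorithm on a small grid before worrying about the schedule (a sketch; the
128x128 size is arbitrary):

```{cpp}
// JIT-compile and evaluate the pipeline on a 128x128 grid.
Halide::Buffer<float> output = conv_y.realize({128, 128});
```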

## Schedule Optimization

### Two Types of Parallelism on CPUs

There are two types of parallelism on the CPU.

1. Multicore processing:
   This is straightforward to understand and is about keeping all the CPU
   cores as busy as possible with minimal data sharing. OpenMP is simple and
   helps to parallelize image filters quite easily once we identify the parts
   of the algorithm that operate independently.

2. Vector instructions:
   Until I implemented filters with Halide, I could not understand what CPU
   vector instructions were really about.
   I am not going to pretend to be an expert in CPU optimization, but this
   little paragraph should convince you why it is so interesting to apply
   vector instructions wherever possible.
   As a first approximation, a CPU vector instruction typically enables the
   programmer to perform arithmetic operations on small vectors in a single
   CPU cycle. Typically, arithmetic operations such as addition and
   multiplication can operate on 4D vectors. That is where we can observe an
   additional 4x speed-up, or more if your CPU can process those operations on
   bigger vectors. See the sketch after this list.
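To make the second point concrete, here is a minimal sketch with x86 SSE
intrinsics, which is my own illustration and not tied to the Gaussian blur:
the four additions of the scalar loop collapse into a single `_mm_add_ps`
instruction.

```{cpp}
#include <immintrin.h>

// Scalar version: 4 additions, one per loop iteration.
void add4_scalar(const float* a, const float* b, float* out)
{
  for (int i = 0; i < 4; ++i)
    out[i] = a[i] + b[i];
}

// Vectorized version: the 4 additions execute as one SIMD instruction.
void add4_simd(const float* a, const float* b, float* out)
{
  const __m128 va = _mm_loadu_ps(a);       // load 4 floats
  const __m128 vb = _mm_loadu_ps(b);       // load 4 floats
  _mm_storeu_ps(out, _mm_add_ps(va, vb));  // 4 sums in one instruction
}
```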

For more accurate and more comprehensive information, I will simply encourage
you to do your own research and share what you have learnt.

Like image filters, BLAS routines make extensive use of both types of
parallelism on CPU platforms.

### Schedule 1

The obvious strategy is to start from the idea of separable convolution and
vectorize the convolution wherever possible.

```{cpp}
kernel.compute_root();  // precompute the kernel values once for all
conv_x
    .compute_root()     // compute the full x-convolved image first
    .parallel(y)        // one row per CPU core
    .vectorize(x, 32, Halide::TailStrategy::GuardWithIf);
conv_y
    .parallel(y)
    .vectorize(x, 32, Halide::TailStrategy::GuardWithIf);
```

Then you decide to compete with OpenCV, and you realize that it crushes your
implementation by being 2x faster.

In fact, the optimal schedule on CPU is really not obvious, as exposed in the
Halide presentations. Once you dig into the Halide publications, you start to
understand how much work and expertise it takes to optimize a typical image
processing filter: one Photoshop filter reportedly took 3 months of hard work
to optimize with CPU vector intrinsics.

The first step to achieve this is to divide the final convolved image into
tiles of 64 by 64 pixels. The set of tiles can be seen as an input batch of
many smaller images. The output is another batch of image tiles with the same
sizes (let's just assume that the image width and height are multiples of 64).

Because each output image tile is independent of the others, we can calculate
a smaller x-convolution and y-convolution per tile. For each tile, the
intermediate x-convolution fits in the CPU cache, which improves data locality.
Then we exploit the CPU vector instructions to calculate the convolution in
batches.

### Schedule 2: Parallel tile processing, vectorize, CPU cache fitting

The scheduling goes as follows:
```{cpp}
// The schedule
kernel.compute_root();
@@ -188,20 +295,19 @@ conv_y //
    .tile(x, y, xo, yo, xi, yi, 64, 64)  // split the output into 64x64 tiles
    .fuse(xo, yo, tile)                  // flatten the 2D tile index into one
    .parallel(tile)                      // one tile per CPU core
    .vectorize(xi, 4, Halide::TailStrategy::GuardWithIf);
conv_x
    .compute_at(conv_y, xi)  // compute conv_x only where conv_y needs it
    .vectorize(x, 4, Halide::TailStrategy::GuardWithIf);
```

#### Second-Guessing What Halide Does
There is a lot to unpack here. Let's break it down bit by bit.

The line
```{cpp}
conv_y.tile(x, y, xo, yo, xi, yi, 64, 64, Halide::TailStrategy::GuardWithIf)
.fuse(xo, yo, tile_index)
.parallel(tile_index);
```
@@ -211,7 +317,7 @@
translates in C++ as follows. This is according to my second-guessing:
```{cpp}
#pragma omp parallel for  // .parallel(tile_index)
for (auto tile_index = 0; tile_index < T; ++tile_index)
{
  // ...
```
@@ -281,14 +387,16 @@ spend some time to learn how to actually read assembly code.
The Halide developers have done a very good job of helping us understand what
assembly code is generated, by mapping the assembly code to the pseudo-code.

```{cpp}
// Generate the lowered statement as an HTML report as follows.
conv_y.compile_to_lowered_stmt("separable_conv_2d.stmt.html", {}, Halide::HTML);
```

Then it becomes clear that the convolution is computed in batches, where we
calculate 4, 8, or 16 convolved values at the same time by repeating the
vectorized fused multiply-add `fmla.4s` instruction.


References:
- https://diveintosystems.org/book/C9-ARM64/common.html
- https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en
- https://developer.arm.com/documentation/
