Commit

carstenbauer committed Feb 6, 2024
1 parent e346d5c commit ea1690d
Showing 2 changed files with 266 additions and 39 deletions.
125 changes: 107 additions & 18 deletions docs/src/examples/tls/tls.jl
@@ -1,6 +1,16 @@
using OhMyThreads: TaskLocalValue, tmap, chunks
# # Task-Local Storage
#
# For some programs, it can be useful or even necessary to allocate and (re-)use memory in
# your parallel code. The following section uses a simple example to explain how task-local
# values can be efficiently created and (re-)used.
#
# ## Sequential
#
# Let's say that we are given two arrays of (square) matrices, `As` and `Bs`, and let's
# further assume that our goal is to compute, for each pair, the sum of the elements of
# the matrix product `A*B`.
# We can readily implement a (sequential) function that performs the necessary computations.
using LinearAlgebra: mul!, BLAS
using Base.Threads: nthreads, @spawn
BLAS.set_num_threads(1) # for simplicity, we turn off OpenBLAS multithreading

function matmulsums(As, Bs)
N = size(first(As), 1)
@@ -11,6 +21,30 @@ function matmulsums(As, Bs)
end
end

# Here, we use `map` to perform the desired operation for each pair of matrices,
# `A` and `B`. However, the crucial point for our discussion is that we use the in-place
# matrix multiplication `LinearAlgebra.mul!` in conjunction with a pre-allocated output
# matrix `C`. This is to avoid the temporary allocation per "iteration" (i.e. per matrix
# pair) that we would get with `C = A*B`.
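#
# For reference, here is a minimal, hypothetical sketch of what such a sequential
# implementation can look like (the function name and exact body are illustrative,
# not taken verbatim from the package):

function matmulsums_sketch(As, Bs)
    N = size(first(As), 1)
    C = Matrix{Float64}(undef, N, N) # a single pre-allocated output buffer
    map(As, Bs) do A, B
        mul!(C, A, B) # writes A*B into C in place, avoiding a per-pair allocation
        sum(C)
    end
end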
#
# For later comparison, we generate some random input data and store the result.

As = [rand(1024, 1024) for _ in 1:64]
Bs = [rand(1024, 1024) for _ in 1:64]

res = matmulsums(As, Bs);

# ## Parallelization
#
# The key idea for creating a parallel version of `matmulsums` is to replace the `map` by
# OhMyThreads' parallel [`tmap`](@ref) function. However, because we re-use `C`, this isn't
# entirely trivial.
#
# ### The wrong way
#
# Someone new to parallel computing might be tempted to parallelize `matmulsums` like so:
using OhMyThreads: tmap

function matmulsums_race(As, Bs)
N = size(first(As), 1)
C = Matrix{Float64}(undef, N, N)
@@ -20,6 +54,20 @@ function matmulsums_race(As, Bs)
end
end

# Unfortunately, this doesn't produce the correct result.

res_race = matmulsums_race(As, Bs)
res ≈ res_race

# In fact, it doesn't even always produce the same result (check for yourself)!
# The reason is a race condition: different parallel tasks try to use the shared
# variable `C` simultaneously, leading to non-deterministic behavior.
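#
# One hypothetical way to check this (assuming the data defined above) is to simply
# re-run the racy version a few times; due to the race, the checks will typically
# not all be `true`:

race_checks = [matmulsums_race(As, Bs) ≈ res for _ in 1:5]

# Let's see how we can fix this.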
#
# ### The naive (and inefficient) way
#
# A simple solution for the race condition issue above is to move the allocation of `C`
# into the body of the parallel `tmap`:
function matmulsums_naive(As, Bs)
N = size(first(As), 1)
tmap(As, Bs) do A, B
@@ -29,23 +77,61 @@ function matmulsums_naive(As, Bs)
end
end

# In this case, a separate `C` will be allocated for each iteration such that parallel tasks
# don't modify shared state anymore. Hence, we'll get the desired result.
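#
# For concreteness, a hypothetical sketch of such a naive variant (illustrative only,
# with the allocation of `C` moved into the `tmap` closure):

function matmulsums_naive_sketch(As, Bs)
    N = size(first(As), 1)
    tmap(As, Bs) do A, B
        C = Matrix{Float64}(undef, N, N) # fresh allocation for every matrix pair
        mul!(C, A, B)
        sum(C)
    end
end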

res_naive = matmulsums_naive(As, Bs)
res ≈ res_naive

# However, this variant is obviously inefficient because it is no better than just writing
# `C = A*B` and thus leads to one allocation per matrix pair. We need a different way of
# allocating and re-using `C` for an efficient parallel version.
#
# ## The right way: `TaskLocalValue`
#
# We've seen that we can't allocate `C` once up-front (→ race condition) and also shouldn't
# allocate it within the `tmap` (→ one allocation per iteration). What we actually want is
# to allocate a separate `C` once on each parallel task and then re-use this **task-local**
# `C` for all iterations (i.e. matrix pairs) that said task is responsible for.
#
# The way to express this idea is `TaskLocalValue` and looks like this:
using OhMyThreads: TaskLocalValue

function matmulsums_tls(As, Bs)
N = size(first(As), 1)
tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
tmap(As, Bs) do A, B
C = tls[]
mul!(C, A, B)
sum(C)
end
end

res_tls = matmulsums_tls(As, Bs)
res ≈ res_tls

# Here, `TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))` defines a
# task-local storage `tls` that behaves as follows: the first time the storage is accessed
# (`tls[]`) from a task, a task-local value is created according to the anonymous function
# (here, the task-local value will be a matrix) and cached in the storage. Afterwards,
# every other storage query from the same task(!) simply returns the task-local value.
# Hence, this is precisely what we need and only leads to O(# parallel tasks)
# allocations.
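#
# As a minimal, hypothetical illustration of these semantics (the concrete type and
# initializer here are just for demonstration):

tlv = TaskLocalValue{Vector{Float64}}(() -> zeros(3))
tlv[]           # first access on this task: the initializer runs and the value is cached
tlv[] === tlv[] # later accesses on the same task return the very same cached object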
#
# ## The performant but cumbersome way
#
# Before we benchmark and compare the performance of all discussed variants, let's implement
# the idea of a task-local `C` for each parallel task manually.
using OhMyThreads: chunks, @spawn
using Base.Threads: nthreads

function matmulsums_manual(As, Bs)
N = size(first(As), 1)
tasks = map(chunks(As; n = nthreads())) do idcs
@spawn begin
local C = Matrix{Float64}(undef, N, N)
local results = Vector{Float64}(undef, length(idcs))
for (i, idx) in enumerate(idcs)
mul!(C, As[idx], Bs[idx])
results[i] = sum(C)
end
@@ -55,25 +141,28 @@ function matmulsums_manual(As, Bs)
reduce(vcat, fetch.(tasks))
end

BLAS.set_num_threads(1) # to avoid potential oversubscription

As = [rand(1024, 1024) for _ in 1:64]
Bs = [rand(1024, 1024) for _ in 1:64]

res = matmulsums(As, Bs)
res_race = matmulsums_race(As, Bs)
res_naive = matmulsums_naive(As, Bs)
res_tls = matmulsums_tls(As, Bs)
res_manual = matmulsums_manual(As, Bs)

res ≈ res_race
res ≈ res_naive
res ≈ res_tls
res ≈ res_manual

# The first thing to note is pretty obvious: This is very cumbersome and you probably don't
# want to write it. But let's take a closer look and see what's happening here.
# First, we divide the number of matrix pairs into `nthreads()` chunks. Then, for each of
# those chunks, we spawn a parallel task that (1) allocates a task-local `C` matrix (and a
# `results` vector) and (2) performs the actual computations using these pre-allocated
# values. Finally, we `fetch` the results of the tasks and combine them.
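#
# As a small, hypothetical illustration of what `chunks` does (assuming the default
# splitting strategy), it partitions an index range into roughly equal contiguous pieces:

collect(chunks(1:10; n = 4)) # e.g. four contiguous index ranges covering 1:10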
#
# ## Benchmark
#
# The whole point of parallelization is increasing performance, so let's benchmark and
# compare the performance of the variants discussed above.

using BenchmarkTools

@btime matmulsums($As, $Bs);
@btime matmulsums_naive($As, $Bs);
@btime matmulsums_tls($As, $Bs);
@btime matmulsums_manual($As, $Bs);

# As we see, the recommended version `matmulsums_tls` is both convenient and
# efficient: it allocates much less memory than `matmulsums_naive` and only slightly
# more than the manual implementation.