
Commit

build based on c05ffa9
Documenter.jl committed Feb 6, 2024
1 parent 51b9a1f commit 9f44df2
Showing 11 changed files with 145 additions and 65 deletions.
2 changes: 1 addition & 1 deletion previews/PR38/.documenter-siteinfo.json
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-05T17:58:45","documenter_version":"1.2.1"}}
{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-06T16:18:05","documenter_version":"1.2.1"}}
2 changes: 1 addition & 1 deletion previews/PR38/examples/integration/integration/index.html
@@ -22,4 +22,4 @@
@btime trapezoidal(0, 1, $N);
@btime trapezoidal_parallel(0, 1, $N);</code></pre><pre><code class="nohighlight hljs"> 13.871 ms (0 allocations: 0 bytes)
2.781 ms (38 allocations: 3.19 KiB)
</code></pre><p>Because the problem is trivially parallel - all threads do the same thing and don&#39;t need to communicate - we expect an ideal speedup of (close to) the number of available threads.</p><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../juliaset/juliaset/">« Julia Set</a><a class="docs-footer-nextpage" href="../../../translation/">Translation Guide »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Monday 5 February 2024 17:58">Monday 5 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
</code></pre><p>Because the problem is trivially parallel - all threads do the same thing and don&#39;t need to communicate - we expect an ideal speedup of (close to) the number of available threads.</p><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../juliaset/juliaset/">« Julia Set</a><a class="docs-footer-nextpage" href="../../../translation/">Translation Guide »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Tuesday 6 February 2024 16:18">Tuesday 6 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
2 changes: 1 addition & 1 deletion previews/PR38/examples/juliaset/juliaset/index.html
@@ -52,4 +52,4 @@
63.707 ms (39 allocations: 3.30 KiB)
</code></pre><p>As hoped, the parallel implementation is faster. But can we improve the performance further?</p><h3 id="Tuning-nchunks"><a class="docs-heading-anchor" href="#Tuning-nchunks">Tuning <code>nchunks</code></a><a id="Tuning-nchunks-1"></a><a class="docs-heading-anchor-permalink" href="#Tuning-nchunks" title="Permalink"></a></h3><p>As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase <code>nchunks</code> to a value larger than <code>nthreads</code>. This divides the overall workload into smaller tasks that can be dynamically distributed among threads (by Julia&#39;s scheduler) to balance the per-thread load.</p><pre><code class="language-julia hljs">@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;</code></pre><pre><code class="nohighlight hljs"> 32.000 ms (12013 allocations: 1.14 MiB)
</code></pre><p>Note that if we opt out of dynamic scheduling and set <code>schedule=:static</code>, this strategy doesn&#39;t help anymore (because chunks are naively distributed up front).</p><pre><code class="language-julia hljs">@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;</code></pre><pre><code class="nohighlight hljs"> 63.439 ms (42 allocations: 3.37 KiB)
</code></pre><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../mc/mc/">« Parallel Monte Carlo</a><a class="docs-footer-nextpage" href="../../integration/integration/">Trapezoidal Integration »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Monday 5 February 2024 17:58">Monday 5 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
</code></pre><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../mc/mc/">« Parallel Monte Carlo</a><a class="docs-footer-nextpage" href="../../integration/integration/">Trapezoidal Integration »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Tuesday 6 February 2024 16:18">Tuesday 6 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
2 changes: 1 addition & 1 deletion previews/PR38/examples/mc/mc/index.html
@@ -51,4 +51,4 @@

@btime mc($(length(idcs))) samples=10 evals=3;</code></pre><pre><code class="nohighlight hljs"> 87.617 ms (0 allocations: 0 bytes)
63.398 ms (0 allocations: 0 bytes)
</code></pre><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../../">« OhMyThreads</a><a class="docs-footer-nextpage" href="../../juliaset/juliaset/">Julia Set »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Monday 5 February 2024 17:58">Monday 5 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
</code></pre><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../../">« OhMyThreads</a><a class="docs-footer-nextpage" href="../../juliaset/juliaset/">Julia Set »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Tuesday 6 February 2024 16:18">Tuesday 6 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
125 changes: 107 additions & 18 deletions previews/PR38/examples/tls/tls.jl
@@ -1,6 +1,16 @@
using OhMyThreads: TaskLocalValue, tmap, chunks
# # Task-Local Storage
#
# For some programs, it can be useful or even necessary to allocate and (re-)use memory in
# your parallel code. The following section uses a simple example to explain how task-local
# values can be efficiently created and (re-)used.
#
# ## Sequential
#
# Let's say that we are given two arrays of (square) matrices, `As` and `Bs`, and let's
# further assume that our goal is to compute the total sum of all pairwise matrix products.
# We can readily implement a (sequential) function that performs the necessary computations.
using LinearAlgebra: mul!, BLAS
using Base.Threads: nthreads, @spawn
BLAS.set_num_threads(1) # for simplicity, we turn off OpenBLAS multithreading

function matmulsums(As, Bs)
N = size(first(As), 1)
@@ -11,6 +21,30 @@ function matmulsums(As, Bs)
end
end

# Here, we use `map` to perform the desired operation for each pair of matrices,
# `A` and `B`. However, the crucial point for our discussion is that we use the in-place
# matrix multiplication `LinearAlgebra.mul!` in conjunction with a pre-allocated output
# matrix `C`. This is to avoid the temporary allocation per "iteration" (i.e. per matrix
# pair) that we would get with `C = A*B`.
#
# For later comparison, we generate some random input data and store the result.

As = [rand(1024, 1024) for _ in 1:64]
Bs = [rand(1024, 1024) for _ in 1:64]

res = matmulsums(As, Bs);
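For reference, the complete sequential function can be sketched as follows (part of the body is elided in the diff hunk above, so this is a reconstruction from the surrounding explanation, not necessarily the author's exact code; the name `matmulsums_sketch` is chosen here to avoid clashing with the original):

```julia
using LinearAlgebra: mul!

# Sequential reference implementation: one output matrix C is allocated
# up front and re-used for every matrix pair via the in-place mul!.
function matmulsums_sketch(As, Bs)
    N = size(first(As), 1)
    C = Matrix{Float64}(undef, N, N)  # allocated once, re-used in every iteration
    map(As, Bs) do A, B
        mul!(C, A, B)  # computes A * B into C without a temporary allocation
        sum(C)
    end
end
```

The pre-allocated `C` is exactly what makes the naive parallelization below racy: the single buffer is shared by all iterations.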

# ## Parallelization
#
# The key idea for creating a parallel version of `matmulsums` is to replace the `map` by
# OhMyThreads' parallel [`tmap`](@ref) function. However, because we re-use `C`, this isn't
# entirely trivial.
#
# ### The wrong way
#
# Someone new to parallel computing might be tempted to parallelize `matmulsums` like so:
using OhMyThreads: tmap

function matmulsums_race(As, Bs)
N = size(first(As), 1)
C = Matrix{Float64}(undef, N, N)
@@ -20,6 +54,20 @@ function matmulsums_race(As, Bs)
end
end

# Unfortunately, this doesn't produce the correct result.

res_race = matmulsums_race(As, Bs)
res ≈ res_race

# In fact, it doesn't even always produce the same result (check for yourself)!
# The reason is a race condition: different parallel
# tasks try to use the shared variable `C` simultaneously, leading to
# non-deterministic behavior. Let's see how we can fix this.
#
# ### The naive (and inefficient) way
#
# A simple solution for the race condition issue above is to move the allocation of `C`
# into the body of the parallel `tmap`:
function matmulsums_naive(As, Bs)
N = size(first(As), 1)
tmap(As, Bs) do A, B
@@ -29,23 +77,61 @@ function matmulsums_naive(As, Bs)
end
end

# In this case, a separate `C` will be allocated for each iteration such that parallel tasks
# don't modify shared state anymore. Hence, we'll get the desired result.

res_naive = matmulsums_naive(As, Bs)
res ≈ res_naive

# However, this variant is obviously inefficient because it is no better than just writing
# `C = A*B` and thus leads to one allocation per matrix pair. We need a different way of
# allocating and re-using `C` for an efficient parallel version.
#
# ## The right way: `TaskLocalValue`
#
# We've seen that we can't allocate `C` once up-front (→ race condition) and also shouldn't
# allocate it within the `tmap` (→ one allocation per iteration). What we actually want is
# to once allocate a separate `C` on each parallel task and then re-use this **task-local**
# `C` for all iterations (i.e. matrix pairs) that said task is responsible for.
#
# The way to express this idea is `TaskLocalValue` and looks like this:
using OhMyThreads: TaskLocalValue

function matmulsums_tls(As, Bs)
N = size(first(As), 1)
storage = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
tmap(As, Bs) do A, B
C = storage[]
C = tls[]
mul!(C, A, B)
sum(C)
end
end

res_tls = matmulsums_tls(As, Bs)
res ≈ res_tls

# Here, `TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))` defines a
# task-local storage `tls` that behaves as follows: the first time the storage is accessed
# (`tls[]`) from a task, a task-local value is created according to the anonymous function
# (here, the task-local value will be a matrix) and stored in the storage. Every
# subsequent storage query from the same task(!) will simply return the task-local value.
# Hence, this is precisely what we need and will only lead to O(# parallel tasks)
# allocations.
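The access semantics described above can be illustrated with a minimal, self-contained sketch (illustrative only; not part of the original file):

```julia
using OhMyThreads: TaskLocalValue

# A task-local buffer: the init function runs at most once per task.
tls = TaskLocalValue{Vector{Float64}}(() -> zeros(3))

buf1 = tls[]  # first access on this task: the anonymous function runs, value is stored
buf2 = tls[]  # later accesses on the same task return the stored value
buf1 === buf2  # same object - no second allocation on this task
```

A different parallel task accessing `tls[]` would trigger the init function again and get its own buffer, which is why the total number of allocations scales with the number of tasks rather than the number of iterations.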
#
# ## The performant but cumbersome way
#
# Before we benchmark and compare the performance of all discussed variants, let's implement
# the idea of a task-local `C` for each parallel task manually.
using OhMyThreads: chunks, @spawn
using Base.Threads: nthreads

function matmulsums_manual(As, Bs)
N = size(first(As), 1)
tasks = map(chunks(As; n = nthreads())) do idcs
@spawn begin
local C = Matrix{Float64}(undef, N, N)
local results = Vector{Float64}(undef, length(idcs))
@inbounds for (i, idx) in enumerate(idcs)
for (i, idx) in enumerate(idcs)
mul!(C, As[idx], Bs[idx])
results[i] = sum(C)
end
@@ -55,25 +141,28 @@ function matmulsums_manual(As, Bs)
reduce(vcat, fetch.(tasks))
end

BLAS.set_num_threads(1) # to avoid potential oversubscription

As = [rand(1024, 1024) for _ in 1:64]
Bs = [rand(1024, 1024) for _ in 1:64]

res = matmulsums(As, Bs)
res_race = matmulsums_race(As, Bs)
res_naive = matmulsums_naive(As, Bs)
res_tls = matmulsums_tls(As, Bs)
res_manual = matmulsums_manual(As, Bs)

res ≈ res_race
res ≈ res_naive
res ≈ res_tls
res ≈ res_manual

# The first thing to note is pretty obvious: This is very cumbersome and you probably don't
# want to write it. But let's take a closer look and see what's happening here.
# First, we divide the number of matrix pairs into `nthreads()` chunks. Then, for each of
# those chunks, we spawn a parallel task that (1) allocates a task-local `C` matrix (and a
# `results` vector) and (2) performs the actual computations using these pre-allocated
# values. Finally, we `fetch` the results of the tasks and combine them.
#
# ## Benchmark
#
# The whole point of parallelization is increasing performance, so let's benchmark and
# compare the performance of the variants discussed above.

using BenchmarkTools

@btime matmulsums($As, $Bs);
@btime matmulsums_naive($As, $Bs);
@btime matmulsums_tls($As, $Bs);
@btime matmulsums_manual($As, $Bs);

# As we see, the recommended version `matmulsums_tls` is both convenient and
# efficient: it allocates much less memory than `matmulsums_naive` and only slightly
# more than the manual implementation.