
Commit

build based on c05ffa9
Documenter.jl committed Feb 6, 2024
1 parent 51b9a1f commit 9f44df2
Showing 11 changed files with 145 additions and 65 deletions.
2 changes: 1 addition & 1 deletion previews/PR38/.documenter-siteinfo.json
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-05T17:58:45","documenter_version":"1.2.1"}}
{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-06T16:18:05","documenter_version":"1.2.1"}}
2 changes: 1 addition & 1 deletion previews/PR38/examples/integration/integration/index.html
@@ -22,4 +22,4 @@
@btime trapezoidal(0, 1, $N);
@btime trapezoidal_parallel(0, 1, $N);</code></pre><pre><code class="nohighlight hljs"> 13.871 ms (0 allocations: 0 bytes)
2.781 ms (38 allocations: 3.19 KiB)
</code></pre><p>Because the problem is trivially parallel - all threads do the same thing and don&#39;t need to communicate - we expect an ideal speedup of (close to) the number of available threads.</p><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../juliaset/juliaset/">« Julia Set</a><a class="docs-footer-nextpage" href="../../../translation/">Translation Guide »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Monday 5 February 2024 17:58">Monday 5 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
</code></pre><p>Because the problem is trivially parallel - all threads do the same thing and don&#39;t need to communicate - we expect an ideal speedup of (close to) the number of available threads.</p><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../juliaset/juliaset/">« Julia Set</a><a class="docs-footer-nextpage" href="../../../translation/">Translation Guide »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Tuesday 6 February 2024 16:18">Tuesday 6 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
2 changes: 1 addition & 1 deletion previews/PR38/examples/juliaset/juliaset/index.html
@@ -52,4 +52,4 @@
63.707 ms (39 allocations: 3.30 KiB)
</code></pre><p>As hoped, the parallel implementation is faster. But can we improve the performance further?</p><h3 id="Tuning-nchunks"><a class="docs-heading-anchor" href="#Tuning-nchunks">Tuning <code>nchunks</code></a><a id="Tuning-nchunks-1"></a><a class="docs-heading-anchor-permalink" href="#Tuning-nchunks" title="Permalink"></a></h3><p>As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase <code>nchunks</code> to a value larger than <code>nthreads</code>. This divides the overall workload into smaller tasks that can be dynamically distributed among threads (by Julia&#39;s scheduler) to balance the per-thread load.</p><pre><code class="language-julia hljs">@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;</code></pre><pre><code class="nohighlight hljs"> 32.000 ms (12013 allocations: 1.14 MiB)
</code></pre><p>Note that if we opt out of dynamic scheduling and set <code>schedule=:static</code>, this strategy doesn&#39;t help anymore (because chunks are naively distributed up front).</p><pre><code class="language-julia hljs">@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;</code></pre><pre><code class="nohighlight hljs"> 63.439 ms (42 allocations: 3.37 KiB)
</code></pre><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../mc/mc/">« Parallel Monte Carlo</a><a class="docs-footer-nextpage" href="../../integration/integration/">Trapezoidal Integration »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Monday 5 February 2024 17:58">Monday 5 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
</code></pre><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../mc/mc/">« Parallel Monte Carlo</a><a class="docs-footer-nextpage" href="../../integration/integration/">Trapezoidal Integration »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Tuesday 6 February 2024 16:18">Tuesday 6 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
2 changes: 1 addition & 1 deletion previews/PR38/examples/mc/mc/index.html
@@ -51,4 +51,4 @@

@btime mc($(length(idcs))) samples=10 evals=3;</code></pre><pre><code class="nohighlight hljs"> 87.617 ms (0 allocations: 0 bytes)
63.398 ms (0 allocations: 0 bytes)
</code></pre><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../../">« OhMyThreads</a><a class="docs-footer-nextpage" href="../../juliaset/juliaset/">Julia Set »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Monday 5 February 2024 17:58">Monday 5 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
</code></pre><hr/><p><em>This page was generated using <a href="https://github.com/fredrikekre/Literate.jl">Literate.jl</a>.</em></p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../../">« OhMyThreads</a><a class="docs-footer-nextpage" href="../../juliaset/juliaset/">Julia Set »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option><option value="auto">Automatic (OS)</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 1.2.1 on <span class="colophon-date" title="Tuesday 6 February 2024 16:18">Tuesday 6 February 2024</span>. Using Julia version 1.10.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>
125 changes: 107 additions & 18 deletions previews/PR38/examples/tls/tls.jl
@@ -1,6 +1,16 @@
using OhMyThreads: TaskLocalValue, tmap, chunks
# # Task-Local Storage
#
# For some programs, it can be useful or even necessary to allocate and (re-)use memory in
# your parallel code. The following section uses a simple example to explain how task-local
# values can be efficiently created and (re-)used.
#
# ## Sequential
#
# Let's say that we are given two arrays of (square) matrices, `As` and `Bs`, and let's
# further assume that our goal is to compute the total sum of all pairwise matrix products.
# We can readily implement a (sequential) function that performs the necessary computations.
using LinearAlgebra: mul!, BLAS
using Base.Threads: nthreads, @spawn
BLAS.set_num_threads(1) # for simplicity, we turn off OpenBLAS multithreading

function matmulsums(As, Bs)
N = size(first(As), 1)
@@ -11,6 +21,30 @@ function matmulsums(As, Bs)
end
end

# Here, we use `map` to perform the desired operation for each pair of matrices,
# `A` and `B`. However, the crucial point for our discussion is that we use the in-place
# matrix multiplication `LinearAlgebra.mul!` in conjunction with a pre-allocated output
# matrix `C`. This is to avoid the temporary allocation per "iteration" (i.e. per matrix
# pair) that we would get with `C = A*B`.
#
# For later comparison, we generate some random input data and store the result.

As = [rand(1024, 1024) for _ in 1:64]
Bs = [rand(1024, 1024) for _ in 1:64]

res = matmulsums(As, Bs);
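For reference, the complete sequential function can be sketched as follows (part of the body is elided in the diff hunk above, so this is a reconstruction from the surrounding explanation, not necessarily the author's exact code; the name `matmulsums_sketch` is chosen here to avoid clashing with the original):

```julia
using LinearAlgebra: mul!

# Sequential reference implementation: one output matrix C is allocated
# up front and re-used for every matrix pair via the in-place mul!.
function matmulsums_sketch(As, Bs)
    N = size(first(As), 1)
    C = Matrix{Float64}(undef, N, N)  # allocated once, re-used in every iteration
    map(As, Bs) do A, B
        mul!(C, A, B)  # computes A * B into C without a temporary allocation
        sum(C)
    end
end
```

The pre-allocated `C` is exactly what makes the naive parallelization below racy: the single buffer is shared by all iterations.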

# ## Parallelization
#
# The key idea for creating a parallel version of `matmulsums` is to replace the `map` by
# OhMyThreads' parallel [`tmap`](@ref) function. However, because we re-use `C`, this isn't
# entirely trivial.
#
# ### The wrong way
#
# Someone new to parallel computing might be tempted to parallelize `matmulsums` like so:
using OhMyThreads: tmap

function matmulsums_race(As, Bs)
N = size(first(As), 1)
C = Matrix{Float64}(undef, N, N)
@@ -20,6 +54,20 @@ function matmulsums_race(As, Bs)
end
end

# Unfortunately, this doesn't produce the correct result.

res_race = matmulsums_race(As, Bs)
res ≈ res_race

# In fact, it doesn't even always produce the same result (check for yourself)!
# The reason is a race condition: different parallel
# tasks try to use the shared variable `C` simultaneously, leading to
# non-deterministic behavior. Let's see how we can fix this.
#
# ### The naive (and inefficient) way
#
# A simple solution for the race condition issue above is to move the allocation of `C`
# into the body of the parallel `tmap`:
function matmulsums_naive(As, Bs)
N = size(first(As), 1)
tmap(As, Bs) do A, B
@@ -29,23 +77,61 @@ function matmulsums_naive(As, Bs)
end
end

# In this case, a separate `C` will be allocated for each iteration such that parallel tasks
# don't modify shared state anymore. Hence, we'll get the desired result.

res_naive = matmulsums_naive(As, Bs)
res ≈ res_naive

# However, this variant is obviously inefficient because it is no better than just writing
# `C = A*B` and thus leads to one allocation per matrix pair. We need a different way of
# allocating and re-using `C` for an efficient parallel version.
#
# ## The right way: `TaskLocalValue`
#
# We've seen that we can't allocate `C` once up-front (→ race condition) and also shouldn't
# allocate it within the `tmap` (→ one allocation per iteration). What we actually want is
# to once allocate a separate `C` on each parallel task and then re-use this **task-local**
# `C` for all iterations (i.e. matrix pairs) that said task is responsible for.
#
# The way to express this idea is `TaskLocalValue` and looks like this:
using OhMyThreads: TaskLocalValue

function matmulsums_tls(As, Bs)
N = size(first(As), 1)
storage = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
tmap(As, Bs) do A, B
C = storage[]
C = tls[]
mul!(C, A, B)
sum(C)
end
end

res_tls = matmulsums_tls(As, Bs)
res ≈ res_tls

# Here, `TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))` defines a
# task-local storage `tls` that behaves as follows: the first time the storage is accessed
# (`tls[]`) from a task, a task-local value is created according to the anonymous function
# (here, the task-local value will be a matrix) and stored in the storage. Every
# subsequent storage query from the same task(!) will simply return the task-local value.
# Hence, this is precisely what we need and will only lead to O(# parallel tasks)
# allocations.
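The access semantics described above can be illustrated with a minimal, self-contained sketch (illustrative only; not part of the original file):

```julia
using OhMyThreads: TaskLocalValue

# A task-local buffer: the init function runs at most once per task.
tls = TaskLocalValue{Vector{Float64}}(() -> zeros(3))

buf1 = tls[]  # first access on this task: the anonymous function runs, value is stored
buf2 = tls[]  # later accesses on the same task return the stored value
buf1 === buf2  # same object - no second allocation on this task
```

A different parallel task accessing `tls[]` would trigger the init function again and get its own buffer, which is why the total number of allocations scales with the number of tasks rather than the number of iterations.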
#
# ## The performant but cumbersome way
#
# Before we benchmark and compare the performance of all discussed variants, let's implement
# the idea of a task-local `C` for each parallel task manually.
using OhMyThreads: chunks, @spawn
using Base.Threads: nthreads

function matmulsums_manual(As, Bs)
N = size(first(As), 1)
tasks = map(chunks(As; n = nthreads())) do idcs
@spawn begin
local C = Matrix{Float64}(undef, N, N)
local results = Vector{Float64}(undef, length(idcs))
@inbounds for (i, idx) in enumerate(idcs)
for (i, idx) in enumerate(idcs)
mul!(C, As[idx], Bs[idx])
results[i] = sum(C)
end
@@ -55,25 +141,28 @@ function matmulsums_manual(As, Bs)
reduce(vcat, fetch.(tasks))
end

BLAS.set_num_threads(1) # to avoid potential oversubscription

As = [rand(1024, 1024) for _ in 1:64]
Bs = [rand(1024, 1024) for _ in 1:64]

res = matmulsums(As, Bs)
res_race = matmulsums_race(As, Bs)
res_naive = matmulsums_naive(As, Bs)
res_tls = matmulsums_tls(As, Bs)
res_manual = matmulsums_manual(As, Bs)

res ≈ res_race
res ≈ res_naive
res ≈ res_tls
res ≈ res_manual

# The first thing to note is pretty obvious: This is very cumbersome and you probably don't
# want to write it. But let's take a closer look and see what's happening here.
# First, we divide the number of matrix pairs into `nthreads()` chunks. Then, for each of
# those chunks, we spawn a parallel task that (1) allocates a task-local `C` matrix (and a
# `results` vector) and (2) performs the actual computations using these pre-allocated
# values. Finally, we `fetch` the results of the tasks and combine them.
#
# ## Benchmark
#
# The whole point of parallelization is increasing performance, so let's benchmark and
# compare the performance of the variants discussed above.

using BenchmarkTools

@btime matmulsums($As, $Bs);
@btime matmulsums_naive($As, $Bs);
@btime matmulsums_tls($As, $Bs);
@btime matmulsums_manual($As, $Bs);

# As we see, the recommended version `matmulsums_tls` is both convenient and
# efficient: it allocates much less memory than `matmulsums_naive` and only slightly
# more than the manual implementation.