From 9f44df26ea42af6037a849529ead6f0985002c84 Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Tue, 6 Feb 2024 16:18:08 +0000 Subject: [PATCH] build based on c05ffa9 --- previews/PR38/.documenter-siteinfo.json | 2 +- .../integration/integration/index.html | 2 +- .../examples/juliaset/juliaset/index.html | 2 +- previews/PR38/examples/mc/mc/index.html | 2 +- previews/PR38/examples/tls/tls.jl | 125 +++++++++++++++--- previews/PR38/examples/tls/tls/index.html | 55 ++++---- previews/PR38/index.html | 2 +- previews/PR38/refs/api/index.html | 14 +- previews/PR38/refs/internal/index.html | 2 +- previews/PR38/search_index.js | 2 +- previews/PR38/translation/index.html | 2 +- 11 files changed, 145 insertions(+), 65 deletions(-) diff --git a/previews/PR38/.documenter-siteinfo.json b/previews/PR38/.documenter-siteinfo.json index 5e928f92..0b780938 100644 --- a/previews/PR38/.documenter-siteinfo.json +++ b/previews/PR38/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-05T17:58:45","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-06T16:18:05","documenter_version":"1.2.1"}} \ No newline at end of file diff --git a/previews/PR38/examples/integration/integration/index.html b/previews/PR38/examples/integration/integration/index.html index f78abf7e..8f554fca 100644 --- a/previews/PR38/examples/integration/integration/index.html +++ b/previews/PR38/examples/integration/integration/index.html @@ -22,4 +22,4 @@ @btime trapezoidal(0, 1, $N); @btime trapezoidal_parallel(0, 1, $N);
  13.871 ms (0 allocations: 0 bytes)
   2.781 ms (38 allocations: 3.19 KiB)
-

Because the problem is trivially parallel - all threads do the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.


This page was generated using Literate.jl.

+

Because the problem is trivially parallel - all threads do the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.
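If you want to experiment with this yourself, the following is a minimal, self-contained sketch of such a trivially parallel integration based on OhMyThreads' tmapreduce. The trapezoidal_sketch helper below is an illustrative stand-in and not necessarily identical to the implementation benchmarked above.

using OhMyThreads: tmapreduce
using Base.Threads: nthreads

# illustrative composite trapezoidal rule for ∫ f(x) dx on [a, b] with n intervals
function trapezoidal_sketch(f, a, b, n)
    h = (b - a) / n
    acc = (f(a) + f(b)) / 2
    for i in 1:(n - 1)
        acc += f(a + i * h)
    end
    return acc * h
end

# parallel version: one subinterval per task, partial integrals summed with tmapreduce
function trapezoidal_parallel_sketch(f, a, b, N; ntasks = nthreads())
    n = N ÷ ntasks                  # intervals per task (assumes ntasks divides N)
    h = (b - a) / N
    return tmapreduce(+, 1:ntasks) do i
        α = a + (i - 1) * n * h     # left endpoint of this task's subinterval
        trapezoidal_sketch(f, α, α + n * h, n)
    end
end

trapezoidal_parallel_sketch(x -> 4 * sqrt(1 - x^2), 0, 1, 10_000_000)  # ≈ π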


This page was generated using Literate.jl.

diff --git a/previews/PR38/examples/juliaset/juliaset/index.html b/previews/PR38/examples/juliaset/juliaset/index.html index acf270d2..e335dd45 100644 --- a/previews/PR38/examples/juliaset/juliaset/index.html +++ b/previews/PR38/examples/juliaset/juliaset/index.html @@ -52,4 +52,4 @@ 63.707 ms (39 allocations: 3.30 KiB)

As hoped, the parallel implementation is faster. But can we improve the performance further?

Tuning nchunks

As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks that can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.

@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;
  32.000 ms (12013 allocations: 1.14 MiB)
 

Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).

@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;
  63.439 ms (42 allocations: 3.37 KiB)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/previews/PR38/examples/mc/mc/index.html b/previews/PR38/examples/mc/mc/index.html index 795fd192..7c5fa89f 100644 --- a/previews/PR38/examples/mc/mc/index.html +++ b/previews/PR38/examples/mc/mc/index.html @@ -51,4 +51,4 @@ @btime mc($(length(idcs))) samples=10 evals=3;
  87.617 ms (0 allocations: 0 bytes)
   63.398 ms (0 allocations: 0 bytes)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/previews/PR38/examples/tls/tls.jl b/previews/PR38/examples/tls/tls.jl index d3f88c8b..f0de5de6 100644 --- a/previews/PR38/examples/tls/tls.jl +++ b/previews/PR38/examples/tls/tls.jl @@ -1,6 +1,16 @@ -using OhMyThreads: TaskLocalValue, tmap, chunks +# # Task-Local Storage +# +# For some programs, it can be useful or even necessary to allocate and (re-)use memory in +# your parallel code. The following section uses a simple example to explain how task-local +# values can be efficiently created and (re-)used. +# +# ## Sequential +# +# Let's say that we are given two arrays of (square) matrices, `As` and `Bs`, and let's +# further assume that our goal is to compute the total sum of all pairwise matrix products. +# We can readily implement a (sequential) function that performs the necessary computations. using LinearAlgebra: mul!, BLAS -using Base.Threads: nthreads, @spawn +BLAS.set_num_threads(1) # for simplicity, we turn of OpenBLAS multithreading function matmulsums(As, Bs) N = size(first(As), 1) @@ -11,6 +21,30 @@ function matmulsums(As, Bs) end end +# Here, we use `map` to perform the desired operation for each pair of matrices, +# `A` and `B`. However, the crucial point for our discussion is that we use the in-place +# matrix multiplication `LinearAlgebra.mul!` in conjunction with a pre-allocated output +# matrix `C`. This is to avoid the temporary allocation per "iteration" (i.e. per matrix +# pair) that we would get with `C = A*B`. +# +# For later comparison, we generate some random input data and store the result. + +As = [rand(1024, 1024) for _ in 1:64] +Bs = [rand(1024, 1024) for _ in 1:64] + +res = matmulsums(As, Bs); + +# ## Parallelization +# +# The key idea for creating a parallel version of `matmulsums` is to replace the `map` by +# OhMyThreads' parallel [`tmap`](@ref) function. However, because we re-use `C`, this isn't +# entirely trivial. +# +# ### The wrong way +# +# Someone new to parallel computing might be tempted to parallelize `matmulsums` like so: +using OhMyThreads: tmap + function matmulsums_race(As, Bs) N = size(first(As), 1) C = Matrix{Float64}(undef, N, N) @@ -20,6 +54,20 @@ function matmulsums_race(As, Bs) end end +# Unfortunately, this doesn't produce the correct result. + +res_race = matmulsums_race(As, Bs) +res ≈ res_race + +# In fact, It doesn't even always produce the same result (check for yourself)! +# The reason is that there is a race condition: different parallel +# tasks are trying to use the shared variable `C` simultaneously leading to +# non-deterministic behavior. Let's see how we can fix this. +# +# ### The naive (and inefficient) way +# +# A simple solution for the race condition issue above is to move the allocation of `C` +# into the body of the parallel `tmap`: function matmulsums_naive(As, Bs) N = size(first(As), 1) tmap(As, Bs) do A, B @@ -29,23 +77,61 @@ function matmulsums_naive(As, Bs) end end +# In this case, a separate `C` will be allocated for each iteration such that parallel tasks +# don't modify shared state anymore. Hence, we'll get the desired result. + +res_naive = matmulsums_naive(As, Bs) +res ≈ res_naive + +# However, this variant is obviously inefficient because it is no better than just writing +# `C = A*B` and thus leads to one allocation per matrix pair. We need a different way of +# allocating and re-using `C` for an efficient parallel version. 
+# +# ## The right way: `TaskLocalValue` +# +# We've seen that we can't allocate `C` once up-front (→ race condition) and also shouldn't +# allocate it within the `tmap` (→ one allocation per iteration). What we actually want is +# to once allocate a separate `C` on each parallel task and then re-use this **task-local** +# `C` for all iterations (i.e. matrix pairs) that said task is responsible for. +# +# The way to express this idea is `TaskLocalValue` and looks like this: +using OhMyThreads: TaskLocalValue + function matmulsums_tls(As, Bs) N = size(first(As), 1) - storage = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) + tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) tmap(As, Bs) do A, B - C = storage[] + C = tls[] mul!(C, A, B) sum(C) end end +res_tls = matmulsums_tls(As, Bs) +res ≈ res_tls + +# Here, `TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))` defines a +# task-local storage `tls` that behaves like this: The first time the storage is accessed +# (`tls[]`) from a task a task-local value is created according to the anonymous function +# (here, the task-local value will be a matrix) and stored in the storage. Afterwards, +# every other storage query from the same task(!) will simply return the task-local value. +# Hence, this is precisely what we need and will only lead to O(# parallel tasks) +# allocations. +# +# ## The performant but cumbersome way +# +# Before we benchmark and compare the performance of all discussed variants, let's implement +# the idea of a task-local `C` for each parallel task manually. +using OhMyThreads: chunks, @spawn +using Base.Threads: nthreads + function matmulsums_manual(As, Bs) N = size(first(As), 1) tasks = map(chunks(As; n = nthreads())) do idcs @spawn begin local C = Matrix{Float64}(undef, N, N) local results = Vector{Float64}(undef, length(idcs)) - @inbounds for (i, idx) in enumerate(idcs) + for (i, idx) in enumerate(idcs) mul!(C, As[idx], Bs[idx]) results[i] = sum(C) end @@ -55,25 +141,28 @@ function matmulsums_manual(As, Bs) reduce(vcat, fetch.(tasks)) end -BLAS.set_num_threads(1) # to avoid potential oversubscription - -As = [rand(1024, 1024) for _ in 1:64] -Bs = [rand(1024, 1024) for _ in 1:64] - -res = matmulsums(As, Bs) -res_race = matmulsums_race(As, Bs) -res_naive = matmulsums_naive(As, Bs) -res_tls = matmulsums_tls(As, Bs) res_manual = matmulsums_manual(As, Bs) - -res ≈ res_race -res ≈ res_naive -res ≈ res_tls res ≈ res_manual +# The first thing to note is pretty obvious: This is very cumbersome and you probably don't +# want to write it. But let's take a closer look and see what's happening here. +# First, we divide the number of matrix pairs into `nthreads()` chunks. Then, for each of +# those chunks, we spawn a parallel task that (1) allocates a task-local `C` matrix (and a +# `results` vector) and (2) performs the actual computations using these pre-allocated +# values. Finally, we `fetch` the results of the tasks and combine them. +# +# ## Benchmark +# +# The whole point of parallelization is increasing performance, so let's benchmark and +# compare the performance of the variants discussed above. + using BenchmarkTools @btime matmulsums($As, $Bs); @btime matmulsums_naive($As, $Bs); @btime matmulsums_tls($As, $Bs); @btime matmulsums_manual($As, $Bs); + +# As we see, the recommened version `matmulsums_tls` is both convenient as well as +# efficient: It allocates much less memory than `matmulsums_naive` and only slightly +# more than the manual implementation. 
diff --git a/previews/PR38/examples/tls/tls/index.html b/previews/PR38/examples/tls/tls/index.html index 26935816..29cc0311 100644 --- a/previews/PR38/examples/tls/tls/index.html +++ b/previews/PR38/examples/tls/tls/index.html @@ -1,7 +1,6 @@ -Task-Local Storage · OhMyThreads.jl
using OhMyThreads: TaskLocalValue, tmap, chunks
-using LinearAlgebra: mul!, BLAS
-using Base.Threads: nthreads, @spawn
+Task-Local Storage · OhMyThreads.jl

Task-Local Storage

For some programs, it can be useful or even necessary to allocate and (re-)use memory in your parallel code. The following section uses a simple example to explain how task-local values can be efficiently created and (re-)used.

Sequential

Let's say that we are given two arrays of (square) matrices, As and Bs, and let's further assume that our goal is to compute the total sum of all pairwise matrix products. We can readily implement a (sequential) function that performs the necessary computations.

using LinearAlgebra: mul!, BLAS
+BLAS.set_num_threads(1) # for simplicity, we turn off OpenBLAS multithreading
 
 function matmulsums(As, Bs)
     N = size(first(As), 1)
@@ -10,7 +9,10 @@
         mul!(C, A, B)
         sum(C)
     end
-end
+end
matmulsums (generic function with 1 method)

Here, we use map to perform the desired operation for each pair of matrices, A and B. However, the crucial point for our discussion is that we use the in-place matrix multiplication LinearAlgebra.mul! in conjunction with a pre-allocated output matrix C. This is to avoid the temporary allocation per "iteration" (i.e. per matrix pair) that we would get with C = A*B.
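To make the allocation argument concrete, here is a small illustrative check (not part of the original example) that compares the out-of-place and in-place products with @allocated. The exact numbers depend on the matrix size and BLAS, but once compiled the in-place call should report (close to) zero bytes.

using LinearAlgebra: mul!

A = rand(256, 256); B = rand(256, 256); C = similar(A)
A * B; mul!(C, A, B)                           # warm up / compile both paths
@allocated(A * B), @allocated(mul!(C, A, B))   # e.g. (524336, 0): fresh result matrix vs. none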

For later comparison, we generate some random input data and store the result.

As = [rand(1024, 1024) for _ in 1:64]
+Bs = [rand(1024, 1024) for _ in 1:64]
+
+res = matmulsums(As, Bs);

Parallelization

The key idea for creating a parallel version of matmulsums is to replace the map by OhMyThreads' parallel tmap function. However, because we re-use C, this isn't entirely trivial.

The wrong way

Someone new to parallel computing might be tempted to parallelize matmulsums like so:

using OhMyThreads: tmap
 
 function matmulsums_race(As, Bs)
     N = size(first(As), 1)
@@ -19,34 +21,38 @@
         mul!(C, A, B)
         sum(C)
     end
-end
-
-function matmulsums_naive(As, Bs)
+end
matmulsums_race (generic function with 1 method)

Unfortunately, this doesn't produce the correct result.

res_race = matmulsums_race(As, Bs)
+res ≈ res_race
false

In fact, it doesn't even always produce the same result (check for yourself)! The reason is that there is a race condition: different parallel tasks are trying to use the shared variable C simultaneously, leading to non-deterministic behavior. Let's see how we can fix this.
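If you want to see the non-determinism for yourself, a quick illustrative check (using the As, Bs, and matmulsums_race from above) is to run the racy version a few times and compare the outcomes; with more than one Julia thread they will typically differ.

racy_runs = [matmulsums_race(As, Bs) for _ in 1:5]   # repeat the racy computation
all(==(first(racy_runs)), racy_runs)                 # typically false when nthreads() > 1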

The naive (and inefficient) way

A simple solution for the race condition issue above is to move the allocation of C into the body of the parallel tmap:

function matmulsums_naive(As, Bs)
     N = size(first(As), 1)
     tmap(As, Bs) do A, B
         C = Matrix{Float64}(undef, N, N)
         mul!(C, A, B)
         sum(C)
     end
-end
+end
matmulsums_naive (generic function with 1 method)

In this case, a separate C will be allocated for each iteration such that parallel tasks don't modify shared state anymore. Hence, we'll get the desired result.

res_naive = matmulsums_naive(As, Bs)
+res ≈ res_naive
true

However, this variant is obviously inefficient because it is no better than just writing C = A*B and thus leads to one allocation per matrix pair. We need a different way of allocating and re-using C for an efficient parallel version.

The right way: TaskLocalValue

We've seen that we can't allocate C once up-front (→ race condition) and also shouldn't allocate it within the tmap (→ one allocation per iteration). What we actually want is to allocate a separate C once on each parallel task and then re-use this task-local C for all iterations (i.e. matrix pairs) that said task is responsible for.

The way to express this idea is TaskLocalValue and looks like this:

using OhMyThreads: TaskLocalValue
 
 function matmulsums_tls(As, Bs)
     N = size(first(As), 1)
-    storage = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
+    tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
     tmap(As, Bs) do A, B
-        C = storage[]
+        C = tls[]
         mul!(C, A, B)
         sum(C)
     end
 end
 
+res_tls = matmulsums_tls(As, Bs)
+res ≈ res_tls
true

Here, TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) defines a task-local storage tls that behaves like this: The first time the storage is accessed (tls[]) from a task, a task-local value is created according to the anonymous function (here, the task-local value will be a matrix) and stored in the storage. Afterwards, every subsequent storage query from the same task(!) will simply return the task-local value. Hence, this is precisely what we need and will only lead to O(# parallel tasks) allocations.
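To see this behavior in isolation from the matrix example, here is a small, self-contained sketch (illustrative, not from the original text): the init function runs once per task that touches the storage, no matter how many iterations that task processes.

using OhMyThreads: TaskLocalValue, tmap
using Base.Threads: Atomic, atomic_add!

ninit = Atomic{Int}(0)                        # counts how often the init function runs
tls_demo = TaskLocalValue{Vector{Float64}}(() -> begin
    atomic_add!(ninit, 1)                     # one increment per task that uses the storage
    zeros(3)                                  # the task-local buffer
end)

tmap(1:100) do i
    buf = tls_demo[]   # first access on a task creates the buffer, later accesses reuse it
    buf .= i
    sum(buf)
end

ninit[]   # at most the number of parallel tasks (nthreads() by default), not 100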

The performant but cumbersome way

Before we benchmark and compare the performance of all discussed variants, let's implement the idea of a task-local C for each parallel task manually.

using OhMyThreads: chunks, @spawn
+using Base.Threads: nthreads
+
 function matmulsums_manual(As, Bs)
     N = size(first(As), 1)
     tasks = map(chunks(As; n = nthreads())) do idcs
         @spawn begin
             local C = Matrix{Float64}(undef, N, N)
             local results = Vector{Float64}(undef, length(idcs))
-            @inbounds for (i, idx) in enumerate(idcs)
+            for (i, idx) in enumerate(idcs)
                 mul!(C, As[idx], Bs[idx])
                 results[i] = sum(C)
             end
@@ -56,29 +62,14 @@
     reduce(vcat, fetch.(tasks))
 end
 
-BLAS.set_num_threads(1) # to avoid potential oversubscription
-
-As = [rand(1024, 1024) for _ in 1:64]
-Bs = [rand(1024, 1024) for _ in 1:64]
-
-res = matmulsums(As, Bs)
-res_race = matmulsums_race(As, Bs)
-res_naive = matmulsums_naive(As, Bs)
-res_tls = matmulsums_tls(As, Bs)
 res_manual = matmulsums_manual(As, Bs)
-
-res ≈ res_race
-res ≈ res_naive
-res ≈ res_tls
-res ≈ res_manual
-
-using BenchmarkTools
+res ≈ res_manual
true

The first thing to note is pretty obvious: This is very cumbersome and you probably don't want to write it. But let's take a closer look and see what's happening here. First, we divide the number of matrix pairs into nthreads() chunks. Then, for each of those chunks, we spawn a parallel task that (1) allocates a task-local C matrix (and a results vector) and (2) performs the actual computations using these pre-allocated values. Finally, we fetch the results of the tasks and combine them.

Benchmark

The whole point of parallelization is increasing performance, so let's benchmark and compare the performance of the variants discussed above.

using BenchmarkTools
 
 @btime matmulsums($As, $Bs);
 @btime matmulsums_naive($As, $Bs);
 @btime matmulsums_tls($As, $Bs);
-@btime matmulsums_manual($As, $Bs);
  3.107 s (3 allocations: 8.00 MiB)
-  686.432 ms (174 allocations: 512.01 MiB)
-  792.403 ms (67 allocations: 40.01 MiB)
-  684.626 ms (51 allocations: 40.00 MiB)
-

This page was generated using Literate.jl.

+@btime matmulsums_manual($As, $Bs);
  2.916 s (3 allocations: 8.00 MiB)
+  597.915 ms (174 allocations: 512.01 MiB)
+  575.507 ms (67 allocations: 40.01 MiB)
+  572.501 ms (49 allocations: 40.00 MiB)
+

As we see, the recommended version matmulsums_tls is both convenient and efficient: It allocates much less memory than matmulsums_naive and only slightly more than the manual implementation.


This page was generated using Literate.jl.

diff --git a/previews/PR38/index.html b/previews/PR38/index.html index d66aa83d..50a3973f 100644 --- a/previews/PR38/index.html +++ b/previews/PR38/index.html @@ -18,4 +18,4 @@ @btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread @btime mc_parallel($N) # running with all 5 Julia threads

Timings might be something like this:

438.394 ms (7 allocations: 624 bytes)
-88.050 ms (37 allocations: 3.02 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

+88.050 ms (37 allocations: 3.02 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

diff --git a/previews/PR38/refs/api/index.html b/previews/PR38/refs/api/index.html index c2f0308d..d15af908 100644 --- a/previews/PR38/refs/api/index.html +++ b/previews/PR38/refs/api/index.html @@ -4,29 +4,29 @@ nchunks::Int = nthreads(), split::Symbol = :batch, schedule::Symbol =:dynamic, - outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmapreduce.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
+           outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmapreduce.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.
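For illustration (values chosen arbitrarily, not part of the original docstring), a call that uses some of these keyword arguments could look as follows:

using OhMyThreads: tmapreduce
using Base.Threads: nthreads

# more chunks than threads for load balancing; explicit init for the sequential parts
tmapreduce(√, +, 1:10_000; init = 0.0, nchunks = 4 * nthreads(), schedule = :dynamic)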

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
         [init],
         nchunks::Int = nthreads(),
         split::Symbol = :batch,
         schedule::Symbol =:dynamic,
-        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treduce.

Example:

    treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of sum([1, 2, 3, 4, 5]) in the form

    (1 + 2) + (3 + 4) + 5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...;
+        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treduce.

Example:

    treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of sum([1, 2, 3, 4, 5]) in the form

    (1 + 2) + (3 + 4) + 5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.
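As an illustrative example (not part of the original docstring), a uniform workload like a plain sum can use the static schedule:

using OhMyThreads: treduce

treduce(+, 1:100_000; schedule = :static)   # parallel version of sum(1:100_000)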

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...;
      nchunks::Int = nthreads(),
      split::Symbol = :batch,
-     schedule::Symbol =:dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tmap.

Example:

    tmap(sin, 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
+     schedule::Symbol =:dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tmap.

Example:

    tmap(sin, 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
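For instance (illustrative values, not part of the original docstring), providing the output element type and tuning nchunks could look like this:

using OhMyThreads: tmap
using Base.Threads: nthreads

tmap(sin, Float64, 1:1000; nchunks = 2 * nthreads())   # element type given explicitly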
source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
       nchunks::Int = nthreads(),
       split::Symbol = :batch,
-      schedule::Symbol =:dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmap!.

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
+      schedule::Symbol =:dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmap!.

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
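A small illustrative call (not part of the original docstring), writing into a pre-allocated output array:

using OhMyThreads: tmap!

out = zeros(10)
tmap!(sin, out, 1:10; schedule = :static)   # fills out[i] = sin(i) in parallel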
source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
          nchunks::Int = nthreads(),
          split::Symbol = :batch,
          schedule::Symbol =:dynamic) :: Nothing

A multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of

for x in A
     f(x)
 end

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tforeach.

Example:

    tforeach(1:10) do i
         println(i^2)
-    end

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
+    end

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
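As an illustrative use (not part of the original docstring), writing to disjoint indices of a pre-allocated array is safe because each index is touched by exactly one task:

using OhMyThreads: tforeach

squares = zeros(Int, 100)
tforeach(1:100; nchunks = 10) do i
    squares[i] = i^2   # disjoint indices, so no race between tasks
end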
source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
          nchunks::Int = nthreads(),
-         schedule::Symbol =:dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tcollect.

Example:

    tcollect(sin(i) for i in 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
+         schedule::Symbol =:dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tcollect.

Example:

    tcollect(sin(i) for i in 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
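An illustrative call (not part of the original docstring) that also provides the element type for the collected result:

using OhMyThreads: tcollect

tcollect(Float64, (sin(i) for i in 1:10); nchunks = 2)   # generator over an AbstractArray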
source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
            [init],
            nchunks::Int = nthreads(),
            split::Symbol = :batch,
            schedule::Symbol =:dynamic,
-           outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treducemap.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source

as well as the following re-exported functions:

chunks: see ChunkSplitters.jl

Non-Exported

OhMyThreads.@spawn: see StableTasks.jl
OhMyThreads.@spawnat: see StableTasks.jl
+ outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treducemap.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.
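The do-block convenience mentioned above means the reducing operator can be written as the do-block while the mapping function is passed as the second argument; an illustrative sketch (not part of the original docstring):

using OhMyThreads: treducemap

# op is supplied by the do-block, f (here abs2) is the second positional argument
treducemap(abs2, 1:1000) do a, b
    a + b   # must be (approximately) associative
end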

source

as well as the following re-exported functions:

chunks: see ChunkSplitters.jl

Non-Exported

OhMyThreads.@spawn: see StableTasks.jl
OhMyThreads.@spawnat: see StableTasks.jl
diff --git a/previews/PR38/refs/internal/index.html b/previews/PR38/refs/internal/index.html index 2bb6fe03..c28ab351 100644 --- a/previews/PR38/refs/internal/index.html +++ b/previews/PR38/refs/internal/index.html @@ -1,2 +1,2 @@ -Internal · OhMyThreads.jl
+Internal · OhMyThreads.jl
diff --git a/previews/PR38/search_index.js b/previews/PR38/search_index.js index 42994573..cbb8e39e 100644 --- a/previews/PR38/search_index.js +++ b/previews/PR38/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"examples/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. Hence, the computational workload is non-uniform.","category":"page"},{"location":"examples/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. 
We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 138.377 ms (0 allocations: 0 bytes)\n 63.707 ms (39 allocations: 3.30 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is faster. But can we improve the performance further?","category":"page"},{"location":"examples/juliaset/juliaset/#Tuning-nchunks","page":"Julia Set","title":"Tuning nchunks","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. 
This divides the overall workload into smaller tasks than can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 32.000 ms (12013 allocations: 1.14 MiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 63.439 ms (42 allocations: 3.37 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"examples/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at a unit square and count how many of them (M) landed inside of a unit circle. Approximate pi approx 4MN.","category":"page"},{"location":"examples/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.141517","category":"page"},{"location":"examples/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. 
For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N)\n M = tmapreduce(+, 1:N) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nmc_parallel(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14159924","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 5\n 318.467 ms (0 allocations: 0 bytes)\n 88.553 ms (37 allocations: 3.02 KiB)\n","category":"page"},{"location":"examples/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.1415844","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 63.825 ms (31 allocations: 2.80 KiB)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 
87.617 ms (0 allocations: 0 bytes)\n 63.398 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"The following is internal, i.e. not public API, and might change at any point.","category":"page"},{"location":"refs/internal/#Index","page":"Internal","title":"Index","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Pages = [\"internal.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"Returns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. This identifier will be unique so long as references to the task it came from still exist. \n\n\n\n\n\n","category":"method"},{"location":"refs/api/#API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Index","page":"Public API","title":"Index","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Pages = [\"api.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). 
If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tmapreduce.\n\nExample:\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n (√1 + √2) + (√3 + √4) + √5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). 
If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??treduce.\n\nExample:\n\n treduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum([1, 2, 3, 4, 5]) in the form\n\n (1 + 2) + (3 + 4) + 5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). 
This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nTo see the keyword argument options, check out ??tmap.\n\nExample:\n\n tmap(sin, 1:10)\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tmap!.\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. 
Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tforeach.\n\nExample:\n\n tforeach(1:10) do i\n println(i^2)\n end\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n nchunks::Int = nthreads(),\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.collect. 
Essentially just calls tmap on the generator function and inputs.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nTo see the keyword argument options, check out ??tcollect.\n\nExample:\n\n tcollect(sin(i) for i in 1:10)\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??treducemap.\n\nExample:\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n (√1 + √2) + (√3 + √4) + √5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. 
Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"as well as the following re-exported functions:","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nchunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl","category":"page"},{"location":"translation/#Translation-Guide","page":"Translation Guide","title":"Translation Guide","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"This page tries to give a general overview of how to translate patterns written with the built-in tools of Base.Threads using the OhMyThreads.jl API. 
Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into OhMyThreads.jl.","category":"page"},{"location":"translation/#Basics","page":"Translation Guide","title":"Basics","text":"","category":"section"},{"location":"translation/#@threads","page":"Translation Guide","title":"@threads","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10) do i\n println(i)\nend","category":"page"},{"location":"translation/#:static-scheduling","page":"Translation Guide","title":":static scheduling","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads :static for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; schedule=:static) do i\n println(i)\nend","category":"page"},{"location":"translation/#@spawn","page":"Translation Guide","title":"@spawn","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@sync for i in 1:10\n @spawn println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; nchunks=10) do i\n println(i)\nend","category":"page"},{"location":"translation/#Reduction","page":"Translation Guide","title":"Reduction","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"No built-in feature in Base.Threads.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads: basic manual implementation\ndata = rand(10)\nchunks_itr = Iterators.partition(data, length(data) ÷ nthreads())\ntasks = map(chunks_itr) do chunk\n @spawn reduce(+, chunk)\nend\nreduce(+, fetch.(tasks))","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ndata = rand(10)\ntreduce(+, data)","category":"page"},{"location":"translation/#Mutation","page":"Translation Guide","title":"Mutation","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). 
You should carefully consider whether this is necessary or whether the use of task-local storage is the better option.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = rand(10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = rand(10)\ntforeach(data) do i\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = rand(10)\ntmap!(data, data) do i # this kind of aliasing is fine\n calc(i)\nend","category":"page"},{"location":"translation/#Parallel-initialization","page":"Translation Guide","title":"Parallel initialization","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = Vector{Float64}(undef, 10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = tmap(i->calc(i), 1:10)","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = tcollect(calc(i) for i in 1:10)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"EditURL = \"tls.jl\"","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using OhMyThreads: TaskLocalValue, tmap, chunks\nusing LinearAlgebra: mul!, BLAS\nusing Base.Threads: nthreads, @spawn\n\nfunction matmulsums(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n map(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend\n\nfunction matmulsums_race(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n tmap(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend\n\nfunction matmulsums_naive(As, Bs)\n N = size(first(As), 1)\n tmap(As, Bs) do A, B\n C = Matrix{Float64}(undef, N, N)\n mul!(C, A, B)\n sum(C)\n end\nend\n\nfunction matmulsums_tls(As, Bs)\n N = size(first(As), 1)\n storage = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))\n tmap(As, Bs) do A, B\n C = storage[]\n mul!(C, A, B)\n sum(C)\n end\nend\n\nfunction matmulsums_manual(As, Bs)\n N = size(first(As), 1)\n tasks = map(chunks(As; n = nthreads())) do idcs\n @spawn begin\n local C = Matrix{Float64}(undef, N, N)\n local results = Vector{Float64}(undef, length(idcs))\n @inbounds for (i, idx) in enumerate(idcs)\n mul!(C, As[idx], Bs[idx])\n results[i] = sum(C)\n end\n results\n end\n end\n reduce(vcat, fetch.(tasks))\nend\n\nBLAS.set_num_threads(1) # to avoid potential oversubscription\n\nAs = [rand(1024, 1024) for _ in 1:64]\nBs = [rand(1024, 1024) for _ in 1:64]\n\nres = matmulsums(As, Bs)\nres_race = matmulsums_race(As, Bs)\nres_naive = matmulsums_naive(As, Bs)\nres_tls = matmulsums_tls(As, Bs)\nres_manual = matmulsums_manual(As, Bs)\n\nres ≈ res_race\nres ≈ res_naive\nres ≈ res_tls\nres ≈ res_manual\n\nusing BenchmarkTools\n\n@btime matmulsums($As, $Bs);\n@btime matmulsums_naive($As, $Bs);\n@btime matmulsums_tls($As, $Bs);\n@btime matmulsums_manual($As, $Bs);","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":" 3.107 s (3 allocations: 8.00 
MiB)\n 686.432 ms (174 allocations: 512.01 MiB)\n 792.403 ms (67 allocations: 40.01 MiB)\n 684.626 ms (51 allocations: 40.00 MiB)\n","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"This page was generated using Literate.jl.","category":"page"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.","category":"page"},{"location":"#Quick-Start","page":"OhMyThreads","title":"Quick Start","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Basic-example","page":"OhMyThreads","title":"Basic example","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"using OhMyThreads\n\nfunction mc_parallel(N; kw...)\n M = tmapreduce(+, 1:N; kw...) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\nmc_parallel(N) # gives, e.g., 3.14159924\n\nusing BenchmarkTools\n\n@show Threads.nthreads() # 5 in this example\n\n@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread\n@btime mc_parallel($N) # running with all 5 Julia threads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Timings might be something like this:","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"438.394 ms (7 allocations: 624 bytes)\n88.050 ms (37 allocations: 3.02 KiB)","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"(Check out the full Parallel Monte Carlo example if you like.)","category":"page"},{"location":"#No-Transducers","page":"OhMyThreads","title":"No Transducers","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. 
Check out the list of contributors for more information.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"EditURL = \"integration.jl\"","category":"page"},{"location":"examples/integration/integration/#Trapezoidal-Integration","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"In this example, we want to parallelize the computation of a simple numerical integral via the trapezoidal rule. The latter is given by","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"int_a^bf(x)dx approx h sum_i=1^Nfracf(x_i-1)+f(x_i)2","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The function to be integrated is the following.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f(x) = 4 * √(1 - x^2)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The analytic result of the definite integral (from 0 to 1) is known to be pi.","category":"page"},{"location":"examples/integration/integration/#Sequential","page":"Trapezoidal Integration","title":"Sequential","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Naturally, we implement the trapezoidal rule as a straightforward, sequential for loop.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"function trapezoidal(a, b, n; h = (b - a) / n)\n y = (f(a) + f(b)) / 2.0\n for i in 1:(n - 1)\n x = a + i * h\n y = y + f(x)\n end\n return y * h\nend","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Let's compute the integral of f above and see if we get the expected result. 
For simplicity, we choose N, the number of panels used to discretize the integration interval, as a multiple of the number of available Julia threads.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using Base.Threads: nthreads\n@show nthreads()\nN = nthreads() * 1_000_000","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"5000000","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Calling trapezoidal we do indeed find the (approximate) value of pi.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal(0, 1, N) ≈ π","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"examples/integration/integration/#Parallel","page":"Trapezoidal Integration","title":"Parallel","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Our strategy is the following: Divide the integration interval among the available Julia threads. On each thread, use the sequential trapezoidal rule to compute the partial integral. It is straightforward to implement this strategy with tmapreduce. The map part is, essentially, the application of trapezoidal and the reduction operator is chosen to be + to sum up the local integrals.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using OhMyThreads\n\nfunction trapezoidal_parallel(a, b, N)\n n = N ÷ nthreads()\n h = (b - a) / N\n return tmapreduce(+, 1:nthreads()) do i\n local α = a + (i - 1) * n * h\n local β = α + n * h\n trapezoidal(α, β, n; h)\n end\nend","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"First, we check the correctness of our parallel implementation.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel(0, 1, N) ≈ π","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Then, we benchmark and compare the performance of the sequential and parallel versions.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using BenchmarkTools\n@btime trapezoidal(0, 1, $N);\n@btime trapezoidal_parallel(0, 1, $N);","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":" 13.871 ms (0 allocations: 0 bytes)\n 2.781 ms (38 allocations: 3.19 
KiB)\n","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Because the problem is trivially parallel - all threads to the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"This page was generated using Literate.jl.","category":"page"}] +[{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"examples/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. 
Hence, the computational workload is non-uniform.","category":"page"},{"location":"examples/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 138.377 ms (0 allocations: 0 bytes)\n 63.707 ms (39 allocations: 3.30 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is faster. But can we improve the performance further?","category":"page"},{"location":"examples/juliaset/juliaset/#Tuning-nchunks","page":"Julia Set","title":"Tuning nchunks","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. 
The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks that can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 32.000 ms (12013 allocations: 1.14 MiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 63.439 ms (42 allocations: 3.37 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"examples/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at the square and count how many of them (M) landed inside of the unit circle. 
Approximate pi approx 4MN.","category":"page"},{"location":"examples/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.141517","category":"page"},{"location":"examples/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N)\n M = tmapreduce(+, 1:N) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nmc_parallel(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14159924","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 5\n 318.467 ms (0 allocations: 0 bytes)\n 88.553 ms (37 allocations: 3.02 KiB)\n","category":"page"},{"location":"examples/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. 
Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.1415844","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 63.825 ms (31 allocations: 2.80 KiB)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 87.617 ms (0 allocations: 0 bytes)\n 63.398 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"The following is internal, i.e. not public API, and might change at any point.","category":"page"},{"location":"refs/internal/#Index","page":"Internal","title":"Index","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Pages = [\"internal.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"Returns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. This identifier will be unique so long as references to the task it came from still exist. 
\n\n\n\n\n\n","category":"method"},{"location":"refs/api/#API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Index","page":"Public API","title":"Index","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Pages = [\"api.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tmapreduce.\n\nExample:\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n (√1 + √2) + (√3 + √4) + √5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. 
This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??treduce.\n\nExample:\n\n treduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum([1, 2, 3, 4, 5]) in the form\n\n (1 + 2) + (3 + 4) + 5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. 
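To make the keyword arguments of tmapreduce (and the identical ones of treduce) a bit more concrete, here is a hedged sketch; the data and the numbers chosen for nchunks are purely illustrative.

```julia
using OhMyThreads: tmapreduce, treduce
using Base.Threads: nthreads

data = rand(1_000_000)

# Default: contiguous chunks, one task per chunk, dynamic scheduling.
tmapreduce(abs2, +, data)

# More chunks than threads for load balancing, plus an explicit neutral element.
tmapreduce(abs2, +, data; nchunks = 4 * nthreads(), schedule = :dynamic, init = 0.0)

# treduce is the same machinery without the mapping function.
treduce(+, data; nchunks = 2 * nthreads())
```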
This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nTo see the keyword argument options, check out ??tmap.\n\nExample:\n\n tmap(sin, 1:10)\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tmap!.\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. 
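As a small sketch of the optional OutputElementType argument of tmap described above (the inputs are illustrative):

```julia
using OhMyThreads: tmap

# Element type inferred from the outputs
tmap(i -> i^2, 1:10)

# Element type asserted up front, which generally incurs fewer allocations
tmap(i -> i^2, Int, 1:10)

# The keyword options work just like for tmapreduce
tmap(i -> i^2, Int, 1:10; nchunks = 2, split = :batch)
```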
Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tforeach.\n\nExample:\n\n tforeach(1:10) do i\n println(i^2)\n end\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. 
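A minimal sketch of tmap! as documented above, writing into a pre-allocated output container:

```julia
using OhMyThreads: tmap!

A = collect(1:10)
out = similar(A)

tmap!(x -> 2x, out, A)  # out[i] = f(A[i]), filled by parallel tasks

out == 2 .* A           # true
```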
This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n nchunks::Int = nthreads(),\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nTo see the keyword argument options, check out ??tcollect.\n\nExample:\n\n tcollect(sin(i) for i in 1:10)\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??treducemap.\n\nExample:\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n (√1 + √2) + (√3 + √4) + √5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. 
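A quick sketch of tcollect as documented above, with and without the optional OutputElementType (the generator is illustrative):

```julia
using OhMyThreads: tcollect

tcollect(sin(i) for i in 1:10)                          # element type inferred
tcollect(Float64, (sin(i) for i in 1:10); nchunks = 2)  # element type asserted up front
```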
Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"as well as the following re-exported functions:","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nchunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl","category":"page"},{"location":"translation/#Translation-Guide","page":"Translation Guide","title":"Translation Guide","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"This page tries to give a general overview of how to translate patterns written with the built-in tools of Base.Threads using the OhMyThreads.jl API. 
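To illustrate the switched argument order of treducemap described above: positionally, op comes first and f second, and with do-block notation the block therefore supplies the reducing operator. A small sketch:

```julia
using OhMyThreads: treducemap, tmapreduce

# Positionally: op first, then f — the reverse of tmapreduce
treducemap(+, abs2, [1, 2, 3]) == tmapreduce(abs2, +, [1, 2, 3])  # both give 14

# With do-block notation, the block becomes the reducing operator `op`
treducemap(abs2, [1, 2, 3]) do a, b
    a + b
end  # 14 as well
```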
Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into OhMyThreads.jl.","category":"page"},{"location":"translation/#Basics","page":"Translation Guide","title":"Basics","text":"","category":"section"},{"location":"translation/#@threads","page":"Translation Guide","title":"@threads","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10) do i\n println(i)\nend","category":"page"},{"location":"translation/#:static-scheduling","page":"Translation Guide","title":":static scheduling","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads :static for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; schedule=:static) do i\n println(i)\nend","category":"page"},{"location":"translation/#@spawn","page":"Translation Guide","title":"@spawn","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@sync for i in 1:10\n @spawn println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; nchunks=10) do i\n println(i)\nend","category":"page"},{"location":"translation/#Reduction","page":"Translation Guide","title":"Reduction","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"No built-in feature in Base.Threads.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads: basic manual implementation\ndata = rand(10)\nchunks_itr = Iterators.partition(data, length(data) ÷ nthreads())\ntasks = map(chunks_itr) do chunk\n @spawn reduce(+, chunk)\nend\nreduce(+, fetch.(tasks))","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ndata = rand(10)\ntreduce(+, data)","category":"page"},{"location":"translation/#Mutation","page":"Translation Guide","title":"Mutation","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). 
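The reduction pattern above extends naturally to a map-reduce; a hedged sketch (with abs2 as an illustrative mapped function):

```julia
using Base.Threads: @spawn, nthreads
using OhMyThreads: tmapreduce

# Base.Threads: manual map-reduce over chunks
data = rand(100)
chunks_itr = Iterators.partition(data, cld(length(data), nthreads()))
tasks = map(chunks_itr) do chunk
    @spawn mapreduce(abs2, +, chunk)
end
reduce(+, fetch.(tasks))

# OhMyThreads
tmapreduce(abs2, +, data)
```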
You should carefully consider whether this is necessary or whether the use of task-local storage is the better option.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = rand(10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = rand(10)\ntforeach(data) do i\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = rand(10)\ntmap!(data, data) do i # this kind of aliasing is fine\n calc(i)\nend","category":"page"},{"location":"translation/#Parallel-initialization","page":"Translation Guide","title":"Parallel initialization","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = Vector{Float64}(undef, 10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = tmap(i->calc(i), 1:10)","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = tcollect(calc(i) for i in 1:10)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"EditURL = \"tls.jl\"","category":"page"},{"location":"examples/tls/tls/#Task-Local-Storage","page":"Task-Local Storage","title":"Task-Local Storage","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"For some programs, it can be useful or even necessary to allocate and (re-)use memory in your parallel code. The following section uses a simple example to explain how task-local values can be efficiently created and (re-)used.","category":"page"},{"location":"examples/tls/tls/#Sequential","page":"Task-Local Storage","title":"Sequential","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Let's say that we are given two arrays of (square) matrices, As and Bs, and let's further assume that our goal is to compute the total sum of all pairwise matrix products. We can readily implement a (sequential) function that performs the necessary computations.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using LinearAlgebra: mul!, BLAS\nBLAS.set_num_threads(1) # for simplicity, we turn of OpenBLAS multithreading\n\nfunction matmulsums(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n map(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"matmulsums (generic function with 1 method)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Here, we use map to perform the desired operation for each pair of matrices, A and B. However, the crucial point for our discussion is that we use the in-place matrix multiplication LinearAlgebra.mul! in conjunction with a pre-allocated output matrix C. This is to avoid the temporary allocation per \"iteration\" (i.e. 
per matrix pair) that we would get with C = A*B.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"For later comparison, we generate some random input data and store the result.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"As = [rand(1024, 1024) for _ in 1:64]\nBs = [rand(1024, 1024) for _ in 1:64]\n\nres = matmulsums(As, Bs);","category":"page"},{"location":"examples/tls/tls/#Parallelization","page":"Task-Local Storage","title":"Parallelization","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"The key idea for creating a parallel version of matmulsums is to replace the map by OhMyThreads' parallel tmap function. However, because we re-use C, this isn't entirely trivial.","category":"page"},{"location":"examples/tls/tls/#The-wrong-way","page":"Task-Local Storage","title":"The wrong way","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Someone new to parallel computing might be tempted to parallelize matmulsums like so:","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using OhMyThreads: tmap\n\nfunction matmulsums_race(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n tmap(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"matmulsums_race (generic function with 1 method)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Unfortunately, this doesn't produce the correct result.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"res_race = matmulsums_race(As, Bs)\nres ≈ res_race","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"false","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"In fact, It doesn't even always produce the same result (check for yourself)! The reason is that there is a race condition: different parallel tasks are trying to use the shared variable C simultaneously leading to non-deterministic behavior. 
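One quick way to see the non-determinism for yourself (a sketch assuming As, Bs, and res from above) is to run the racy version several times:

```julia
# Each run races on the shared C, so the sums typically differ between runs
# and from the sequential reference `res`.
runs = [matmulsums_race(As, Bs) for _ in 1:5]

any(r -> r ≈ res, runs)  # usually false
allequal(runs)           # often false as well: the output is non-deterministic
```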
Let's see how we can fix this.","category":"page"},{"location":"examples/tls/tls/#The-naive-(and-inefficient)-way","page":"Task-Local Storage","title":"The naive (and inefficient) way","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"A simple solution for the race condition issue above is to move the allocation of C into the body of the parallel tmap:","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"function matmulsums_naive(As, Bs)\n N = size(first(As), 1)\n tmap(As, Bs) do A, B\n C = Matrix{Float64}(undef, N, N)\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"matmulsums_naive (generic function with 1 method)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"In this case, a separate C will be allocated for each iteration such that parallel tasks don't modify shared state anymore. Hence, we'll get the desired result.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"res_naive = matmulsums_naive(As, Bs)\nres ≈ res_naive","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"true","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"However, this variant is obviously inefficient because it is no better than just writing C = A*B and thus leads to one allocation per matrix pair. We need a different way of allocating and re-using C for an efficient parallel version.","category":"page"},{"location":"examples/tls/tls/#The-right-way:-TaskLocalValue","page":"Task-Local Storage","title":"The right way: TaskLocalValue","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"We've seen that we can't allocate C once up-front (→ race condition) and also shouldn't allocate it within the tmap (→ one allocation per iteration). What we actually want is to once allocate a separate C on each parallel task and then re-use this task-local C for all iterations (i.e. 
matrix pairs) that said task is responsible for.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"The way to express this idea is TaskLocalValue and looks like this:","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using OhMyThreads: TaskLocalValue\n\nfunction matmulsums_tls(As, Bs)\n N = size(first(As), 1)\n tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))\n tmap(As, Bs) do A, B\n C = tls[]\n mul!(C, A, B)\n sum(C)\n end\nend\n\nres_tls = matmulsums_tls(As, Bs)\nres ≈ res_tls","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"true","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Here, TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) defines a task-local storage tls that behaves like this: The first time the storage is accessed (tls[]) from a task a task-local value is created according to the anonymous function (here, the task-local value will be a matrix) and stored in the storage. Afterwards, every other storage query from the same task(!) will simply return the task-local value. Hence, this is precisely what we need and will only lead to O(# parallel tasks) allocations.","category":"page"},{"location":"examples/tls/tls/#The-performant-but-cumbersome-way","page":"Task-Local Storage","title":"The performant but cumbersome way","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Before we benchmark and compare the performance of all discussed variants, let's implement the idea of a task-local C for each parallel task manually.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using OhMyThreads: chunks, @spawn\nusing Base.Threads: nthreads\n\nfunction matmulsums_manual(As, Bs)\n N = size(first(As), 1)\n tasks = map(chunks(As; n = nthreads())) do idcs\n @spawn begin\n local C = Matrix{Float64}(undef, N, N)\n local results = Vector{Float64}(undef, length(idcs))\n for (i, idx) in enumerate(idcs)\n mul!(C, As[idx], Bs[idx])\n results[i] = sum(C)\n end\n results\n end\n end\n reduce(vcat, fetch.(tasks))\nend\n\nres_manual = matmulsums_manual(As, Bs)\nres ≈ res_manual","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"true","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"The first thing to note is pretty obvious: This is very cumbersome and you probably don't want to write it. But let's take a closer look and see what's happening here. First, we divide the number of matrix pairs into nthreads() chunks. Then, for each of those chunks, we spawn a parallel task that (1) allocates a task-local C matrix (and a results vector) and (2) performs the actual computations using these pre-allocated values. 
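To make the "initialize once per task, reuse per iteration" behaviour of TaskLocalValue concrete, here is a small standalone sketch (independent of the matmul example) that counts how often the initializer actually runs:

```julia
using OhMyThreads: TaskLocalValue, tmap
using Base.Threads: Atomic, atomic_add!, nthreads

init_count = Atomic{Int}(0)
make_buffer() = (atomic_add!(init_count, 1); zeros(3))  # counts every initialization
tls = TaskLocalValue{Vector{Float64}}(make_buffer)

tmap(1:1_000) do i
    buf = tls[]  # first access per task initializes, later accesses reuse
    sum(buf) + i
end

init_count[]  # number of parallel tasks (default nchunks = nthreads()), not 1_000
```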
Finally, we fetch the results of the tasks and combine them.","category":"page"},{"location":"examples/tls/tls/#Benchmark","page":"Task-Local Storage","title":"Benchmark","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"The whole point of parallelization is increasing performance, so let's benchmark and compare the performance of the variants discussed above.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using BenchmarkTools\n\n@btime matmulsums($As, $Bs);\n@btime matmulsums_naive($As, $Bs);\n@btime matmulsums_tls($As, $Bs);\n@btime matmulsums_manual($As, $Bs);","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":" 2.916 s (3 allocations: 8.00 MiB)\n 597.915 ms (174 allocations: 512.01 MiB)\n 575.507 ms (67 allocations: 40.01 MiB)\n 572.501 ms (49 allocations: 40.00 MiB)\n","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"As we see, the recommened version matmulsums_tls is both convenient as well as efficient: It allocates much less memory than matmulsums_naive and only slightly more than the manual implementation.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"This page was generated using Literate.jl.","category":"page"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.","category":"page"},{"location":"#Quick-Start","page":"OhMyThreads","title":"Quick Start","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Basic-example","page":"OhMyThreads","title":"Basic example","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"using OhMyThreads\n\nfunction mc_parallel(N; kw...)\n M = tmapreduce(+, 1:N; kw...) 
do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\nmc_parallel(N) # gives, e.g., 3.14159924\n\nusing BenchmarkTools\n\n@show Threads.nthreads() # 5 in this example\n\n@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread\n@btime mc_parallel($N) # running with all 5 Julia threads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Timings might be something like this:","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"438.394 ms (7 allocations: 624 bytes)\n88.050 ms (37 allocations: 3.02 KiB)","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"(Check out the full Parallel Monte Carlo example if you like.)","category":"page"},{"location":"#No-Transducers","page":"OhMyThreads","title":"No Transducers","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"EditURL = \"integration.jl\"","category":"page"},{"location":"examples/integration/integration/#Trapezoidal-Integration","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"In this example, we want to parallelize the computation of a simple numerical integral via the trapezoidal rule. 
The latter is given by","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"int_a^bf(x)dx approx h sum_i=1^Nfracf(x_i-1)+f(x_i)2","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The function to be integrated is the following.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f(x) = 4 * √(1 - x^2)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The analytic result of the definite integral (from 0 to 1) is known to be pi.","category":"page"},{"location":"examples/integration/integration/#Sequential","page":"Trapezoidal Integration","title":"Sequential","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Naturally, we implement the trapezoidal rule as a straightforward, sequential for loop.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"function trapezoidal(a, b, n; h = (b - a) / n)\n y = (f(a) + f(b)) / 2.0\n for i in 1:(n - 1)\n x = a + i * h\n y = y + f(x)\n end\n return y * h\nend","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Let's compute the integral of f above and see if we get the expected result. For simplicity, we choose N, the number of panels used to discretize the integration interval, as a multiple of the number of available Julia threads.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using Base.Threads: nthreads\n@show nthreads()\nN = nthreads() * 1_000_000","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"5000000","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Calling trapezoidal we do indeed find the (approximate) value of pi.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal(0, 1, N) ≈ π","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"examples/integration/integration/#Parallel","page":"Trapezoidal Integration","title":"Parallel","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Our strategy is the following: Divide the integration interval among the available Julia threads. 
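Before writing the parallel version, we can check the decomposition idea sequentially (a sketch using f, trapezoidal, and N from above): summing the rule over per-thread sub-intervals reproduces the full-interval result up to rounding.

```julia
using Base.Threads: nthreads

n = N ÷ nthreads()
h = 1 / N
partial = sum(1:nthreads()) do i
    α = (i - 1) * n * h
    trapezoidal(α, α + n * h, n; h)
end

partial ≈ trapezoidal(0, 1, N)  # true
```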
On each thread, use the sequential trapezoidal rule to compute the partial integral. It is straightforward to implement this strategy with tmapreduce. The map part is, essentially, the application of trapezoidal and the reduction operator is chosen to be + to sum up the local integrals.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using OhMyThreads\n\nfunction trapezoidal_parallel(a, b, N)\n n = N ÷ nthreads()\n h = (b - a) / N\n return tmapreduce(+, 1:nthreads()) do i\n local α = a + (i - 1) * n * h\n local β = α + n * h\n trapezoidal(α, β, n; h)\n end\nend","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"First, we check the correctness of our parallel implementation.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel(0, 1, N) ≈ π","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Then, we benchmark and compare the performance of the sequential and parallel versions.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using BenchmarkTools\n@btime trapezoidal(0, 1, $N);\n@btime trapezoidal_parallel(0, 1, $N);","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":" 13.871 ms (0 allocations: 0 bytes)\n 2.781 ms (38 allocations: 3.19 KiB)\n","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Because the problem is trivially parallel - all threads to the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"This page was generated using Literate.jl.","category":"page"}] } diff --git a/previews/PR38/translation/index.html b/previews/PR38/translation/index.html index 7b8604e7..e7f9d12e 100644 --- a/previews/PR38/translation/index.html +++ b/previews/PR38/translation/index.html @@ -43,4 +43,4 @@ data[i] = calc(i) end
# OhMyThreads: Variant 1
 data = tmap(i->calc(i), 1:10)
# OhMyThreads: Variant 2
-data = tcollect(calc(i) for i in 1:10)
+data = tcollect(calc(i) for i in 1:10)