diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 19671b6..ed8de02 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-09-25T16:52:01","documenter_version":"1.7.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-09-26T18:46:29","documenter_version":"1.7.0"}} \ No newline at end of file diff --git a/dev/basics/index.html b/dev/basics/index.html index 70fb2bb..bfd861d 100644 --- a/dev/basics/index.html +++ b/dev/basics/index.html @@ -1,2 +1,2 @@ -Basics · OhMyThreads.jl

Basics

This section is still in preparation. For now, you might want to take a look at the translation guide and the examples.

+Basics · OhMyThreads.jl

Basics

This section is still in preparation. For now, you might want to take a look at the translation guide and the examples.

diff --git a/dev/index.html b/dev/index.html index e56f13a..75546ab 100644 --- a/dev/index.html +++ b/dev/index.html @@ -32,4 +32,4 @@ @btime mc_parallel($N) # using all threads @btime mc_parallel_macro($N) # using all threads

With 5 threads, timings might be something like this:

417.282 ms (14 allocations: 912 bytes)
 83.578 ms (38 allocations: 3.08 KiB)
-83.573 ms (38 allocations: 3.08 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

+83.573 ms (38 allocations: 3.08 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

diff --git a/dev/literate/falsesharing/falsesharing.jl b/dev/literate/falsesharing/falsesharing.jl index 6c39592..13caa44 100644 --- a/dev/literate/falsesharing/falsesharing.jl +++ b/dev/literate/falsesharing/falsesharing.jl @@ -30,11 +30,11 @@ data = rand(1_000_000 * nthreads()); # # A common, manual implementation of this idea might look like this: -using OhMyThreads: @spawn, chunks +using OhMyThreads: @spawn, index_chunks function parallel_sum_falsesharing(data; nchunks = nthreads()) psums = zeros(eltype(data), nchunks) - @sync for (c, idcs) in enumerate(chunks(data; n = nchunks)) + @sync for (c, idcs) in enumerate(index_chunks(data; n = nchunks)) @spawn begin for i in idcs psums[c] += data[i] @@ -102,7 +102,7 @@ nthreads() function parallel_sum_tasklocal(data; nchunks = nthreads()) psums = zeros(eltype(data), nchunks) - @sync for (c, idcs) in enumerate(chunks(data; n = nchunks)) + @sync for (c, idcs) in enumerate(index_chunks(data; n = nchunks)) @spawn begin local s = zero(eltype(data)) for i in idcs @@ -131,7 +131,7 @@ end # using `map` and reusing the built-in (sequential) `sum` function on each parallel task: function parallel_sum_map(data; nchunks = nthreads()) - ts = map(chunks(data, n = nchunks)) do idcs + ts = map(index_chunks(data, n = nchunks)) do idcs @spawn @views sum(data[idcs]) end return sum(fetch.(ts)) diff --git a/dev/literate/falsesharing/falsesharing/index.html b/dev/literate/falsesharing/falsesharing/index.html index e2e2518..655bbf4 100644 --- a/dev/literate/falsesharing/falsesharing/index.html +++ b/dev/literate/falsesharing/falsesharing/index.html @@ -4,11 +4,11 @@ data = rand(1_000_000 * nthreads()); @btime sum($data);
  2.327 ms (0 allocations: 0 bytes)
-

The problematic parallel implementation

A conceptually simple (and valid) approach to parallelizing the summation is to divide the full computation into parts. Specifically, the idea is to divide the data into chunks, compute the partial sums of these chunks in parallel, and finally sum up the partial results. (Note that we will not concern ourselves with potential minor or catastrophic numerical errors due to potential rearrangements of terms in the summation here.)

A common, manual implementation of this idea might look like this:

using OhMyThreads: @spawn, chunks
+

The problematic parallel implementation

A conceptually simple (and valid) approach to parallelizing the summation is to divide the full computation into parts. Specifically, the idea is to divide the data into chunks, compute the partial sums of these chunks in parallel, and finally sum up the partial results. (Note that we will not concern ourselves with potential minor or catastrophic numerical errors due to potential rearrangements of terms in the summation here.)

A common, manual implementation of this idea might look like this:

using OhMyThreads: @spawn, index_chunks
 
 function parallel_sum_falsesharing(data; nchunks = nthreads())
     psums = zeros(eltype(data), nchunks)
-    @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))
+    @sync for (c, idcs) in enumerate(index_chunks(data; n = nchunks))
         @spawn begin
             for i in idcs
                 psums[c] += data[i]
@@ -20,7 +20,7 @@
 @test sum(data) ≈ parallel_sum_falsesharing(data)
Test Passed

This is just a reflection of the fact that there is no logical sharing of data - because each parallel task modifies a different element of psums - implying the absence of race conditions.

What's the issue then?! Well, the sole purpose of parallelization is to reduce runtime. So let's see how well we're doing in this respect.

nthreads()
10
@btime parallel_sum_falsesharing($data);
  52.919 ms (221 allocations: 18.47 KiB)
 

A (huge) slowdown?! Clearly, that's the opposite of what we tried to achieve!

The issue: False sharing

Although our parallel summation above is semantically correct, it has a big performance issue: False sharing. To understand false sharing, we have to think a little bit about how computers work. Specifically, we need to realize that processors cache memory in lines (rather than individual elements) and that caches of different processors are kept coherent. When two (or more) different CPU cores operate on independent data elements that fall into the same cache line (i.e. they are part of the same memory address region) the cache coherency mechanism leads to costly synchronization between cores.

In our case, this happens despite the fact that different parallel tasks (on different CPU cores) logically don't care about the rest of the data in the cache line at all.

Given these insights, we can come up with a few workarounds that mitigate the issue. The most prominent is probably padding, where one simply adds sufficiently many unused zeros to psums such that different partial sum counters don't fall into the same cache line. However, let's discuss a more fundamental, more efficient, and more elegant solution.
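
For illustration only, a padded variant might look like the following sketch (the helper name and the padding factor of 8 are assumptions, not part of the original example; the factor assumes 64-byte cache lines and Float64 elements):

using OhMyThreads: @spawn, index_chunks

function parallel_sum_padded(data; nchunks = nthreads(), pad = 8)
    psums = zeros(eltype(data), pad * nchunks) # only every pad-th slot is used
    @sync for (c, idcs) in enumerate(index_chunks(data; n = nchunks))
        @spawn begin
            for i in idcs
                psums[pad * (c - 1) + 1] += data[i] # counters sit at least one cache line apart
            end
        end
    end
    return sum(psums)
end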

Task-local parallel summation

The key mistake in parallel_sum_falsesharing above is the frequent, non-local modification of (implicitly) shared state (cache lines of psums) in the innermost loop. We can simply avoid this by making the code more task-local. To this end, we introduce a task-local accumulator variable, which we use to perform the task-local partial sums. Only at the very end do we communicate the result to the main thread, e.g. by writing it into psums (once!).

function parallel_sum_tasklocal(data; nchunks = nthreads())
     psums = zeros(eltype(data), nchunks)
-    @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))
+    @sync for (c, idcs) in enumerate(index_chunks(data; n = nchunks))
         @spawn begin
             local s = zero(eltype(data))
             for i in idcs
@@ -35,7 +35,7 @@
 @test sum(data) ≈ parallel_sum_tasklocal(data)
 @btime parallel_sum_tasklocal($data);
  1.120 ms (221 allocations: 18.55 KiB)
 

Finally, there is a speed up! 🎉

Two comments are in order.

First, we note that the only role that psums plays is as temporary storage for the results from the parallel tasks so that we can eventually sum them up. We could get rid of it entirely by using a Threads.Atomic instead, which would get updated via Threads.atomic_add! from each task directly. However, for our discussion, this is a detail and we won't discuss it further.
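
A minimal sketch of that alternative might look as follows (hypothetical helper name; each task still accumulates locally and performs only a single atomic update at the end):

using OhMyThreads: @spawn, index_chunks
using Base.Threads: Atomic, atomic_add!

function parallel_sum_atomic(data; nchunks = nthreads())
    total = Atomic{eltype(data)}(zero(eltype(data)))
    @sync for idcs in index_chunks(data; n = nchunks)
        @spawn begin
            local s = zero(eltype(data))
            for i in idcs
                s += data[i]
            end
            atomic_add!(total, s) # one atomic update per task
        end
    end
    return total[]
end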

Secondly, while keeping the general idea, we can drastically simplify the above code by using map and reusing the built-in (sequential) sum function on each parallel task:

function parallel_sum_map(data; nchunks = nthreads())
-    ts = map(chunks(data, n = nchunks)) do idcs
+    ts = map(index_chunks(data, n = nchunks)) do idcs
         @spawn @views sum(data[idcs])
     end
     return sum(fetch.(ts))
@@ -47,4 +47,4 @@
 
 @test sum(data) ≈ treduce(+, data; ntasks = nthreads())
 @btime treduce($+, $data; ntasks = $nthreads());
  899.097 μs (68 allocations: 5.92 KiB)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/dev/literate/integration/integration/index.html b/dev/literate/integration/integration/index.html index 5a16529..c22e0d4 100644 --- a/dev/literate/integration/integration/index.html +++ b/dev/literate/integration/integration/index.html @@ -35,4 +35,4 @@ @btime trapezoidal(0, 1, $N); @btime trapezoidal_parallel(0, 1, $N);
  24.348 ms (0 allocations: 0 bytes)
   2.457 ms (69 allocations: 6.05 KiB)
-

Because the problem is trivially parallel - all threads do the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.

nthreads()
10

This page was generated using Literate.jl.

+

Because the problem is trivially parallel - all threads do the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.

nthreads()
10

This page was generated using Literate.jl.

diff --git a/dev/literate/juliaset/juliaset/index.html b/dev/literate/juliaset/juliaset/index.html index 38f6f84..6a615bc 100644 --- a/dev/literate/juliaset/juliaset/index.html +++ b/dev/literate/juliaset/juliaset/index.html @@ -71,4 +71,4 @@

Note that while this turns out to be a bit faster, it comes at the expense of many more allocations.

To quantify the impact of load balancing we can opt out of dynamic scheduling and use the StaticScheduler instead. The latter doesn't provide any form of load balancing.

using OhMyThreads: StaticScheduler
 
 @btime compute_juliaset_parallel!($img; scheduler=:static) samples=10 evals=3;
  30.097 ms (73 allocations: 6.23 KiB)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/dev/literate/mc/mc.jl b/dev/literate/mc/mc.jl index 6a9abd3..4ef3381 100644 --- a/dev/literate/mc/mc.jl +++ b/dev/literate/mc/mc.jl @@ -79,15 +79,15 @@ using OhMyThreads: StaticScheduler # ## Manual parallelization # -# First, using the `chunks` function, we divide the iteration interval `1:N` into +# First, using the `index_chunks` function, we divide the iteration interval `1:N` into # `nthreads()` parts. Then, we apply a regular (sequential) `map` to spawn a Julia task # per chunk. Each task will locally and independently perform a sequential Monte Carlo # simulation. Finally, we fetch the results and compute the average estimate for $\pi$. -using OhMyThreads: @spawn, chunks +using OhMyThreads: @spawn, index_chunks function mc_parallel_manual(N; nchunks = nthreads()) - tasks = map(chunks(1:N; n = nchunks)) do idcs + tasks = map(index_chunks(1:N; n = nchunks)) do idcs @spawn mc(length(idcs)) end pi = sum(fetch, tasks) / nchunks @@ -104,7 +104,7 @@ mc_parallel_manual(N) # `mc(length(idcs))` is faster than the implicit task-local computation within # `tmapreduce` (which itself is a `mapreduce`). -idcs = first(chunks(1:N; n = nthreads())) +idcs = first(index_chunks(1:N; n = nthreads())) @btime mapreduce($+, $idcs) do i rand()^2 + rand()^2 < 1.0 diff --git a/dev/literate/mc/mc/index.html b/dev/literate/mc/mc/index.html index 4fa1ccb..899888e 100644 --- a/dev/literate/mc/mc/index.html +++ b/dev/literate/mc/mc/index.html @@ -48,10 +48,10 @@ @btime mc_parallel($N; scheduler=:dynamic) samples=10 evals=3; # default @btime mc_parallel($N; scheduler=:static) samples=10 evals=3;
  41.839 ms (68 allocations: 5.81 KiB)
   41.838 ms (68 allocations: 5.81 KiB)
-

Manual parallelization

First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for $\pi$.

using OhMyThreads: @spawn, chunks
+

Manual parallelization

First, using the index_chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for $\pi$.

using OhMyThreads: @spawn, index_chunks
 
 function mc_parallel_manual(N; nchunks = nthreads())
-    tasks = map(chunks(1:N; n = nchunks)) do idcs
+    tasks = map(index_chunks(1:N; n = nchunks)) do idcs
         @spawn mc(length(idcs))
     end
     pi = sum(fetch, tasks) / nchunks
@@ -59,7 +59,7 @@
 end
 
 mc_parallel_manual(N)
3.14180504

And this is the performance:

@btime mc_parallel_manual($N) samples=10 evals=3;
  30.224 ms (65 allocations: 5.70 KiB)
-

It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).

idcs = first(chunks(1:N; n = nthreads()))
+

It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).

idcs = first(index_chunks(1:N; n = nthreads()))
 
 @btime mapreduce($+, $idcs) do i
     rand()^2 + rand()^2 < 1.0
@@ -67,4 +67,4 @@
 
 @btime mc($(length(idcs))) samples=10 evals=3;
  41.750 ms (0 allocations: 0 bytes)
   30.148 ms (0 allocations: 0 bytes)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/dev/literate/tls/tls.jl b/dev/literate/tls/tls.jl index 20c77ad..369776e 100644 --- a/dev/literate/tls/tls.jl +++ b/dev/literate/tls/tls.jl @@ -102,12 +102,12 @@ res ≈ res_naive # iterations (i.e. matrix pairs) for which this task is responsible. # Before we learn how to do this more conveniently, let's implement this idea of a # task-local temporary buffer (for each parallel task) manually. -using OhMyThreads: chunks, @spawn +using OhMyThreads: index_chunks, @spawn using Base.Threads: nthreads function matmulsums_manual(As, Bs) N = size(first(As), 1) - tasks = map(chunks(As; n = 2 * nthreads())) do idcs + tasks = map(index_chunks(As; n = 2 * nthreads())) do idcs @spawn begin local C = Matrix{Float64}(undef, N, N) map(idcs) do i diff --git a/dev/literate/tls/tls/index.html b/dev/literate/tls/tls/index.html index 96940b9..761d331 100644 --- a/dev/literate/tls/tls/index.html +++ b/dev/literate/tls/tls/index.html @@ -30,12 +30,12 @@ sum(C) end end
matmulsums_naive (generic function with 1 method)

In this case, a separate C will be allocated for each iteration such that parallel tasks no longer mutate shared state. Hence, we'll get the desired result.

res_naive = matmulsums_naive(As, Bs)
-res ≈ res_naive
true

However, this variant is obviously inefficient because it is no better than just writing C = A*B and thus leads to one allocation per matrix pair. We need a different way of allocating and re-using C for an efficient parallel version.

Task-local storage

The manual (and cumbersome) way

We've seen that we can't allocate C once up-front (→ race condition) and also shouldn't allocate it within the tmap (→ one allocation per iteration). Instead, we can assign a separate "C" on each parallel task once and then use this task-local "C" for all iterations (i.e. matrix pairs) for which this task is responsible. Before we learn how to do this more conveniently, let's implement this idea of a task-local temporary buffer (for each parallel task) manually.

using OhMyThreads: chunks, @spawn
+res ≈ res_naive
true

However, this variant is obviously inefficient because it is no better than just writing C = A*B and thus leads to one allocation per matrix pair. We need a different way of allocating and re-using C for an efficient parallel version.

Task-local storage

The manual (and cumbersome) way

We've seen that we can't allocate C once up-front (→ race condition) and also shouldn't allocate it within the tmap (→ one allocation per iteration). Instead, we can assign a separate "C" on each parallel task once and then use this task-local "C" for all iterations (i.e. matrix pairs) for which this task is responsible. Before we learn how to do this more conveniently, let's implement this idea of a task-local temporary buffer (for each parallel task) manually.

using OhMyThreads: index_chunks, @spawn
 using Base.Threads: nthreads
 
 function matmulsums_manual(As, Bs)
     N = size(first(As), 1)
-    tasks = map(chunks(As; n = 2 * nthreads())) do idcs
+    tasks = map(index_chunks(As; n = 2 * nthreads())) do idcs
         @spawn begin
             local C = Matrix{Float64}(undef, N, N)
             map(idcs) do i
@@ -194,4 +194,4 @@
 sort(res) ≈ sort(res_bumper)
 
 @btime matmulsums_bumper($As, $Bs);
  7.814 ms (134 allocations: 27.92 KiB)
-

Note that the benchmark is lying here about the total memory allocation, because it doesn't show the allocation of the task-local bump allocators themselves (the reason is that SlabBuffer uses malloc directly).


This page was generated using Literate.jl.

+

Note that the benchmark is lying here about the total memory allocation, because it doesn't show the allocation of the task-local bump allocators themselves (the reason is that SlabBuffer uses malloc directly).


This page was generated using Literate.jl.

diff --git a/dev/objects.inv b/dev/objects.inv index 6273ca8..95c7931 100644 Binary files a/dev/objects.inv and b/dev/objects.inv differ diff --git a/dev/refs/api/index.html b/dev/refs/api/index.html index 83c39f7..a40931c 100644 --- a/dev/refs/api/index.html +++ b/dev/refs/api/index.html @@ -1,5 +1,5 @@ -Public API · OhMyThreads.jl

Public API

Exported

Macros

OhMyThreads.@tasksMacro
@tasks for ... end

A macro to parallelize a for loop by spawning a set of tasks that can be run in parallel. The policy of how many tasks to spawn and how to distribute the iteration space among the tasks (and more) can be configured via @set statements in the loop body.

Supports reductions (@set reducer=<reducer function>) and collecting the results (@set collect=true).

Under the hood, the for loop is translated into corresponding parallel tforeach, tmapreduce, or tmap calls.

See also: @set, @local

Examples

using OhMyThreads: @tasks
@tasks for i in 1:3
+Public API · OhMyThreads.jl

Public API

Exported

Macros

OhMyThreads.@tasksMacro
@tasks for ... end

A macro to parallelize a for loop by spawning a set of tasks that can be run in parallel. The policy of how many tasks to spawn and how to distribute the iteration space among the tasks (and more) can be configured via @set statements in the loop body.

Supports reductions (@set reducer=<reducer function>) and collecting the results (@set collect=true).

Under the hood, the for loop is translated into corresponding parallel tforeach, tmapreduce, or tmap calls.

See also: @set, @local

Examples

using OhMyThreads: @tasks
@tasks for i in 1:3
     println(i)
 end
@tasks for x in rand(10)
     @set reducer=+
@@ -19,7 +19,7 @@
         chunksize=10
     end
     println("i=", i, " → ", threadid())
-end
source
OhMyThreads.@setMacro
@set name = value

This can be used inside a @tasks for ... end block to specify settings for the parallel execution of the loop.

Multiple settings are supported, either as separate @set statements or via @set begin ... end.

Settings

  • reducer (e.g. reducer=+): Indicates that a reduction should be performed with the provided binary function. See tmapreduce for more information.
  • collect (e.g. collect=true): Indicates that results should be collected (similar to map).

All other settings will be passed on to the underlying parallel functions (e.g. tmapreduce) as keyword arguments. Hence, you may provide whatever these functions accept as keyword arguments. Among others, this includes

  • scheduler (e.g. scheduler=:static): Can be either a Scheduler or a Symbol (e.g. :dynamic, :static, :serial, or :greedy).
  • init (e.g. init=0.0): Initial value to be used in a reduction (requires reducer=...).

Settings like ntasks, chunksize, and split can be used to tune the scheduling policy (if the selected scheduler supports it).

source
OhMyThreads.@setMacro
@set name = value

This can be used inside a @tasks for ... end block to specify settings for the parallel execution of the loop.

Multiple settings are supported, either as separate @set statements or via @set begin ... end.

Settings

  • reducer (e.g. reducer=+): Indicates that a reduction should be performed with the provided binary function. See tmapreduce for more information.
  • collect (e.g. collect=true): Indicates that results should be collected (similar to map).

All other settings will be passed on to the underlying parallel functions (e.g. tmapreduce) as keyword arguments. Hence, you may provide whatever these functions accept as keyword arguments. Among others, this includes

  • scheduler (e.g. scheduler=:static): Can be either a Scheduler or a Symbol (e.g. :dynamic, :static, :serial, or :greedy).
  • init (e.g. init=0.0): Initial value to be used in a reduction (requires reducer=...).

Settings like ntasks, chunksize, and split can be used to tune the scheduling policy (if the selected scheduler supports it).
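
As a usage sketch (not part of the original docstring; the concrete values are arbitrary), several settings can be combined via @set begin ... end:

using OhMyThreads: @tasks

@tasks for i in 1:100
    @set begin
        reducer = +
        scheduler = :static
        ntasks = 4
    end
    sin(i)
end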

source
OhMyThreads.@localMacro
@local name = value
 
 @local name::T = value

Can be used inside a @tasks for ... end block to specify task-local values (TLV) via explicitly typed assignments. These values will be allocated once per task (rather than once per iteration) and can be re-used between different task-local iterations.

There can only be a single @local block in a @tasks for ... end block. To specify multiple TLVs, use @local begin ... end. Compared to regular assignments, there are some limitations though, e.g. TLVs can't reference each other.

Examples

using OhMyThreads: @tasks
 using OhMyThreads.Tools: taskid
@@ -42,7 +42,7 @@
 end

Task local variables created by @local are by default constrained to their inferred type, but if you need to, you can specify a different type during declaration:

@tasks for i in 1:10
     @local x::Vector{Float64} = some_hard_to_infer_setup_function()
     # ...
-end
source
OhMyThreads.@only_oneMacro
@only_one begin ... end

This can be used inside a @tasks for ... end block to mark a region of code to be executed by only one of the parallel tasks (all other tasks skip over this region).

Example

using OhMyThreads: @tasks
+end
source
OhMyThreads.@only_oneMacro
@only_one begin ... end

This can be used inside a @tasks for ... end block to mark a region of code to be executed by only one of the parallel tasks (all other tasks skip over this region).

Example

using OhMyThreads: @tasks
 
 @tasks for i in 1:10
     @set ntasks = 10
@@ -53,7 +53,7 @@
         sleep(1)
     end
     println(i, ": after")
-end
source
OhMyThreads.@one_by_oneMacro
@one_by_one begin ... end

This can be used inside a @tasks for ... end block to mark a region of code to be executed by one parallel task at a time (i.e. exclusive access). The order may be arbitrary and non-deterministic.

Example

using OhMyThreads: @tasks
+end
source
OhMyThreads.@one_by_oneMacro
@one_by_one begin ... end

This can be used inside a @tasks for ... end block to mark a region of code to be executed by one parallel task at a time (i.e. exclusive access). The order may be arbitrary and non-deterministic.

Example

using OhMyThreads: @tasks
 
 @tasks for i in 1:10
     @set ntasks = 10
@@ -64,21 +64,21 @@
         sleep(0.5)
     end
     println(i, ": after")
-end
source

Functions

Functions

OhMyThreads.tmapreduceFunction
tmapreduce(f, op, A::AbstractArray...;
            [scheduler::Union{Scheduler, Symbol} = :dynamic],
            [outputtype::Type = Any],
            [init])

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

Example:

using OhMyThreads: tmapreduce
 
-tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

(√1 + √2) + (√3 + √4) + √5

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.
  • outputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.
  • init: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.

In addition, tmapreduce accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

tmapreduce(√, +, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
+tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

(√1 + √2) + (√3 + √4) + √5

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.
  • outputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.
  • init: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.

In addition, tmapreduce accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

tmapreduce(√, +, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
         [scheduler::Union{Scheduler, Symbol} = :dynamic],
         [outputtype::Type = Any],
         [init])

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

Example:

using OhMyThreads: treduce
 
-treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of sum([1, 2, 3, 4, 5]) in the form

(1 + 2) + (3 + 4) + 5

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.
  • outputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.
  • init: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.

In addition, treduce accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

treduce(+, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...;
+treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of sum([1, 2, 3, 4, 5]) in the form

(1 + 2) + (3 + 4) + 5

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.
  • outputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.
  • init: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.

In addition, treduce accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

treduce(+, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...;
      [scheduler::Union{Scheduler, Symbol} = :dynamic])

A multithreaded function like Base.map. Creates a new container similar to A and fills it in parallel such that the ith element is equal to f(A[i]).

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Example:

using OhMyThreads: tmap
 
-tmap(sin, 1:10)

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.

In addition, tmap accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

tmap(sin, 1:10; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
-      [scheduler::Union{Scheduler, Symbol} = :dynamic])

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns out[i] = f(A[i]) for each index i of A and out.

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.

In addition, tmap! accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
+tmap(sin, 1:10)

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.

In addition, tmap accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

tmap(sin, 1:10; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
+      [scheduler::Union{Scheduler, Symbol} = :dynamic])

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns out[i] = f(A[i]) for each index i of A and out.

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.

In addition, tmap! accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).
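
A usage sketch (not part of the original docstring):

using OhMyThreads: tmap!

A = collect(1:10)
out = similar(A, Float64)
tmap!(sin, out, A) # fills out with sin.(A) in parallel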

source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
          [scheduler::Union{Scheduler, Symbol} = :dynamic]) :: Nothing

A multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of

for x in A
     f(x)
 end

Example:

using OhMyThreads: tforeach
@@ -87,15 +87,15 @@
     println(i^2)
 end

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.

In addition, tforeach accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

tforeach(1:10; chunksize=2, scheduler=:static) do i
     println(i^2)
-end

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
+end

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
          [scheduler::Union{Scheduler, Symbol} = :dynamic])

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Example:

using OhMyThreads: tcollect
 
-tcollect(sin(i) for i in 1:10)

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.

In addition, tcollect accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

tcollect(sin(i) for i in 1:10; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
+tcollect(sin(i) for i in 1:10)

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.

In addition, tcollect accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

tcollect(sin(i) for i in 1:10; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
            [scheduler::Union{Scheduler, Symbol} = :dynamic],
            [outputtype::Type = Any],
            [init])

Like tmapreduce except the order of the f and op arguments is switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

Example:

using OhMyThreads: treducemap
 
-treducemap(+, √, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

(√1 + √2) + (√3 + √4) + √5

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.
  • outputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.
  • init: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.

In addition, treducemap accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

treducemap(+, √, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source

Schedulers

OhMyThreads.Schedulers.DynamicSchedulerType
DynamicScheduler (aka :dynamic)

The default dynamic scheduler. Divides the given collection into chunks and then spawns a task per chunk to perform the requested operation in parallel. The tasks are assigned to threads by Julia's dynamic scheduler and are non-sticky, that is, they can migrate between threads.

Generally preferred since it is flexible, can provide load balancing, and is composable with other multithreaded code.

Keyword arguments:

  • nchunks::Integer or ntasks::Integer (default nthreads(threadpool)):
    • Determines the number of chunks (and thus also the number of parallel tasks).
    • Increasing nchunks can help with load balancing, but at the expense of creating more overhead. For nchunks <= nthreads() there are not enough chunks for any load balancing.
    • Setting nchunks < nthreads() is an effective way to use only a subset of the available threads.
  • chunksize::Integer (default not set)
    • Specifies the desired chunk size (instead of the number of chunks).
    • The options chunksize and nchunks/ntasks are mutually exclusive (only one may be a positive integer).
  • split::Symbol (default :batch):
    • Determines how the collection is divided into chunks (if chunking=true). By default, each chunk consists of contiguous elements and order is maintained.
    • See ChunkSplitters.jl for more details and available options.
    • Beware that for split=:scatter the order of elements isn't maintained and a reducer function must not only be associative but also commutative!
  • chunking::Bool (default true):
    • Controls whether input elements are grouped into chunks (true) or not (false).
    • For chunking=false, the arguments nchunks/ntasks, chunksize, and split are ignored and input elements are regarded as "chunks" as is. Hence, there will be one parallel task spawned per input element. Note that, depending on the input, this might spawn many(!) tasks and can be costly!
  • threadpool::Symbol (default :default):
    • Possible options are :default and :interactive.
    • The high-priority pool :interactive should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes.
source
OhMyThreads.Schedulers.StaticSchedulerType
StaticScheduler (aka :static)

A static low-overhead scheduler. Divides the given collection into chunks and then spawns a task per chunk to perform the requested operation in parallel. The tasks are statically assigned to threads up front and are made sticky, that is, they are guaranteed to stay on the assigned threads (no task migration).

Can sometimes be more performant than DynamicScheduler when the workload is (close to) uniform and, because of the lower overhead, for small workloads. Isn't well composable with other multithreaded code though.

Keyword arguments:

  • nchunks::Integer or ntasks::Integer (default nthreads()):
    • Determines the number of chunks (and thus also the number of parallel tasks).
    • Setting nchunks < nthreads() is an effective way to use only a subset of the available threads.
    • For nchunks > nthreads() the chunks will be distributed to the available threads in a round-robin fashion.
  • chunksize::Integer (default not set)
    • Specifies the desired chunk size (instead of the number of chunks).
    • The options chunksize and nchunks/ntasks are mutually exclusive (only one may be non-zero).
  • chunking::Bool (default true):
    • Controls whether input elements are grouped into chunks (true) or not (false).
    • For chunking=false, the arguments nchunks/ntasks, chunksize, and split are ignored and input elements are regarded as "chunks" as is. Hence, there will be one parallel task spawned per input element. Note that, depending on the input, this might spawn many(!) tasks and can be costly!
  • split::Symbol (default :batch):
    • Determines how the collection is divided into chunks. By default, each chunk consists of contiguous elements and order is maintained.
    • See ChunkSplitters.jl for more details and available options.
    • Beware that for split=:scatter the order of elements isn't maintained and a reducer function must not only be associative but also commutative!
source
OhMyThreads.Schedulers.GreedySchedulerType
GreedyScheduler (aka :greedy)

A greedy dynamic scheduler. The elements of the collection are first put into a Channel and then dynamic, non-sticky tasks are spawned to process the channel content in parallel.

Note that elements are processed in a non-deterministic order, and thus a potential reducing function must be commutative in addition to being associative, or you could get incorrect results!

Can be a good choice for load-balancing slower, uneven computations, but does carry some additional overhead.

Keyword arguments:

  • ntasks::Int (default nthreads()):
    • Determines the number of parallel tasks to be spawned.
    • Setting ntasks < nthreads() is an effective way to use only a subset of the available threads.
  • chunking::Bool (default false):
    • Controls whether input elements are grouped into chunks (true) or not (false) before being put into the channel. This can improve performance, especially if there are many iterations, each of which is computationally cheap.
    • If nchunks or chunksize are explicitly specified, chunking will be automatically set to true.
  • nchunks::Integer (default 10 * nthreads()):
    • Determines the number of chunks (that will eventually be put into the channel).
    • Increasing nchunks can help with load balancing. For nchunks <= nthreads() there are not enough chunks for any load balancing.
  • chunksize::Integer (default not set)
    • Specifies the desired chunk size (instead of the number of chunks).
    • The options chunksize and nchunks are mutually exclusive (only one may be a positive integer).
  • split::Symbol (default :scatter):
    • Determines how the collection is divided into chunks (if chunking=true).
    • See ChunkSplitters.jl for more details and available options.
source
OhMyThreads.Schedulers.SerialSchedulerType
SerialScheduler (aka :serial)

A scheduler for turning off any multithreading and running the code in serial. It aims to make parallel functions like, e.g., tmapreduce(sin, +, 1:100) behave like their serial counterparts, e.g., mapreduce(sin, +, 1:100).

source

Non-Exported

OhMyThreads.@spawnsee StableTasks.jl
OhMyThreads.@spawnatsee StableTasks.jl
OhMyThreads.@fetchsee StableTasks.jl
OhMyThreads.@fetchfromsee StableTasks.jl
OhMyThreads.chunkssee ChunkSplitters.jl
OhMyThreads.TaskLocalValuesee TaskLocalValues.jl
OhMyThreads.WithTaskLocalsType
struct WithTaskLocals{F, TLVs <: Tuple{Vararg{TaskLocalValue}}} <: Function

This callable function-like object is meant to represent a function which closes over some TaskLocalValues. That is, if you do

TLV{T} = TaskLocalValue{T}
+treducemap(+, √, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

(√1 + √2) + (√3 + √4) + √5

Keyword arguments:

  • scheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.
  • outputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.
  • init: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.

In addition, treducemap accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:

treducemap(+, √, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)

However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).

source

Schedulers

OhMyThreads.Schedulers.DynamicSchedulerType
DynamicScheduler (aka :dynamic)

The default dynamic scheduler. Divides the given collection into chunks and then spawns a task per chunk to perform the requested operation in parallel. The tasks are assigned to threads by Julia's dynamic scheduler and are non-sticky, that is, they can migrate between threads.

Generally preferred since it is flexible, can provide load balancing, and is composable with other multithreaded code.

Keyword arguments:

  • nchunks::Integer or ntasks::Integer (default nthreads(threadpool)):
    • Determines the number of chunks (and thus also the number of parallel tasks).
    • Increasing nchunks can help with load balancing, but at the expense of creating more overhead. For nchunks <= nthreads() there are not enough chunks for any load balancing.
    • Setting nchunks < nthreads() is an effective way to use only a subset of the available threads.
  • chunksize::Integer (default not set)
    • Specifies the desired chunk size (instead of the number of chunks).
    • The options chunksize and nchunks/ntasks are mutually exclusive (only one may be a positive integer).
  • split::Union{Symbol, OhMyThreads.Split} (default OhMyThreads.Consecutive()):
    • Determines how the collection is divided into chunks (if chunking=true). By default, each chunk consists of contiguous elements and order is maintained.
    • See ChunkSplitters.jl for more details and available options. We also allow users to pass :consecutive in place of Consecutive(), and :roundrobin in place of RoundRobin()
    • Beware that for split=OhMyThreads.RoundRobin() the order of elements isn't maintained and a reducer function must not only be associative but also commutative!
  • chunking::Bool (default true):
    • Controls whether input elements are grouped into chunks (true) or not (false).
    • For chunking=false, the arguments nchunks/ntasks, chunksize, and split are ignored and input elements are regarded as "chunks" as is. Hence, there will be one parallel task spawned per input element. Note that, depending on the input, this might spawn many(!) tasks and can be costly!
  • threadpool::Symbol (default :default):
    • Possible options are :default and :interactive.
    • The high-priority pool :interactive should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes.
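
A usage sketch (not part of the original docstring; the task count is arbitrary):

using OhMyThreads: tmapreduce, DynamicScheduler

# equivalent ways to request the dynamic scheduler with 8 tasks
tmapreduce(sin, +, 1:10_000; scheduler = DynamicScheduler(; ntasks = 8))
tmapreduce(sin, +, 1:10_000; scheduler = :dynamic, ntasks = 8)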
source
OhMyThreads.Schedulers.StaticSchedulerType
StaticScheduler (aka :static)

A static low-overhead scheduler. Divides the given collection into chunks and then spawns a task per chunk to perform the requested operation in parallel. The tasks are statically assigned to threads up front and are made sticky, that is, they are guaranteed to stay on the assigned threads (no task migration).

Can sometimes be more performant than DynamicScheduler when the workload is (close to) uniform and, because of the lower overhead, for small workloads. Isn't well composable with other multithreaded code though.

Keyword arguments:

  • nchunks::Integer or ntasks::Integer (default nthreads()):
    • Determines the number of chunks (and thus also the number of parallel tasks).
    • Setting nchunks < nthreads() is an effective way to use only a subset of the available threads.
    • For nchunks > nthreads() the chunks will be distributed to the available threads in a round-robin fashion.
  • chunksize::Integer (default not set)
    • Specifies the desired chunk size (instead of the number of chunks).
    • The options chunksize and nchunks/ntasks are mutually exclusive (only one may be non-zero).
  • chunking::Bool (default true):
    • Controls whether input elements are grouped into chunks (true) or not (false).
    • For chunking=false, the arguments nchunks/ntasks, chunksize, and split are ignored and input elements are regarded as "chunks" as is. Hence, there will be one parallel task spawned per input element. Note that, depending on the input, this might spawn many(!) tasks and can be costly!
  • split::Union{Symbol, OhMyThreads.Split} (default OhMyThreads.Consecutive()):
    • Determines how the collection is divided into chunks. By default, each chunk consists of contiguous elements and order is maintained.
    • See ChunkSplitters.jl for more details and available options. We also allow users to pass :consecutive in place of Consecutive(), and :roundrobin in place of RoundRobin()
    • Beware that for split=OhMyThreads.RoundRobin() the order of elements isn't maintained and a reducer function must not only be associative but also commutative!
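
A usage sketch (not part of the original docstring):

using OhMyThreads: tmap, StaticScheduler

# uniform workload: static assignment avoids dynamic scheduling overhead
tmap(i -> i^2, 1:100; scheduler = StaticScheduler())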
source
OhMyThreads.Schedulers.GreedySchedulerType
GreedyScheduler (aka :greedy)

A greedy dynamic scheduler. The elements of the collection are first put into a Channel and then dynamic, non-sticky tasks are spawned to process the channel content in parallel.

Note that elements are processed in a non-deterministic order, and thus a potential reducing function must be commutative in addition to being associative, or you could get incorrect results!

Can be a good choice for load-balancing slower, uneven computations, but does carry some additional overhead.

Keyword arguments:

  • ntasks::Int (default nthreads()):
    • Determines the number of parallel tasks to be spawned.
    • Setting ntasks < nthreads() is an effective way to use only a subset of the available threads.
  • chunking::Bool (default false):
    • Controls whether input elements are grouped into chunks (true) or not (false) before being put into the channel. This can improve performance, especially if there are many iterations, each of which is computationally cheap.
    • If nchunks or chunksize are explicitly specified, chunking will be automatically set to true.
  • nchunks::Integer (default 10 * nthreads()):
    • Determines the number of chunks (that will eventually be put into the channel).
    • Increasing nchunks can help with load balancing. For nchunks <= nthreads() there are not enough chunks for any load balancing.
  • chunksize::Integer (default not set)
    • Specifies the desired chunk size (instead of the number of chunks).
    • The options chunksize and nchunks are mutually exclusive (only one may be a positive integer).
  • split::Union{Symbol, OhMyThreads.Split} (default OhMyThreads.RoundRobin()):
    • Determines how the collection is divided into chunks (if chunking=true).
    • See ChunkSplitters.jl for more details and available options. We also allow users to pass :consecutive in place of Consecutive(), and :roundrobin in place of RoundRobin()
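
A usage sketch (not part of the original docstring; the workload is made artificially non-uniform):

using OhMyThreads: tmapreduce, GreedyScheduler

tmapreduce(+, 1:200; scheduler = GreedyScheduler(; ntasks = 4)) do i
    s = 0.0
    for k in 1:(i * 1_000) # per-element cost grows with i
        s += sin(k)
    end
    s
end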
source
OhMyThreads.Schedulers.SerialSchedulerType
SerialScheduler (aka :serial)

A scheduler for turning off any multithreading and running the code in serial. It aims to make parallel functions like, e.g., tmapreduce(sin, +, 1:100) behave like their serial counterparts, e.g., mapreduce(sin, +, 1:100).
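
A usage sketch (not part of the original docstring):

using OhMyThreads: tmapreduce, SerialScheduler

# runs on the calling task only; handy for debugging or as a serial baseline
tmapreduce(sin, +, 1:100; scheduler = SerialScheduler())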

source

Re-exported

OhMyThreads.chunkssee ChunkSplitters.jl
OhMyThreads.index_chunkssee ChunkSplitters.jl

Public but not exported

OhMyThreads.@spawnsee StableTasks.jl
OhMyThreads.@spawnatsee StableTasks.jl
OhMyThreads.@fetchsee StableTasks.jl
OhMyThreads.@fetchfromsee StableTasks.jl
OhMyThreads.TaskLocalValuesee TaskLocalValues.jl
OhMyThreads.Splitsee ChunkSplitters.jl
OhMyThreads.Consecutivesee ChunkSplitters.jl
OhMyThreads.RoundRobinsee ChunkSplitters.jl
OhMyThreads.WithTaskLocalsType
struct WithTaskLocals{F, TLVs <: Tuple{Vararg{TaskLocalValue}}} <: Function

This callable function-like object is meant to represent a function which closes over some TaskLocalValues. That is, if you do

TLV{T} = TaskLocalValue{T}
 f = WithTaskLocals((TLV{Int}(() -> 1), TLV{Int}(() -> 2))) do (x, y)
     z -> (x + y)/z
 end

then that is equivalent to

g = let x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)
@@ -104,7 +104,7 @@
     end
 end

however, the main difference is that you can call promise_task_local on a WithTaskLocals closure in order to turn it into something equivalent to

let x=x[], y=y[]
     z -> (x + y)/z
-end

which doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course will lose the safety advantages of TaskLocalValue, so you should never do f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local on a new task, you'll hit a race condition.

source
OhMyThreads.promise_task_localFunction
promise_task_local(f) = f
+end

which doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course will lose the safety advantages of TaskLocalValue, so you should never do f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local on a new task, you'll hit a race condition.

source
OhMyThreads.promise_task_localFunction
promise_task_local(f) = f
 promise_task_local(f::WithTaskLocals) = f.inner_func(map(x -> x[], f.tasklocals))

Take a WithTaskLocals closure, grab the TaskLocalValues, and pass them to the closure. That is, it turns a WithTaskLocals closure from the equivalent of

TLV{T} = TaskLocalValue{T}
 let x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)
     z -> let x = x[], y=y[]
@@ -114,4 +114,4 @@
     let x = x[], y = y[]
         z -> (x + y)/z
     end
-end

which doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course will lose the safety advantages of TaskLocalValue, so you should never do f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local on a new task, you'll hit a race condition.

source
+end

which doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course will lose the safety advantages of TaskLocalValue, so you should never do f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local on a new task, you'll hit a race condition.
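
A hedged sketch of the intended pattern (assumptions: promise_task_local is called inside the task that uses the result, with a hypothetical task-local buffer):

using OhMyThreads: WithTaskLocals, promise_task_local, TaskLocalValue, @spawn

f = WithTaskLocals((TaskLocalValue{Vector{Float64}}(() -> zeros(3)),)) do (buf,)
    x -> (buf .= x; sum(buf))
end

t = @spawn begin
    f_local = promise_task_local(f) # fetch the task-local buffer once, on this task
    mapreduce(f_local, +, ([1.0, 2.0, 3.0] for _ in 1:10))
end
fetch(t)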

source
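A hedged end-to-end sketch combining the two docstrings above (the buffer, its size, and the closure body are made up for illustration):

using OhMyThreads: WithTaskLocals, promise_task_local, TaskLocalValue

# A closure over a single task-local buffer.
f = WithTaskLocals((TaskLocalValue{Vector{Float64}}(() -> zeros(3)),)) do (buf,)
    x -> sum(buf) + x
end

# Within a task, strip the task-local indirection once and reuse the result;
# never pass f_local to code that might run it on another task.
f_local = promise_task_local(f)
f_local(1.0)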
diff --git a/dev/refs/experimental/index.html b/dev/refs/experimental/index.html index 1323cf9..1f4d86f 100644 --- a/dev/refs/experimental/index.html +++ b/dev/refs/experimental/index.html @@ -18,4 +18,4 @@ println(i, ": before") @barrier println(i, ": after") -endsource
+endsource
diff --git a/dev/refs/internal/index.html b/dev/refs/internal/index.html index c9e68ec..497c9b0 100644 --- a/dev/refs/internal/index.html +++ b/dev/refs/internal/index.html @@ -1,5 +1,5 @@ -Internal · OhMyThreads.jl

Internal

Warning

Everything on this page is internal and might change or be dropped at any point!

References

OhMyThreads.Tools.SimpleBarrierType

SimpleBarrier(n::Integer)

Simple reusable barrier for n parallel tasks.

Given b = SimpleBarrier(n) and n parallel tasks, each task that calls wait(b) will block until the other n-1 tasks have called wait(b) as well.

Example

n = nthreads()
+Internal · OhMyThreads.jl

Internal

Warning

Everything on this page is internal and might change or be dropped at any point!

References

OhMyThreads.Tools.SimpleBarrierType

SimpleBarrier(n::Integer)

Simple reusable barrier for n parallel tasks.

Given b = SimpleBarrier(n) and n parallel tasks, each task that calls wait(b) will block until the other n-1 tasks have called wait(b) as well.

Example

n = nthreads()
 barrier = SimpleBarrier(n)
 @sync for i in 1:n
     @spawn begin
@@ -9,7 +9,7 @@
         wait(barrier) # synchronize all tasks (reusable)
         println("C")
     end
-end
source
OhMyThreads.Tools.taskidMethod
taskid() :: UInt

Return a UInt identifier for the currently running Task. This identifier will be unique so long as references to the task it came from still exist.

source
OhMyThreads.Tools.try_enter!Method
try_enter!(f, s::OnlyOneRegion)

When called from multiple parallel tasks (on a shared s::OnlyOneRegion), only a single task will execute f.

Example

using OhMyThreads: @tasks
+end
source
OhMyThreads.Tools.taskidMethod
taskid() :: UInt

Return a UInt identifier for the currently running Task. This identifier will be unique so long as references to the task it came from still exist.

source
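A small sketch of taskid in action (modelled on the @local example from the public API docs; output interleaving will vary between runs):

using OhMyThreads: @tasks
using OhMyThreads.Tools: taskid

# With 2 tasks over 4 iterations, at most two distinct identifiers should appear.
@tasks for i in 1:4
    @set ntasks = 2
    println(i, " handled by task ", taskid())
end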
OhMyThreads.Tools.try_enter!Method
try_enter!(f, s::OnlyOneRegion)

When called from multiple parallel tasks (on a shared s::OnlyOneRegion), only a single task will execute f.

Example

using OhMyThreads: @tasks
 using OhMyThreads.Tools: OnlyOneRegion, try_enter!
 
 only_one = OnlyOneRegion()
@@ -23,4 +23,4 @@
         sleep(1)
     end
     println(i, ": after")
-end
source
+end
source
diff --git a/dev/search_index.js b/dev/search_index.js index e44a53b..e2a7bac 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"literate/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. Hence, the computational workload is non-uniform.","category":"page"},{"location":"literate/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"literate/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. 
We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\n# or alternatively\n#\n# function compute_juliaset_parallel!(img; kwargs...)\n# N = size(img, 1)\n# cart = CartesianIndices(img)\n# @tasks for idx in eachindex(img)\n# c = cart[idx]\n# img[idx] = _compute_pixel(c[1], c[2], N)\n# end\n# return img\n# end\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"literate/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@show nthreads()\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"nthreads() = 10\n 131.295 ms (0 allocations: 0 bytes)\n 31.422 ms (68 allocations: 6.09 KiB)\n","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is much faster!","category":"page"},{"location":"literate/juliaset/juliaset/#Dynamic-vs-static-scheduling","page":"Julia Set","title":"Dynamic vs static scheduling","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we do benefit from the load balancing of the default dynamic scheduler. The latter divides the overall workload into tasks that can then be dynamically distributed among threads to adjust the per-thread load. We can try to fine tune and improve the load balancing further by increasing the ntasks parameter of the scheduler, that is, creating more tasks with smaller per-task workload.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: DynamicScheduler\n\n@btime compute_juliaset_parallel!($img; ntasks=N, scheduler=:dynamic) samples=10 evals=3;","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 17.438 ms (12018 allocations: 1.11 MiB)\n","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that while this turns out to be a bit faster, it comes at the expense of much more allocations.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"To quantify the impact of load balancing we can opt out of dynamic scheduling and use the StaticScheduler instead. 
The latter doesn't provide any form of load balancing.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: StaticScheduler\n\n@btime compute_juliaset_parallel!($img; scheduler=:static) samples=10 evals=3;","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 30.097 ms (73 allocations: 6.23 KiB)\n","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"CollapsedDocStrings = true","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"warning: Warning\nEverything on this page is internal and and might changed or dropped at any point!","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.OnlyOneRegion","page":"Internal","title":"OhMyThreads.Tools.OnlyOneRegion","text":"May be used to mark a region in parallel code to be executed by a single task only (all other tasks shall skip over it).\n\nSee try_enter! and reset!.\n\n\n\n\n\n","category":"type"},{"location":"refs/internal/#OhMyThreads.Tools.SimpleBarrier","page":"Internal","title":"OhMyThreads.Tools.SimpleBarrier","text":"SimpleBarrier(n::Integer)\n\nSimple reusable barrier for n parallel tasks.\n\nGiven b = SimpleBarrier(n) and n parallel tasks, each task that calls wait(b) will block until the other n-1 tasks have called wait(b) as well.\n\nExample\n\nn = nthreads()\nbarrier = SimpleBarrier(n)\n@sync for i in 1:n\n @spawn begin\n println(\"A\")\n wait(barrier) # synchronize all tasks\n println(\"B\")\n wait(barrier) # synchronize all tasks (reusable)\n println(\"C\")\n end\nend\n\n\n\n\n\n","category":"type"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"nthtid(n)\n\nReturns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.reset!-Tuple{OhMyThreads.Tools.OnlyOneRegion}","page":"Internal","title":"OhMyThreads.Tools.reset!","text":"Reset the OnlyOneRegion (so that it can be used again).\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. 
This identifier will be unique so long as references to the task it came from still exist.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.try_enter!-Tuple{Any, OhMyThreads.Tools.OnlyOneRegion}","page":"Internal","title":"OhMyThreads.Tools.try_enter!","text":"try_enter!(f, s::OnlyOneRegion)\n\nWhen called from multiple parallel tasks (on a shared s::OnlyOneRegion) only a single task will execute f.\n\nExample\n\nusing OhMyThreads: @tasks\nusing OhMyThreads.Tools: OnlyOneRegion, try_enter!\n\nonly_one = OnlyOneRegion()\n\n@tasks for i in 1:10\n @set ntasks = 10\n\n println(i, \": before\")\n try_enter!(only_one) do\n println(i, \": only printed by a single task\")\n sleep(1)\n end\n println(i, \": after\")\nend\n\n\n\n\n\n","category":"method"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"CollapsedDocStrings = true","category":"page"},{"location":"refs/api/#API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/#Macros","page":"Public API","title":"Macros","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"@tasks\n@set\n@local\n@only_one\n@one_by_one","category":"page"},{"location":"refs/api/#OhMyThreads.@tasks","page":"Public API","title":"OhMyThreads.@tasks","text":"@tasks for ... end\n\nA macro to parallelize a for loop by spawning a set of tasks that can be run in parallel. The policy of how many tasks to spawn and how to distribute the iteration space among the tasks (and more) can be configured via @set statements in the loop body.\n\nSupports reductions (@set reducer=) and collecting the results (@set collect=true).\n\nUnder the hood, the for loop is translated into corresponding parallel tforeach, tmapreduce, or tmap calls.\n\nSee also: @set, @local\n\nExamples\n\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:3\n println(i)\nend\n\n@tasks for x in rand(10)\n @set reducer=+\n sin(x)\nend\n\n@tasks for i in 1:5\n @set collect=true\n i^2\nend\n\n@tasks for i in 1:100\n @set ntasks=4*nthreads()\n # non-uniform work...\nend\n\n@tasks for i in 1:5\n @set scheduler=:static\n println(\"i=\", i, \" → \", threadid())\nend\n\n@tasks for i in 1:100\n @set begin\n scheduler=:static\n chunksize=10\n end\n println(\"i=\", i, \" → \", threadid())\nend\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#OhMyThreads.@set","page":"Public API","title":"OhMyThreads.@set","text":"@set name = value\n\nThis can be used inside a @tasks for ... end block to specify settings for the parallel execution of the loop.\n\nMultiple settings are supported, either as separate @set statements or via @set begin ... end.\n\nSettings\n\nreducer (e.g. reducer=+): Indicates that a reduction should be performed with the provided binary function. See tmapreduce for more information.\ncollect (e.g. collect=true): Indicates that results should be collected (similar to map).\n\nAll other settings will be passed on to the underlying parallel functions (e.g. tmapreduce) as keyword arguments. Hence, you may provide whatever these functions accept as keyword arguments. Among others, this includes\n\nscheduler (e.g. scheduler=:static): Can be either a Scheduler or a Symbol (e.g. :dynamic, :static, :serial, or :greedy).\ninit (e.g. init=0.0): Initial value to be used in a reduction (requires reducer=...).\n\nSettings like ntasks, chunksize, and split etc. 
can be used to tune the scheduling policy (if the selected scheduler supports it).\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#OhMyThreads.@local","page":"Public API","title":"OhMyThreads.@local","text":"@local name = value\n\n@local name::T = value\n\nCan be used inside a @tasks for ... end block to specify task-local values (TLV) via explicitly typed assignments. These values will be allocated once per task (rather than once per iteration) and can be re-used between different task-local iterations.\n\nThere can only be a single @local block in a @tasks for ... end block. To specify multiple TLVs, use @local begin ... end. Compared to regular assignments, there are some limitations though, e.g. TLVs can't reference each other.\n\nExamples\n\nusing OhMyThreads: @tasks\nusing OhMyThreads.Tools: taskid\n\n@tasks for i in 1:10\n @set begin\n scheduler=:dynamic\n ntasks=2\n end\n @local x = zeros(3) # TLV\n\n x .+= 1\n println(taskid(), \" -> \", x)\nend\n\n@tasks for i in 1:10\n @local begin\n x = rand(Int, 3)\n M = rand(3, 3)\n end\n # ...\nend\n\nTask local variables created by @local are by default constrained to their inferred type, but if you need to, you can specify a different type during declaration:\n\n@tasks for i in 1:10\n @local x::Vector{Float64} = some_hard_to_infer_setup_function()\n # ...\nend\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#OhMyThreads.@only_one","page":"Public API","title":"OhMyThreads.@only_one","text":"@only_one begin ... end\n\nThis can be used inside a @tasks for ... end block to mark a region of code to be executed by only one of the parallel tasks (all other tasks skip over this region).\n\nExample\n\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n @set ntasks = 10\n\n println(i, \": before\")\n @only_one begin\n println(i, \": only printed by a single task\")\n sleep(1)\n end\n println(i, \": after\")\nend\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#OhMyThreads.@one_by_one","page":"Public API","title":"OhMyThreads.@one_by_one","text":"@one_by_one begin ... end\n\nThis can be used inside a @tasks for ... end block to mark a region of code to be executed by one parallel task at a time (i.e. exclusive access). The order may be arbitrary and non-deterministic.\n\nExample\n\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n @set ntasks = 10\n\n println(i, \": before\")\n @one_by_one begin\n println(i, \": one task at a time\")\n sleep(0.5)\n end\n println(i, \": after\")\nend\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#Functions","page":"Public API","title":"Functions","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic],\n [outputtype::Type = Any],\n [init])\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). 
If op is not (approximately) associative, you will get undefined results.\n\nExample:\n\nusing OhMyThreads: tmapreduce\n\ntmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n(√1 + √2) + (√3 + √4) + √5\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\noutputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.\ninit: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.\n\nIn addition, tmapreduce accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntmapreduce(√, +, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic],\n [outputtype::Type = Any],\n [init])\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nExample:\n\nusing OhMyThreads: treduce\n\ntreduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum([1, 2, 3, 4, 5]) in the form\n\n(1 + 2) + (3 + 4) + 5\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\noutputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.\ninit: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.\n\nIn addition, treduce accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntreduce(+, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic])\n\nA multithreaded function like Base.map. 
Create a new container similar to A and fills it in parallel such that the ith element is equal to f(A[i]).\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nExample:\n\nusing OhMyThreads: tmap\n\ntmap(sin, 1:10)\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\n\nIn addition, tmap accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntmap(sin, 1:10; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic])\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\n\nIn addition, tmap! accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic]) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nExample:\n\nusing OhMyThreads: tforeach\n\ntforeach(1:10) do i\n println(i^2)\nend\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\n\nIn addition, tforeach accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntforeach(1:10; chunksize=2, scheduler=:static) do i\n println(i^2)\nend\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n [scheduler::Union{Scheduler, Symbol} = :dynamic])\n\nA multithreaded function like Base.collect. 
Essentially just calls tmap on the generator function and inputs.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nExample:\n\nusing OhMyThreads: tcollect\n\ntcollect(sin(i) for i in 1:10)\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\n\nIn addition, tcollect accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntcollect(sin(i) for i in 1:10; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic],\n [outputtype::Type = Any],\n [init])\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nExample:\n\nusing OhMyThreads: treducemap\n\ntreducemap(+, √, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n(√1 + √2) + (√3 + √4) + √5\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\noutputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.\ninit: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.\n\nIn addition, treducemap accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. 
Example:\n\ntreducemap(+, √, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#Schedulers","page":"Public API","title":"Schedulers","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Scheduler\nDynamicScheduler\nStaticScheduler\nGreedyScheduler\nSerialScheduler","category":"page"},{"location":"refs/api/#OhMyThreads.Schedulers.Scheduler","page":"Public API","title":"OhMyThreads.Schedulers.Scheduler","text":"Supertype for all available schedulers:\n\nDynamicScheduler: default dynamic scheduler\nStaticScheduler: low-overhead static scheduler\nGreedyScheduler: greedy load-balancing scheduler\nSerialScheduler: serial (non-parallel) execution\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.Schedulers.DynamicScheduler","page":"Public API","title":"OhMyThreads.Schedulers.DynamicScheduler","text":"DynamicScheduler (aka :dynamic)\n\nThe default dynamic scheduler. Divides the given collection into chunks and then spawns a task per chunk to perform the requested operation in parallel. The tasks are assigned to threads by Julia's dynamic scheduler and are non-sticky, that is, they can migrate between threads.\n\nGenerally preferred since it is flexible, can provide load balancing, and is composable with other multithreaded code.\n\nKeyword arguments:\n\nnchunks::Integer or ntasks::Integer (default nthreads(threadpool)):\nDetermines the number of chunks (and thus also the number of parallel tasks).\nIncreasing nchunks can help with load balancing, but at the expense of creating more overhead. For nchunks <= nthreads() there are not enough chunks for any load balancing.\nSetting nchunks < nthreads() is an effective way to use only a subset of the available threads.\nchunksize::Integer (default not set)\nSpecifies the desired chunk size (instead of the number of chunks).\nThe options chunksize and nchunks/ntasks are mutually exclusive (only one may be a positive integer).\nsplit::Symbol (default :batch):\nDetermines how the collection is divided into chunks (if chunking=true). By default, each chunk consists of contiguous elements and order is maintained.\nSee ChunkSplitters.jl for more details and available options.\nBeware that for split=:scatter the order of elements isn't maintained and a reducer function must not only be associative but also commutative!\nchunking::Bool (default true):\nControls whether input elements are grouped into chunks (true) or not (false).\nFor chunking=false, the arguments nchunks/ntasks, chunksize, and split are ignored and input elements are regarded as \"chunks\" as is. Hence, there will be one parallel task spawned per input element. Note that, depending on the input, this might spawn many(!) tasks and can be costly!\nthreadpool::Symbol (default :default):\nPossible options are :default and :interactive.\nThe high-priority pool :interactive should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes.\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.Schedulers.StaticScheduler","page":"Public API","title":"OhMyThreads.Schedulers.StaticScheduler","text":"StaticScheduler (aka :static)\n\nA static low-overhead scheduler. 
Divides the given collection into chunks and then spawns a task per chunk to perform the requested operation in parallel. The tasks are statically assigned to threads up front and are made sticky, that is, they are guaranteed to stay on the assigned threads (no task migration).\n\nCan sometimes be more performant than DynamicScheduler when the workload is (close to) uniform and, because of the lower overhead, for small workloads. Isn't well composable with other multithreaded code though.\n\nKeyword arguments:\n\nnchunks::Integer or ntasks::Integer (default nthreads()):\nDetermines the number of chunks (and thus also the number of parallel tasks).\nSetting nchunks < nthreads() is an effective way to use only a subset of the available threads.\nFor nchunks > nthreads() the chunks will be distributed to the available threads in a round-robin fashion.\nchunksize::Integer (default not set)\nSpecifies the desired chunk size (instead of the number of chunks).\nThe options chunksize and nchunks/ntasks are mutually exclusive (only one may be non-zero).\nchunking::Bool (default true):\nControls whether input elements are grouped into chunks (true) or not (false).\nFor chunking=false, the arguments nchunks/ntasks, chunksize, and split are ignored and input elements are regarded as \"chunks\" as is. Hence, there will be one parallel task spawned per input element. Note that, depending on the input, this might spawn many(!) tasks and can be costly!\nsplit::Symbol (default :batch):\nDetermines how the collection is divided into chunks. By default, each chunk consists of contiguous elements and order is maintained.\nSee ChunkSplitters.jl for more details and available options.\nBeware that for split=:scatter the order of elements isn't maintained and a reducer function must not only be associative but also commutative!\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.Schedulers.GreedyScheduler","page":"Public API","title":"OhMyThreads.Schedulers.GreedyScheduler","text":"GreedyScheduler (aka :greedy)\n\nA greedy dynamic scheduler. The elements of the collection are first put into a Channel and then dynamic, non-sticky tasks are spawned to process the channel content in parallel.\n\nNote that elements are processed in a non-deterministic order, and thus a potential reducing function must be commutative in addition to being associative, or you could get incorrect results!\n\nCan be good choice for load-balancing slower, uneven computations, but does carry some additional overhead.\n\nKeyword arguments:\n\nntasks::Int (default nthreads()):\nDetermines the number of parallel tasks to be spawned.\nSetting ntasks < nthreads() is an effective way to use only a subset of the available threads.\nchunking::Bool (default false):\nControls whether input elements are grouped into chunks (true) or not (false) before put into the channel. This can improve the performance especially if there are many iterations each of which are computationally cheap.\nIf nchunks or chunksize are explicitly specified, chunking will be automatically set to true.\nnchunks::Integer (default 10 * nthreads()):\nDetermines the number of chunks (that will eventually be put into the channel).\nIncreasing nchunks can help with load balancing. 
For nchunks <= nthreads() there are not enough chunks for any load balancing.\nchunksize::Integer (default not set)\nSpecifies the desired chunk size (instead of the number of chunks).\nThe options chunksize and nchunks are mutually exclusive (only one may be a positive integer).\nsplit::Symbol (default :scatter):\nDetermines how the collection is divided into chunks (if chunking=true).\nSee ChunkSplitters.jl for more details and available options.\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.Schedulers.SerialScheduler","page":"Public API","title":"OhMyThreads.Schedulers.SerialScheduler","text":"SerialScheduler (aka :serial)\n\nA scheduler for turning off any multithreading and running the code in serial. It aims to make parallel functions like, e.g., tmapreduce(sin, +, 1:100) behave like their serial counterparts, e.g., mapreduce(sin, +, 1:100).\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl\nOhMyThreads.@fetch see StableTasks.jl\nOhMyThreads.@fetchfrom see StableTasks.jl\nOhMyThreads.chunks see ChunkSplitters.jl\nOhMyThreads.TaskLocalValue see TaskLocalValues.jl","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"OhMyThreads.WithTaskLocals\nOhMyThreads.promise_task_local","category":"page"},{"location":"refs/api/#OhMyThreads.WithTaskLocals","page":"Public API","title":"OhMyThreads.WithTaskLocals","text":"struct WithTaskLocals{F, TLVs <: Tuple{Vararg{TaskLocalValue}}} <: Function\n\nThis callable function-like object is meant to represent a function which closes over some TaskLocalValues. This is, if you do\n\nTLV{T} = TaskLocalValue{T}\nf = WithTaskLocals((TLV{Int}(() -> 1), TLV{Int}(() -> 2))) do (x, y)\n z -> (x + y)/z\nend\n\nthen that is equivalent to\n\ng = let x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)\n z -> let x = x[], y=y[]\n (x + y)/z\n end\nend\n\nhowever, the main difference is that you can call promise_task_local on a WithTaskLocals closure in order to turn it into something equivalent to\n\nlet x=x[], y=y[]\n z -> (x + y)/z\nend\n\nwhich doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course will lose the safety advantages of TaskLocalValue, so you should never do f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local on a new task, you'll hit a race condition.\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.promise_task_local","page":"Public API","title":"OhMyThreads.promise_task_local","text":"promise_task_local(f) = f\npromise_task_local(f::WithTaskLocals) = f.inner_func(map(x -> x[], f.tasklocals))\n\nTake a WithTaskLocals closure, grab the TaskLocalValues, and passs them to the closure. That is, it turns a WithTaskLocals closure from the equivalent of\n\nTLV{T} = TaskLocalValue{T}\nlet x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)\n z -> let x = x[], y=y[]\n (x + y)/z\n end\nend\n\ninto the equivalent of\n\nlet x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)\n let x = x[], y = y[]\n z -> (x + y)/z\n end\nend\n\nwhich doesn't have the overhead of accessing the task_local_storage each time the closure is called. 
This of course will lose the safety advantages of TaskLocalValue, so you should never do f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local on a new task, you'll hit a race condition. ```\n\n\n\n\n\n","category":"function"},{"location":"basics/#Basics","page":"Basics","title":"Basics","text":"","category":"section"},{"location":"basics/","page":"Basics","title":"Basics","text":"This section is still in preparation. For now, you might want to take a look at the translation guide and the examples.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"EditURL = \"integration.jl\"","category":"page"},{"location":"literate/integration/integration/#Trapezoidal-Integration","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"section"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"In this example, we want to parallelize the computation of a simple numerical integral via the trapezoidal rule. The latter is given by","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"int_a^bf(x)dx approx h sum_i=1^Nfracf(x_i-1)+f(x_i)2","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The function to be integrated is the following.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f(x) = 4 * √(1 - x^2)","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f (generic function with 1 method)","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The analytic result of the definite integral (from 0 to 1) is known to be pi.","category":"page"},{"location":"literate/integration/integration/#Sequential","page":"Trapezoidal Integration","title":"Sequential","text":"","category":"section"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Naturally, we implement the trapezoidal rule as a straightforward, sequential for loop.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"function trapezoidal(a, b, n; h = (b - a) / n)\n y = (f(a) + f(b)) / 2.0\n for i in 1:(n - 1)\n x = a + i * h\n y = y + f(x)\n end\n return y * h\nend","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal (generic function with 1 method)","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Let's compute the integral of f above and see if we get the expected result. 
For simplicity, we choose N, the number of panels used to discretize the integration interval, as a multiple of the number of available Julia threads.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using Base.Threads: nthreads\n\nN = nthreads() * 1_000_000","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"10000000","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Calling trapezoidal we do indeed find the (approximate) value of pi.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal(0, 1, N) ≈ π","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"literate/integration/integration/#Parallel","page":"Trapezoidal Integration","title":"Parallel","text":"","category":"section"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Our strategy is the following: Divide the integration interval among the available Julia threads. On each thread, use the sequential trapezoidal rule to compute the partial integral. It is straightforward to implement this strategy with tmapreduce. The map part is, essentially, the application of trapezoidal and the reduction operator is chosen to be + to sum up the local integrals.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using OhMyThreads\n\nfunction trapezoidal_parallel(a, b, N)\n n = N ÷ nthreads()\n h = (b - a) / N\n return tmapreduce(+, 1:nthreads()) do i\n local α = a + (i - 1) * n * h # the local keywords aren't necessary but good practice\n local β = α + n * h\n trapezoidal(α, β, n; h)\n end\nend\n\n# or equivalently\n#\n# function trapezoidal_parallel(a, b, N)\n# n = N ÷ nthreads()\n# h = (b - a) / N\n# @tasks for i in 1:nthreads()\n# @set reducer=+\n# local α = a + (i - 1) * n * h\n# local β = α + n * h\n# trapezoidal(α, β, n; h)\n# end\n# end","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel (generic function with 1 method)","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"First, we check the correctness of our parallel implementation.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel(0, 1, N) ≈ π","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Then, we benchmark and compare the performance of the sequential and parallel versions.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using BenchmarkTools\n@btime trapezoidal(0, 1, $N);\n@btime 
trapezoidal_parallel(0, 1, $N);","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":" 24.348 ms (0 allocations: 0 bytes)\n 2.457 ms (69 allocations: 6.05 KiB)\n","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Because the problem is trivially parallel - all threads to the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"nthreads()","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"10","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"This page was generated using Literate.jl.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"EditURL = \"falsesharing.jl\"","category":"page"},{"location":"literate/falsesharing/falsesharing/#FalseSharing","page":"False Sharing","title":"False Sharing","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"False Sharing is a very common but subtle performance issue that comes up again and again when writing parallel code manually. For this reason, we shall discuss what it is about and how to avoid it.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"For simplicity, let's focus on a specific example: parallel summation.","category":"page"},{"location":"literate/falsesharing/falsesharing/#Baseline:-sequential-summation","page":"False Sharing","title":"Baseline: sequential summation","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"To establish a baseline, that we can later compare against, we define some fake data, which we'll sum up, and benchmark Julia's built-in, non-parallel sum function.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"using Base.Threads: nthreads\nusing BenchmarkTools\n\ndata = rand(1_000_000 * nthreads());\n@btime sum($data);","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 2.327 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/#The-problematic-parallel-implementation","page":"False Sharing","title":"The problematic parallel implementation","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"A conceptually simple (and valid) approach to parallelizing the summation is to divide the full computation into parts. Specifically, the idea is to divide the data into chunks, compute the partial sums of these chunks in parallel, and finally sum up the partial results. 
(Note that we will not concern ourselves with potential minor or catastrophic numerical errors due to potential rearrangements of terms in the summation here.)","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"A common, manual implementation of this idea might look like this:","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"using OhMyThreads: @spawn, chunks\n\nfunction parallel_sum_falsesharing(data; nchunks = nthreads())\n psums = zeros(eltype(data), nchunks)\n @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))\n @spawn begin\n for i in idcs\n psums[c] += data[i]\n end\n end\n end\n return sum(psums)\nend","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"parallel_sum_falsesharing (generic function with 1 method)","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"The code is pretty straightforward: We allocate space for the results of the partial sums (psums) and, on nchunks many tasks, add up the data elements of each partial sum in parallel. More importantly, and in this context perhaps surprisingly, the code is also correct in the sense that it produces the desired result.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"using Test\n@test sum(data) ≈ parallel_sum_falsesharing(data)","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Test Passed","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"This is just a reflection of the fact that there is no logical sharing of data - because each parallel tasks modifies a different element of psums - implying the absence of race conditions.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"What's the issue then?! Well, the sole purpose of parallelization is to reduce runtime. So let's see how well we're doing in this respect.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"nthreads()","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"10","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"@btime parallel_sum_falsesharing($data);","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 52.919 ms (221 allocations: 18.47 KiB)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"A (huge) slowdown?! Clearly, that's the opposite of what we tried to achieve!","category":"page"},{"location":"literate/falsesharing/falsesharing/#The-issue:-False-sharing","page":"False Sharing","title":"The issue: False sharing","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Although our parallel summation above is semantically correct, it has a big performance issue: False sharing. 
To understand false sharing, we have to think a little bit about how computers work. Specifically, we need to realize that processors cache memory in lines (rather than individual elements) and that caches of different processors are kept coherent. When two (or more) different CPU cores operate on independent data elements that fall into the same cache line (i.e. they are part of the same memory address region) the cache coherency mechanism leads to costly synchronization between cores.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"In our case, this happens despite the fact that different parallel tasks (on different CPU cores) logically don't care about the rest of the data in the cache line at all.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"(Image: )","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Given these insights, we can come up with a few workarounds that mitigate the issue. The most prominent is probably padding, where one simply adds sufficiently many unused zeros to psums such that different partial sum counters don't fall into the same cache line. However, let's discuss a more fundamental, more efficient, and more elegant solution.","category":"page"},{"location":"literate/falsesharing/falsesharing/#Task-local-parallel-summation","page":"False Sharing","title":"Task-local parallel summation","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"The key mistake in parallel_sum_falsesharing above is the non-local modification of (implicitly) shared state (cache lines of psums) very frequently (in the innermost loop). We can simply avoid this by making the code more task-local. To this end, we introduce a task-local accumulator variable, which we use to perform the task-local partial sums. Only at the very end do we communicate the result to the main thread, e.g. by writing it into psums (once!).","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"function parallel_sum_tasklocal(data; nchunks = nthreads())\n psums = zeros(eltype(data), nchunks)\n @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))\n @spawn begin\n local s = zero(eltype(data))\n for i in idcs\n s += data[i]\n end\n psums[c] = s\n end\n end\n return sum(psums)\nend\n\n@test sum(data) ≈ parallel_sum_tasklocal(data)\n@btime parallel_sum_tasklocal($data);","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 1.120 ms (221 allocations: 18.55 KiB)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Finally, there is a speed up! 🎉","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Two comments are in order.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"First, we note that the only role that psums plays is as a temporary storage for the results from the parallel tasks to be able to sum them up eventually. We could get rid of it entirely by using a Threads.Atomic instead which would get updated via Threads.atomic_add! 
from each task directly. However, for our discussion, this is a detail and we won't discuss it further.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Secondly, while keeping the general idea, we can drastically simplify the above code by using map and reusing the built-in (sequential) sum function on each parallel task:","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"function parallel_sum_map(data; nchunks = nthreads())\n ts = map(chunks(data, n = nchunks)) do idcs\n @spawn @views sum(data[idcs])\n end\n return sum(fetch.(ts))\nend\n\n@test sum(data) ≈ parallel_sum_map(data)\n@btime parallel_sum_map($data);","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 893.396 μs (64 allocations: 5.72 KiB)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"This implementation is conceptually clearer in that there is no explicit modification of shared state, i.e. no pums[c] = s, anywhere at all. We can't run into false sharing if we don't modify shared state 😉.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Note that since we use the built-in sum function, which is highly optimized, we might see better runtimes due to other effects - like SIMD and the absence of bounds checks - compared to the simple for-loop accumulation in parallel_sum_tasklocal above.","category":"page"},{"location":"literate/falsesharing/falsesharing/#Parallel-summation-with-OhMyThreads","page":"False Sharing","title":"Parallel summation with OhMyThreads","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Finally, all of the above is abstracted away for you if you simply use treduce to implement the parallel summation. It also only takes a single line and function call.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"using OhMyThreads: treduce\n\n@test sum(data) ≈ treduce(+, data; ntasks = nthreads())\n@btime treduce($+, $data; ntasks = $nthreads());","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 899.097 μs (68 allocations: 5.92 KiB)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"This page was generated using Literate.jl.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"EditURL = \"tls.jl\"","category":"page"},{"location":"literate/tls/tls/#TSS","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"For some programs, it can be useful or even necessary to allocate and (re-)use memory in your parallel code (e.g. your computation might require temporary buffers). 
The following section demonstrates common issues that can arise in such a scenario and, by means of a simple example, explains techniques to handle such cases safely. Specifically, we'll discuss (1) how task-local storage (TLS) can be used efficiently and (2) how channels can be used to organize per-task buffer allocation in a thread-safe manner.","category":"page"},{"location":"literate/tls/tls/#Test-case-(sequential)","page":"Thread-Safe Storage","title":"Test case (sequential)","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Let's say that we are given two arrays of matrices, As and Bs, and let's further assume that our goal is to compute the total sum of all pairwise matrix products. We can readily implement a (sequential) function that performs the necessary computations.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using LinearAlgebra: mul!, BLAS\nBLAS.set_num_threads(1) # for simplicity, we turn off OpenBLAS multithreading\n\nfunction matmulsums(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n map(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"matmulsums (generic function with 1 method)","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Here, we use map to perform the desired operation for each pair of matrices, A and B. However, the crucial point for our discussion is that we want to use the in-place matrix multiplication LinearAlgebra.mul! in conjunction with a pre-allocated temporary buffer, the output matrix C. This is to avoid the temporary allocation per \"iteration\" (i.e. per matrix pair) that we would get with C = A*B.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"For later comparison, we generate some random input data and store the result.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"As = [rand(256, 16) for _ in 1:768]\nBs = [rand(16, 256) for _ in 1:768]\n\nres = matmulsums(As, Bs);","category":"page"},{"location":"literate/tls/tls/#How-to-not-parallelize","page":"Thread-Safe Storage","title":"How to not parallelize","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"The key idea for creating a parallel version of matmulsums is to replace the map by OhMyThreads' parallel tmap function. However, because we re-use C, this isn't entirely trivial. 
Someone new to parallel computing might be tempted to parallelize matmulsums like this:","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: tmap\n\nfunction matmulsums_race(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n tmap(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"matmulsums_race (generic function with 1 method)","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Unfortunately, this doesn't produce the correct result.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"res_race = matmulsums_race(As, Bs)\nres ≈ res_race","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"false","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"In fact, it doesn't even always produce the same result (check for yourself)! The reason is that there is a race condition: different parallel tasks are trying to use the shared variable C simultaneously leading to non-deterministic behavior. Let's see how we can fix this.","category":"page"},{"location":"literate/tls/tls/#The-naive-(and-inefficient)-fix","page":"Thread-Safe Storage","title":"The naive (and inefficient) fix","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"A simple solution for the race condition issue above is to move the allocation of C into the body of the parallel tmap:","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"function matmulsums_naive(As, Bs)\n N = size(first(As), 1)\n tmap(As, Bs) do A, B\n C = Matrix{Float64}(undef, N, N)\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"matmulsums_naive (generic function with 1 method)","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"In this case, a separate C will be allocated for each iteration such that parallel tasks no longer mutate shared state. Hence, we'll get the desired result.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"res_naive = matmulsums_naive(As, Bs)\nres ≈ res_naive","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"However, this variant is obviously inefficient because it is no better than just writing C = A*B and thus leads to one allocation per matrix pair. 
We need a different way of allocating and re-using C for an efficient parallel version.","category":"page"},{"location":"literate/tls/tls/#TLS","page":"Thread-Safe Storage","title":"Task-local storage","text":"","category":"section"},{"location":"literate/tls/tls/#The-manual-(and-cumbersome)-way","page":"Thread-Safe Storage","title":"The manual (and cumbersome) way","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"We've seen that we can't allocate C once up-front (→ race condition) and also shouldn't allocate it within the tmap (→ one allocation per iteration). Instead, we can assign a separate \"C\" on each parallel task once and then use this task-local \"C\" for all iterations (i.e. matrix pairs) for which this task is responsible. Before we learn how to do this more conveniently, let's implement this idea of a task-local temporary buffer (for each parallel task) manually.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: chunks, @spawn\nusing Base.Threads: nthreads\n\nfunction matmulsums_manual(As, Bs)\n N = size(first(As), 1)\n tasks = map(chunks(As; n = 2 * nthreads())) do idcs\n @spawn begin\n local C = Matrix{Float64}(undef, N, N)\n map(idcs) do i\n A = As[i]\n B = Bs[i]\n\n mul!(C, A, B)\n sum(C)\n end\n end\n end\n mapreduce(fetch, vcat, tasks)\nend\n\nres_manual = matmulsums_manual(As, Bs)\nres ≈ res_manual","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"We note that this is rather cumbersome and you might not want to write it (repeatedly). But let's take a closer look and see what's happening here. First, we divide the number of matrix pairs into 2 * nthreads() chunks. Then, for each of those chunks, we spawn a parallel task that (1) allocates a task-local C matrix (and a results vector) and (2) performs the actual computations using these pre-allocated buffers. Finally, we fetch the results of the tasks and combine them. This variant works just fine and the good news is that we can get the same behavior with less manual work.","category":"page"},{"location":"literate/tls/tls/#TLV","page":"Thread-Safe Storage","title":"The shortcut: TaskLocalValue","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"The desire for task-local storage is quite natural with task-based multithreading. For this reason, Julia supports this out of the box with Base.task_local_storage. But instead of using this directly (which you could), we will use a convenience wrapper around it called TaskLocalValue. This allows us to express the idea from above in few lines of code:","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: TaskLocalValue\n\nfunction matmulsums_tlv(As, Bs; kwargs...)\n N = size(first(As), 1)\n tlv = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))\n tmap(As, Bs; kwargs...) 
do A, B\n C = tlv[]\n mul!(C, A, B)\n sum(C)\n end\nend\n\nres_tlv = matmulsums_tlv(As, Bs)\nres ≈ res_tlv","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Here, TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) creates a task-local value - essentially a reference to a value in the task-local storage - that behaves like this: The first time the task-local value is accessed from a task (tlv[]) it is initialized according to the provided anonymous function. Afterwards, every following query (from the same task!) will simply look up and return the task-local value. This solves our issues above and leads to O(parallel tasks) (instead of O(iterations)) allocations.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Note that if you use our @tasks macro API, there is built-in support for task-local values via @local.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: @tasks\n\nfunction matmulsums_tlv_macro(As, Bs; kwargs...)\n N = size(first(As), 1)\n @tasks for i in eachindex(As, Bs)\n @set collect = true\n @local C = Matrix{Float64}(undef, N, N)\n mul!(C, As[i], Bs[i])\n sum(C)\n end\nend\n\nres_tlv_macro = matmulsums_tlv_macro(As, Bs)\nres ≈ res_tlv_macro","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Here, @local expands to a pattern similar to the TaskLocalValue one above, although it automatically infers that the object's type is Matrix{Float64}, and it carries some optimizations (see OhMyThreads.WithTaskLocals) which can make accessing task local values more efficient in loops which take on the order of 100ns to complete.","category":"page"},{"location":"literate/tls/tls/#Benchmark","page":"Thread-Safe Storage","title":"Benchmark","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"The whole point of parallelization is increasing performance, so let's benchmark and compare the performance of the variants that we've discussed so far.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using BenchmarkTools\n\n@show nthreads()\n\n@btime matmulsums($As, $Bs);\n@btime matmulsums_naive($As, $Bs);\n@btime matmulsums_manual($As, $Bs);\n@btime matmulsums_tlv($As, $Bs);\n@btime matmulsums_tlv_macro($As, $Bs);","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"nthreads() = 10\n 61.314 ms (3 allocations: 518.17 KiB)\n 22.122 ms (1621 allocations: 384.06 MiB)\n 7.620 ms (204 allocations: 10.08 MiB)\n 7.702 ms (126 allocations: 5.03 MiB)\n 7.600 ms (127 allocations: 5.03 MiB)\n","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"As we can see, matmulsums_tlv (and matmulsums_tlv_macro) isn't only convenient but also efficient: It allocates much less memory than matmulsums_naive and is about on par with the manual 
implementation.","category":"page"},{"location":"literate/tls/tls/#Per-thread-allocation","page":"Thread-Safe Storage","title":"Per-thread allocation","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"The task-local solution above has one potential caveat: If we spawn many parallel tasks (e.g. for load-balancing reasons) we need just as many task-local buffers. This can clearly be suboptimal because only nthreads() tasks can run simultaneously. Hence, one buffer per thread should actually suffice. Of course, this raises the question of how to organize a pool of \"per-thread\" buffers such that each running task always has exclusive (temporary) access to a buffer (we need to make sure to avoid races).","category":"page"},{"location":"literate/tls/tls/#The-naive-(and-incorrect)-approach","page":"Thread-Safe Storage","title":"The naive (and incorrect) approach","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"A naive approach to implementing this idea is to pre-allocate an array of buffers and then to use the threadid() to select a buffer for a running task.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using Base.Threads: threadid\n\nfunction matmulsums_perthread_naive(As, Bs)\n N = size(first(As), 1)\n Cs = [Matrix{Float64}(undef, N, N) for _ in 1:nthreads()]\n tmap(As, Bs) do A, B\n C = Cs[threadid()]\n mul!(C, A, B)\n sum(C)\n end\nend\n\n# non uniform workload\nAs_nu = [rand(256, isqrt(i)^2) for i in 1:768];\nBs_nu = [rand(isqrt(i)^2, 256) for i in 1:768];\nres_nu = matmulsums(As_nu, Bs_nu);\n\nres_pt_naive = matmulsums_perthread_naive(As_nu, Bs_nu)\nres_nu ≈ res_pt_naive","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Unfortunately, this approach is generally wrong. The first issue is that threadid() doesn't necessarily start at 1 (and thus might return a value > nthreads()), in which case Cs[threadid()] would be an out-of-bounds access attempt. This might be surprising but is a simple consequence of the ordering of different kinds of Julia threads: If Julia is started with a non-zero number of interactive threads, e.g. --threads 5,2, the interactive threads come first (look at Threads.threadpool.(1:Threads.maxthreadid())).","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"But even if we account for this offset there is another, more fundamental problem, namely task-migration. By default, all spawned parallel tasks are \"non-sticky\" and can dynamically migrate between different Julia threads (loosely speaking, at any point in time). This means nothing other than that threadid() is not necessarily constant for a task! For example, imagine that task A starts on thread 4, loads the buffer Cs[4], but then gets paused, migrated, and continues execution on, say, thread 5. Afterwards, while task A is performing mul!(Cs[4], ...), a different task B might start on (the now available) thread 4 and also read and use Cs[4]. This would lead to a race condition because both tasks are mutating the same buffer. 
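As a minimal, hypothetical sketch (not part of the original example) of why threadid() cannot be assumed constant for a task, one could record the thread id before and after a yield point and compare the two:\n\nusing Base.Threads: @spawn, threadid\n\nt = @spawn begin\n    before = threadid()\n    sleep(0.1)  # yield point: a non-sticky task may migrate to a different thread here\n    after = threadid()\n    (before, after)\nend\nfetch(t)  # the two ids are not guaranteed to be equal\n\nIf the two ids differ, the task has migrated; either way, the race described above is possible. 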
(Note that, in practice, this - most likely 😉 - doesn't happen for the very simple example above, but you can't rely on it!)","category":"page"},{"location":"literate/tls/tls/#The-quick-fix-(with-caveats)","page":"Thread-Safe Storage","title":"The quick fix (with caveats)","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"A simple solution for the task-migration issue is to opt-out of dynamic scheduling with scheduler=:static (or scheduler=StaticScheduler()). This scheduler statically assigns tasks to threads upfront without any dynamic rescheduling (the tasks are sticky and won't migrate).","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"function matmulsums_perthread_static(As, Bs)\n N = size(first(As), 1)\n Cs = [Matrix{Float64}(undef, N, N) for _ in 1:nthreads()]\n tmap(As, Bs; scheduler = :static) do A, B\n C = Cs[threadid()]\n mul!(C, A, B)\n sum(C)\n end\nend\n\nres_pt_static = matmulsums_perthread_static(As_nu, Bs_nu)\nres_nu ≈ res_pt_static","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"However, this approach doesn't solve the offset issue and, even worse, makes the parallel code non-composable: If we call other multithreaded functions within the tmap or if our parallel matmulsums_perthread_static itself gets called from another parallel region we will likely oversubscribe the Julia threads and get subpar performance. Given these caveats, we should therefore generally take a different approach.","category":"page"},{"location":"literate/tls/tls/#The-safe-way:-Channel","page":"Thread-Safe Storage","title":"The safe way: Channel","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Instead of storing the pre-allocated buffers in an array, we can put them into a Channel which internally ensures that parallel access is safe. In this scenario, we simply take! a buffer from the channel whenever we need it and put! it back after our computation is done.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"function matmulsums_perthread_channel(As, Bs; nbuffers = nthreads(), kwargs...)\n N = size(first(As), 1)\n chnl = Channel{Matrix{Float64}}(nbuffers)\n foreach(1:nbuffers) do _\n put!(chnl, Matrix{Float64}(undef, N, N))\n end\n tmap(As, Bs; kwargs...) do A, B\n C = take!(chnl)\n mul!(C, A, B)\n result = sum(C)\n put!(chnl, C)\n result\n end\nend\n\nres_pt_channel = matmulsums_perthread_channel(As_nu, Bs_nu)\nres_nu ≈ res_pt_channel","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/#Benchmark-2","page":"Thread-Safe Storage","title":"Benchmark","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Let's benchmark the variants above and compare them to the task-local implementation. 
We want to look at both ntasks = nthreads() and ntasks > nthreads(), the latter of which gives us dynamic load balancing.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"# no load balancing because ntasks == nthreads()\n@btime matmulsums_tlv($As_nu, $Bs_nu);\n@btime matmulsums_perthread_static($As_nu, $Bs_nu);\n@btime matmulsums_perthread_channel($As_nu, $Bs_nu);\n\n# load balancing because ntasks > nthreads()\n@btime matmulsums_tlv($As_nu, $Bs_nu; ntasks = 2 * nthreads());\n@btime matmulsums_perthread_channel($As_nu, $Bs_nu; ntasks = 2 * nthreads());\n\n@btime matmulsums_tlv($As_nu, $Bs_nu; ntasks = 10 * nthreads());\n@btime matmulsums_perthread_channel($As_nu, $Bs_nu; ntasks = 10 * nthreads());","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":" 170.563 ms (126 allocations: 5.03 MiB)\n 165.647 ms (108 allocations: 5.02 MiB)\n 172.216 ms (114 allocations: 5.02 MiB)\n 108.662 ms (237 allocations: 10.05 MiB)\n 114.673 ms (185 allocations: 5.04 MiB)\n 97.933 ms (1118 allocations: 50.13 MiB)\n 96.868 ms (746 allocations: 5.10 MiB)\n","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Note that the runtime of matmulsums_perthread_channel improves with increasing number of chunks/tasks (due to load balancing) while the amount of allocated memory doesn't increase much. Contrast this with the drastic memory increase with matmulsums_tlv.","category":"page"},{"location":"literate/tls/tls/#Another-safe-way-based-on-Channel","page":"Thread-Safe Storage","title":"Another safe way based on Channel","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Above, we chose to put a limited number of buffers (e.g. nthreads()) into the channel and then spawn many tasks (one per input element). Sometimes it can make sense to flip things around and put the (many) input elements into a channel and only spawn a limited number of tasks (e.g. nthreads()) with task-local buffers.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: tmapreduce\n\nfunction matmulsums_perthread_channel_flipped(As, Bs; ntasks = nthreads())\n N = size(first(As), 1)\n chnl = Channel{Int}(length(As); spawn = true) do chnl\n for i in 1:length(As)\n put!(chnl, i)\n end\n end\n tmapreduce(vcat, 1:ntasks; chunking=false) do _ # we turn chunking off\n local C = Matrix{Float64}(undef, N, N)\n map(chnl) do i # implicitly takes the values from the channel (parallel safe)\n A = As[i]\n B = Bs[i]\n mul!(C, A, B)\n sum(C)\n end\n end\nend;","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Note that one caveat of this approach is that the input → task assignment, and thus the order of the output, is non-deterministic. 
For this reason, we sort the output to check for correctness.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"res_channel_flipped = matmulsums_perthread_channel_flipped(As_nu, Bs_nu)\nsort(res_nu) ≈ sort(res_channel_flipped)","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Quick benchmark:","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"@btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu);\n@btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu; ntasks = 2 * nthreads());\n@btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu; ntasks = 10 * nthreads());","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":" 94.389 ms (170 allocations: 5.07 MiB)\n 94.580 ms (271 allocations: 10.10 MiB)\n 94.768 ms (1071 allocations: 50.41 MiB)\n","category":"page"},{"location":"literate/tls/tls/#Bumper.jl-(only-for-the-brave)","page":"Thread-Safe Storage","title":"Bumper.jl (only for the brave)","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"If you are bold and want to cut down temporary allocations even more you can give Bumper.jl a try. Essentially, it allows you to bring your own stacks, that is, task-local bump allocators which you can dynamically allocate memory to, and reset them at the end of a code block, just like Julia's stack. Be warned though that Bumper.jl is (1) a rather young package with (likely) some bugs and (2) can easily lead to segfaults when used incorrectly. If you can live with the risk, Bumper.jl is especially useful for cases where the size of the preallocated matrix isn't known ahead of time, and even more useful if we want to do many intermediate allocations on the task, not just one. 
For our example, this isn't the case but let's nonetheless see how one would use Bumper.jl here.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using Bumper\n\nfunction matmulsums_bumper(As, Bs)\n tmap(As, Bs) do A, B\n @no_escape begin # promising that no memory will escape\n N = size(A, 1)\n C = @alloc(Float64, N, N) # from bump allocator (fake \"stack\")\n mul!(C, A, B)\n sum(C)\n end\n end\nend\n\nres_bumper = matmulsums_bumper(As, Bs);\nsort(res) ≈ sort(res_bumper)\n\n@btime matmulsums_bumper($As, $Bs);","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":" 7.814 ms (134 allocations: 27.92 KiB)\n","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Note that the benchmark is lying here about the total memory allocation, because it doesn't show the allocation of the task-local bump allocators themselves (the reason is that SlabBuffer uses malloc directly).","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"This page was generated using Literate.jl.","category":"page"},{"location":"translation/#TG","page":"Translation Guide","title":"Translation Guide","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"This page tries to give a general overview of how to translate patterns written with the built-in tools of Base.Threads using the OhMyThreads.jl API. Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into OhMyThreads.jl.","category":"page"},{"location":"translation/#Basics","page":"Translation Guide","title":"Basics","text":"","category":"section"},{"location":"translation/#@threads","page":"Translation Guide","title":"@threads","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @threads\n\n@threads for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n println(i)\nend\n\n# or\nusing OhMyThreads: tforeach\n\ntforeach(1:10) do i\n println(i)\nend","category":"page"},{"location":"translation/#:static-scheduling","page":"Translation Guide","title":":static scheduling","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @threads\n\n@threads :static for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n @set scheduler=:static\n println(i)\nend\n\n# or\nusing OhMyThreads: tforeach\n\ntforeach(1:10; scheduler=:static) do i\n println(i)\nend","category":"page"},{"location":"translation/#@spawn","page":"Translation Guide","title":"@spawn","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @spawn\n\n@sync for i in 1:10\n @spawn 
println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n @set chunking=false\n println(i)\nend\n\n# or\nusing OhMyThreads: tforeach\n\ntforeach(1:10; chunking=false) do i\n println(i)\nend\n\n# or\nusing OhMyThreads: @spawn\n\n@sync for i in 1:10\n @spawn println(i)\nend","category":"page"},{"location":"translation/#Reduction","page":"Translation Guide","title":"Reduction","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"No built-in feature in Base.Threads.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads: basic manual implementation\nusing Base.Threads: @spawn\n\ndata = rand(10)\nchunks_itr = Iterators.partition(data, length(data) ÷ nthreads())\ntasks = map(chunks_itr) do chunk\n @spawn reduce(+, chunk)\nend\nreduce(+, fetch.(tasks))","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\ndata = rand(10)\n\n@tasks for x in data\n @set reducer=+\nend\n\n# or\nusing OhMyThreads: treduce\n\ntreduce(+, data)","category":"page"},{"location":"translation/#Mutation","page":"Translation Guide","title":"Mutation","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). You should carefully consider whether this is necessary or whether the use of thread-safe storage is the better option. We don't recommend using the examples in this section for anything serious!","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @threads\ndata = rand(10)\n\n@threads for i in eachindex(data)\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\ndata = rand(10)\n\n@tasks for i in eachindex(data)\n data[i] = calc(i)\nend\n\n# or\nusing OhMyThreads: tforeach\n\ntforeach(eachindex(data)) do i\n data[i] = calc(i)\nend\n\n# or\nusing OhMyThreads: tmap!\n\ntmap!(data, eachindex(data)) do i\n calc(i)\nend","category":"page"},{"location":"translation/#Parallel-initialization","page":"Translation Guide","title":"Parallel initialization","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). You should carefully consider whether this is necessary or whether the use of thread-safe storage is the better option. 
We don't recommend using the examples in this section for anything serious!","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @threads\n\ndata = Vector{Float64}(undef, 10)\n@threads for i in eachindex(data)\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\n\ndata = @tasks for i in 1:10\n @set collect=true\n calc(i)\nend\n\n# or\nusing OhMyThreads: tmap\n\ndata = tmap(i->calc(i), 1:10)\n\n# or\nusing OhMyThreads: tcollect\n\ndata = tcollect(calc(i) for i in 1:10)","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"literate/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at a unit square and count how many of them (M) landed inside of a unit circle. Approximate pi approx 4MN.","category":"page"},{"location":"literate/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14171236","category":"page"},{"location":"literate/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N; kwargs...)\n M = tmapreduce(+, 1:N; kwargs...) 
do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\n# or alternatively\n#\n# function mc_parallel(N)\n# M = @tasks for _ in 1:N\n# @set reducer = +\n# rand()^2 + rand()^2 < 1.0\n# end\n# pi = 4 * M / N\n# return pi\n# end\n\nmc_parallel(N)","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14156496","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 10\n 301.636 ms (0 allocations: 0 bytes)\n 41.864 ms (68 allocations: 5.81 KiB)\n","category":"page"},{"location":"literate/mc/mc/#Static-scheduling","page":"Parallel Monte Carlo","title":"Static scheduling","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Because the workload is highly uniform, it makes sense to also try the StaticScheduler and compare the performance of static and dynamic scheduling (with default parameters).","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: StaticScheduler\n\n@btime mc_parallel($N; scheduler=:dynamic) samples=10 evals=3; # default\n@btime mc_parallel($N; scheduler=:static) samples=10 evals=3;","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 41.839 ms (68 allocations: 5.81 KiB)\n 41.838 ms (68 allocations: 5.81 KiB)\n","category":"page"},{"location":"literate/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. 
Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn, chunks\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14180504","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 30.224 ms (65 allocations: 5.70 KiB)\n","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 41.750 ms (0 allocations: 0 bytes)\n 30.148 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/experimental/","page":"Experimental","title":"Experimental","text":"CollapsedDocStrings = true","category":"page"},{"location":"refs/experimental/#Experimental","page":"Experimental","title":"Experimental","text":"","category":"section"},{"location":"refs/experimental/","page":"Experimental","title":"Experimental","text":"warning: Warning\nEverything on this page is experimental and might be changed or dropped at any point!","category":"page"},{"location":"refs/experimental/#References","page":"Experimental","title":"References","text":"","category":"section"},{"location":"refs/experimental/","page":"Experimental","title":"Experimental","text":"Modules = [OhMyThreads, OhMyThreads.Experimental]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"experimental.jl\"]","category":"page"},{"location":"refs/experimental/#OhMyThreads.Experimental.@barrier-Tuple","page":"Experimental","title":"OhMyThreads.Experimental.@barrier","text":"@barrier\n\nThis can be used inside a @tasks for ... end to synchronize n parallel tasks. Specifically, a task can only pass the @barrier if n-1 other tasks have reached it as well. The value of n is determined from @set ntasks=..., which is required if one wants to use @barrier.\n\nBecause this feature is experimental, it is required to load @barrier explicitly, e.g. 
via using OhMyThreads.Experimental: @barrier.\n\nWARNING: It is the responsibility of the user to ensure that the right number of tasks actually reach the barrier. Otherwise, a deadlock can occur. In particular, if the number of iterations is not a multiple of n, the last few iterations (remainder) will be run by fewer than n tasks, which will never be able to pass a @barrier.\n\nExample\n\nusing OhMyThreads: @tasks\n\n# works\n@tasks for i in 1:20\n @set ntasks = 20\n\n sleep(i * 0.2)\n println(i, \": before\")\n @barrier\n println(i, \": after\")\nend\n\n# wrong - deadlock!\n@tasks for i in 1:22 # ntasks % niterations != 0\n @set ntasks = 20\n\n println(i, \": before\")\n @barrier\n println(i, \": after\")\nend\n\n\n\n\n\n","category":"macro"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-based multithreaded calculations in Julia. Most importantly, with a focus on data parallelism, it provides an API of higher-order functions (e.g. tmapreduce) as well as a macro API @tasks for ... end (conceptually similar to @threads).","category":"page"},{"location":"#Quick-Start","page":"OhMyThreads","title":"Quick Start","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Basic-example","page":"OhMyThreads","title":"Basic example","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"using OhMyThreads: tmapreduce, @tasks\nusing BenchmarkTools: @btime\nusing Base.Threads: nthreads\n\n# Variant 1: function API\nfunction mc_parallel(N; ntasks=nthreads())\n M = tmapreduce(+, 1:N; ntasks) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\n# Variant 2: macro API\nfunction mc_parallel_macro(N; ntasks=nthreads())\n M = @tasks for i in 1:N\n @set begin\n reducer=+\n ntasks=ntasks\n end\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\nmc_parallel(N) # gives, e.g., 3.14159924\n\n@btime mc_parallel($N; ntasks=1) # use a single task (and hence a single thread)\n@btime mc_parallel($N) # using all threads\n@btime mc_parallel_macro($N) # using all threads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"With 5 threads, timings might be something like this:","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"417.282 ms (14 allocations: 912 bytes)\n83.578 ms (38 allocations: 3.08 KiB)\n83.573 ms (38 allocations: 3.08 KiB)","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"(Check out the full Parallel Monte Carlo example if you like.)","category":"page"},{"location":"#No-Transducers","page":"OhMyThreads","title":"No Transducers","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. 
Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.","category":"page"}] +[{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"literate/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. Hence, the computational workload is non-uniform.","category":"page"},{"location":"literate/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"literate/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! 
operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\n# or alternatively\n#\n# function compute_juliaset_parallel!(img; kwargs...)\n# N = size(img, 1)\n# cart = CartesianIndices(img)\n# @tasks for idx in eachindex(img)\n# c = cart[idx]\n# img[idx] = _compute_pixel(c[1], c[2], N)\n# end\n# return img\n# end\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"literate/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@show nthreads()\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"nthreads() = 10\n 131.295 ms (0 allocations: 0 bytes)\n 31.422 ms (68 allocations: 6.09 KiB)\n","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is much faster!","category":"page"},{"location":"literate/juliaset/juliaset/#Dynamic-vs-static-scheduling","page":"Julia Set","title":"Dynamic vs static scheduling","text":"","category":"section"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we do benefit from the load balancing of the default dynamic scheduler. The latter divides the overall workload into tasks that can then be dynamically distributed among threads to adjust the per-thread load. We can try to fine tune and improve the load balancing further by increasing the ntasks parameter of the scheduler, that is, creating more tasks with smaller per-task workload.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: DynamicScheduler\n\n@btime compute_juliaset_parallel!($img; ntasks=N, scheduler=:dynamic) samples=10 evals=3;","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 17.438 ms (12018 allocations: 1.11 MiB)\n","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that while this turns out to be a bit faster, it comes at the expense of much more allocations.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"To quantify the impact of load balancing we can opt out of dynamic scheduling and use the StaticScheduler instead. 
The latter doesn't provide any form of load balancing.","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: StaticScheduler\n\n@btime compute_juliaset_parallel!($img; scheduler=:static) samples=10 evals=3;","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 30.097 ms (73 allocations: 6.23 KiB)\n","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"literate/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"CollapsedDocStrings = true","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"warning: Warning\nEverything on this page is internal and might be changed or dropped at any point!","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.OnlyOneRegion","page":"Internal","title":"OhMyThreads.Tools.OnlyOneRegion","text":"May be used to mark a region in parallel code to be executed by a single task only (all other tasks shall skip over it).\n\nSee try_enter! and reset!.\n\n\n\n\n\n","category":"type"},{"location":"refs/internal/#OhMyThreads.Tools.SimpleBarrier","page":"Internal","title":"OhMyThreads.Tools.SimpleBarrier","text":"SimpleBarrier(n::Integer)\n\nSimple reusable barrier for n parallel tasks.\n\nGiven b = SimpleBarrier(n) and n parallel tasks, each task that calls wait(b) will block until the other n-1 tasks have called wait(b) as well.\n\nExample\n\nn = nthreads()\nbarrier = SimpleBarrier(n)\n@sync for i in 1:n\n @spawn begin\n println(\"A\")\n wait(barrier) # synchronize all tasks\n println(\"B\")\n wait(barrier) # synchronize all tasks (reusable)\n println(\"C\")\n end\nend\n\n\n\n\n\n","category":"type"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"nthtid(n)\n\nReturns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.reset!-Tuple{OhMyThreads.Tools.OnlyOneRegion}","page":"Internal","title":"OhMyThreads.Tools.reset!","text":"Reset the OnlyOneRegion (so that it can be used again).\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. 
This identifier will be unique so long as references to the task it came from still exist.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.try_enter!-Tuple{Any, OhMyThreads.Tools.OnlyOneRegion}","page":"Internal","title":"OhMyThreads.Tools.try_enter!","text":"try_enter!(f, s::OnlyOneRegion)\n\nWhen called from multiple parallel tasks (on a shared s::OnlyOneRegion) only a single task will execute f.\n\nExample\n\nusing OhMyThreads: @tasks\nusing OhMyThreads.Tools: OnlyOneRegion, try_enter!\n\nonly_one = OnlyOneRegion()\n\n@tasks for i in 1:10\n @set ntasks = 10\n\n println(i, \": before\")\n try_enter!(only_one) do\n println(i, \": only printed by a single task\")\n sleep(1)\n end\n println(i, \": after\")\nend\n\n\n\n\n\n","category":"method"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"CollapsedDocStrings = true","category":"page"},{"location":"refs/api/#API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/#Macros","page":"Public API","title":"Macros","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"@tasks\n@set\n@local\n@only_one\n@one_by_one","category":"page"},{"location":"refs/api/#OhMyThreads.@tasks","page":"Public API","title":"OhMyThreads.@tasks","text":"@tasks for ... end\n\nA macro to parallelize a for loop by spawning a set of tasks that can be run in parallel. The policy of how many tasks to spawn and how to distribute the iteration space among the tasks (and more) can be configured via @set statements in the loop body.\n\nSupports reductions (@set reducer=) and collecting the results (@set collect=true).\n\nUnder the hood, the for loop is translated into corresponding parallel tforeach, tmapreduce, or tmap calls.\n\nSee also: @set, @local\n\nExamples\n\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:3\n println(i)\nend\n\n@tasks for x in rand(10)\n @set reducer=+\n sin(x)\nend\n\n@tasks for i in 1:5\n @set collect=true\n i^2\nend\n\n@tasks for i in 1:100\n @set ntasks=4*nthreads()\n # non-uniform work...\nend\n\n@tasks for i in 1:5\n @set scheduler=:static\n println(\"i=\", i, \" → \", threadid())\nend\n\n@tasks for i in 1:100\n @set begin\n scheduler=:static\n chunksize=10\n end\n println(\"i=\", i, \" → \", threadid())\nend\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#OhMyThreads.@set","page":"Public API","title":"OhMyThreads.@set","text":"@set name = value\n\nThis can be used inside a @tasks for ... end block to specify settings for the parallel execution of the loop.\n\nMultiple settings are supported, either as separate @set statements or via @set begin ... end.\n\nSettings\n\nreducer (e.g. reducer=+): Indicates that a reduction should be performed with the provided binary function. See tmapreduce for more information.\ncollect (e.g. collect=true): Indicates that results should be collected (similar to map).\n\nAll other settings will be passed on to the underlying parallel functions (e.g. tmapreduce) as keyword arguments. Hence, you may provide whatever these functions accept as keyword arguments. Among others, this includes\n\nscheduler (e.g. scheduler=:static): Can be either a Scheduler or a Symbol (e.g. :dynamic, :static, :serial, or :greedy).\ninit (e.g. init=0.0): Initial value to be used in a reduction (requires reducer=...).\n\nSettings like ntasks, chunksize, and split etc. 
can be used to tune the scheduling policy (if the selected scheduler supports it).\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#OhMyThreads.@local","page":"Public API","title":"OhMyThreads.@local","text":"@local name = value\n\n@local name::T = value\n\nCan be used inside a @tasks for ... end block to specify task-local values (TLV) via explicitly typed assignments. These values will be allocated once per task (rather than once per iteration) and can be re-used between different task-local iterations.\n\nThere can only be a single @local block in a @tasks for ... end block. To specify multiple TLVs, use @local begin ... end. Compared to regular assignments, there are some limitations though, e.g. TLVs can't reference each other.\n\nExamples\n\nusing OhMyThreads: @tasks\nusing OhMyThreads.Tools: taskid\n\n@tasks for i in 1:10\n @set begin\n scheduler=:dynamic\n ntasks=2\n end\n @local x = zeros(3) # TLV\n\n x .+= 1\n println(taskid(), \" -> \", x)\nend\n\n@tasks for i in 1:10\n @local begin\n x = rand(Int, 3)\n M = rand(3, 3)\n end\n # ...\nend\n\nTask local variables created by @local are by default constrained to their inferred type, but if you need to, you can specify a different type during declaration:\n\n@tasks for i in 1:10\n @local x::Vector{Float64} = some_hard_to_infer_setup_function()\n # ...\nend\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#OhMyThreads.@only_one","page":"Public API","title":"OhMyThreads.@only_one","text":"@only_one begin ... end\n\nThis can be used inside a @tasks for ... end block to mark a region of code to be executed by only one of the parallel tasks (all other tasks skip over this region).\n\nExample\n\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n @set ntasks = 10\n\n println(i, \": before\")\n @only_one begin\n println(i, \": only printed by a single task\")\n sleep(1)\n end\n println(i, \": after\")\nend\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#OhMyThreads.@one_by_one","page":"Public API","title":"OhMyThreads.@one_by_one","text":"@one_by_one begin ... end\n\nThis can be used inside a @tasks for ... end block to mark a region of code to be executed by one parallel task at a time (i.e. exclusive access). The order may be arbitrary and non-deterministic.\n\nExample\n\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n @set ntasks = 10\n\n println(i, \": before\")\n @one_by_one begin\n println(i, \": one task at a time\")\n sleep(0.5)\n end\n println(i, \": after\")\nend\n\n\n\n\n\n","category":"macro"},{"location":"refs/api/#Functions","page":"Public API","title":"Functions","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic],\n [outputtype::Type = Any],\n [init])\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). 
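For instance, + and max are associative, whereas - is not; as a small illustrative check:\n\nusing OhMyThreads: tmapreduce\n\ntmapreduce(identity, +, 1:1000) # always 500500, independent of how the input is chunked\ntmapreduce(identity, -, 1:1000) # the result depends on the chunking\n\n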
If op is not (approximately) associative, you will get undefined results.\n\nExample:\n\nusing OhMyThreads: tmapreduce\n\ntmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n(√1 + √2) + (√3 + √4) + √5\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\noutputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.\ninit: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.\n\nIn addition, tmapreduce accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntmapreduce(√, +, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic],\n [outputtype::Type = Any],\n [init])\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nExample:\n\nusing OhMyThreads: treduce\n\ntreduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum([1, 2, 3, 4, 5]) in the form\n\n(1 + 2) + (3 + 4) + 5\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\noutputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.\ninit: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.\n\nIn addition, treduce accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntreduce(+, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic])\n\nA multithreaded function like Base.map. 
Create a new container similar to A and fills it in parallel such that the ith element is equal to f(A[i]).\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nExample:\n\nusing OhMyThreads: tmap\n\ntmap(sin, 1:10)\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\n\nIn addition, tmap accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntmap(sin, 1:10; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic])\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\n\nIn addition, tmap! accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. However, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic]) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nExample:\n\nusing OhMyThreads: tforeach\n\ntforeach(1:10) do i\n println(i^2)\nend\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\n\nIn addition, tforeach accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntforeach(1:10; chunksize=2, scheduler=:static) do i\n println(i^2)\nend\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n [scheduler::Union{Scheduler, Symbol} = :dynamic])\n\nA multithreaded function like Base.collect. 
Essentially just calls tmap on the generator function and inputs.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nExample:\n\nusing OhMyThreads: tcollect\n\ntcollect(sin(i) for i in 1:10)\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\n\nIn addition, tcollect accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. Example:\n\ntcollect(sin(i) for i in 1:10; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [scheduler::Union{Scheduler, Symbol} = :dynamic],\n [outputtype::Type = Any],\n [init])\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nExample:\n\nusing OhMyThreads: treducemap\n\ntreducemap(+, √, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n(√1 + √2) + (√3 + √4) + √5\n\nKeyword arguments:\n\nscheduler::Union{Scheduler, Symbol} (default :dynamic): determines how the computation is divided into parallel tasks and how these are scheduled. See Scheduler for more information on the available schedulers.\noutputtype::Type (default Any): will work as the asserted output type of parallel calculations. We use StableTasks.jl to make setting this option unnecessary, but if you experience problems with type stability, you may be able to recover it with this keyword argument.\ninit: initial value of the reduction. Will be forwarded to mapreduce for the task-local sequential parts of the calculation.\n\nIn addition, treducemap accepts all keyword arguments that are supported by the selected scheduler. They will simply be passed on to the corresponding Scheduler constructor. 
Example:\n\ntreducemap(+, √, [1, 2, 3, 4, 5]; chunksize=2, scheduler=:static)\n\nHowever, to avoid ambiguity, this is currently only supported for scheduler::Symbol (but not for scheduler::Scheduler).\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#Schedulers","page":"Public API","title":"Schedulers","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Scheduler\nDynamicScheduler\nStaticScheduler\nGreedyScheduler\nSerialScheduler","category":"page"},{"location":"refs/api/#OhMyThreads.Schedulers.Scheduler","page":"Public API","title":"OhMyThreads.Schedulers.Scheduler","text":"Supertype for all available schedulers:\n\nDynamicScheduler: default dynamic scheduler\nStaticScheduler: low-overhead static scheduler\nGreedyScheduler: greedy load-balancing scheduler\nSerialScheduler: serial (non-parallel) execution\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.Schedulers.DynamicScheduler","page":"Public API","title":"OhMyThreads.Schedulers.DynamicScheduler","text":"DynamicScheduler (aka :dynamic)\n\nThe default dynamic scheduler. Divides the given collection into chunks and then spawns a task per chunk to perform the requested operation in parallel. The tasks are assigned to threads by Julia's dynamic scheduler and are non-sticky, that is, they can migrate between threads.\n\nGenerally preferred since it is flexible, can provide load balancing, and is composable with other multithreaded code.\n\nKeyword arguments:\n\nnchunks::Integer or ntasks::Integer (default nthreads(threadpool)):\nDetermines the number of chunks (and thus also the number of parallel tasks).\nIncreasing nchunks can help with load balancing, but at the expense of creating more overhead. For nchunks <= nthreads() there are not enough chunks for any load balancing.\nSetting nchunks < nthreads() is an effective way to use only a subset of the available threads.\nchunksize::Integer (default not set)\nSpecifies the desired chunk size (instead of the number of chunks).\nThe options chunksize and nchunks/ntasks are mutually exclusive (only one may be a positive integer).\nsplit::Union{Symbol, OhMyThreads.Split} (default OhMyThreads.Consecutive()):\nDetermines how the collection is divided into chunks (if chunking=true). By default, each chunk consists of contiguous elements and order is maintained.\nSee ChunkSplitters.jl for more details and available options. We also allow users to pass :consecutive in place of Consecutive(), and :roundrobin in place of RoundRobin()\nBeware that for split=OhMyThreads.RoundRobin() the order of elements isn't maintained and a reducer function must not only be associative but also commutative!\nchunking::Bool (default true):\nControls whether input elements are grouped into chunks (true) or not (false).\nFor chunking=false, the arguments nchunks/ntasks, chunksize, and split are ignored and input elements are regarded as \"chunks\" as is. Hence, there will be one parallel task spawned per input element. Note that, depending on the input, this might spawn many(!) 
tasks and can be costly!\nthreadpool::Symbol (default :default):\nPossible options are :default and :interactive.\nThe high-priority pool :interactive should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes.\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.Schedulers.StaticScheduler","page":"Public API","title":"OhMyThreads.Schedulers.StaticScheduler","text":"StaticScheduler (aka :static)\n\nA static low-overhead scheduler. Divides the given collection into chunks and then spawns a task per chunk to perform the requested operation in parallel. The tasks are statically assigned to threads up front and are made sticky, that is, they are guaranteed to stay on the assigned threads (no task migration).\n\nCan sometimes be more performant than DynamicScheduler when the workload is (close to) uniform and, because of the lower overhead, for small workloads. Isn't well composable with other multithreaded code though.\n\nKeyword arguments:\n\nnchunks::Integer or ntasks::Integer (default nthreads()):\nDetermines the number of chunks (and thus also the number of parallel tasks).\nSetting nchunks < nthreads() is an effective way to use only a subset of the available threads.\nFor nchunks > nthreads() the chunks will be distributed to the available threads in a round-robin fashion.\nchunksize::Integer (default not set)\nSpecifies the desired chunk size (instead of the number of chunks).\nThe options chunksize and nchunks/ntasks are mutually exclusive (only one may be non-zero).\nchunking::Bool (default true):\nControls whether input elements are grouped into chunks (true) or not (false).\nFor chunking=false, the arguments nchunks/ntasks, chunksize, and split are ignored and input elements are regarded as \"chunks\" as is. Hence, there will be one parallel task spawned per input element. Note that, depending on the input, this might spawn many(!) tasks and can be costly!\nsplit::Union{Symbol, OhMyThreads.Split} (default OhMyThreads.Consecutive()):\nDetermines how the collection is divided into chunks. By default, each chunk consists of contiguous elements and order is maintained.\nSee ChunkSplitters.jl for more details and available options. We also allow users to pass :consecutive in place of Consecutive(), and :roundrobin in place of RoundRobin()\nBeware that for split=OhMyThreads.RoundRobin() the order of elements isn't maintained and a reducer function must not only be associative but also commutative!\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.Schedulers.GreedyScheduler","page":"Public API","title":"OhMyThreads.Schedulers.GreedyScheduler","text":"GreedyScheduler (aka :greedy)\n\nA greedy dynamic scheduler. 
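As a rough, purely illustrative usage sketch (the sleep-based workload below is made up to mimic non-uniform per-element cost):\n\nusing OhMyThreads: tmapreduce\n\ntmapreduce(+, 1:20; scheduler=:greedy) do i\n sleep(0.01 * (i % 5)) # simulate non-uniform per-element cost\n i^2\nend\n\n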
The elements of the collection are first put into a Channel and then dynamic, non-sticky tasks are spawned to process the channel content in parallel.\n\nNote that elements are processed in a non-deterministic order, and thus a potential reducing function must be commutative in addition to being associative, or you could get incorrect results!\n\nCan be good choice for load-balancing slower, uneven computations, but does carry some additional overhead.\n\nKeyword arguments:\n\nntasks::Int (default nthreads()):\nDetermines the number of parallel tasks to be spawned.\nSetting ntasks < nthreads() is an effective way to use only a subset of the available threads.\nchunking::Bool (default false):\nControls whether input elements are grouped into chunks (true) or not (false) before put into the channel. This can improve the performance especially if there are many iterations each of which are computationally cheap.\nIf nchunks or chunksize are explicitly specified, chunking will be automatically set to true.\nnchunks::Integer (default 10 * nthreads()):\nDetermines the number of chunks (that will eventually be put into the channel).\nIncreasing nchunks can help with load balancing. For nchunks <= nthreads() there are not enough chunks for any load balancing.\nchunksize::Integer (default not set)\nSpecifies the desired chunk size (instead of the number of chunks).\nThe options chunksize and nchunks are mutually exclusive (only one may be a positive integer).\nsplit::Union{Symbol, OhMyThreads.Split} (default OhMyThreads.RoundRobin()):\nDetermines how the collection is divided into chunks (if chunking=true).\nSee ChunkSplitters.jl for more details and available options. We also allow users to pass :consecutive in place of Consecutive(), and :roundrobin in place of RoundRobin()\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.Schedulers.SerialScheduler","page":"Public API","title":"OhMyThreads.Schedulers.SerialScheduler","text":"SerialScheduler (aka :serial)\n\nA scheduler for turning off any multithreading and running the code in serial. 
It aims to make parallel functions like, e.g., tmapreduce(sin, +, 1:100) behave like their serial counterparts, e.g., mapreduce(sin, +, 1:100).\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#Re-exported","page":"Public API","title":"Re-exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.chunks see ChunkSplitters.jl\nOhMyThreads.index_chunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Public-but-not-exported","page":"Public API","title":"Public but not exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl\nOhMyThreads.@fetch see StableTasks.jl\nOhMyThreads.@fetchfrom see StableTasks.jl\nOhMyThreads.TaskLocalValue see TaskLocalValues.jl\nOhMyThreads.Split see ChunkSplitters.jl\nOhMyThreads.Consecutive see ChunkSplitters.jl\nOhMyThreads.RoundRobin see ChunkSplitters.jl","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"OhMyThreads.WithTaskLocals\nOhMyThreads.promise_task_local","category":"page"},{"location":"refs/api/#OhMyThreads.WithTaskLocals","page":"Public API","title":"OhMyThreads.WithTaskLocals","text":"struct WithTaskLocals{F, TLVs <: Tuple{Vararg{TaskLocalValue}}} <: Function\n\nThis callable function-like object is meant to represent a function which closes over some TaskLocalValues. This is, if you do\n\nTLV{T} = TaskLocalValue{T}\nf = WithTaskLocals((TLV{Int}(() -> 1), TLV{Int}(() -> 2))) do (x, y)\n z -> (x + y)/z\nend\n\nthen that is equivalent to\n\ng = let x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)\n z -> let x = x[], y=y[]\n (x + y)/z\n end\nend\n\nhowever, the main difference is that you can call promise_task_local on a WithTaskLocals closure in order to turn it into something equivalent to\n\nlet x=x[], y=y[]\n z -> (x + y)/z\nend\n\nwhich doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course will lose the safety advantages of TaskLocalValue, so you should never do f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local on a new task, you'll hit a race condition.\n\n\n\n\n\n","category":"type"},{"location":"refs/api/#OhMyThreads.promise_task_local","page":"Public API","title":"OhMyThreads.promise_task_local","text":"promise_task_local(f) = f\npromise_task_local(f::WithTaskLocals) = f.inner_func(map(x -> x[], f.tasklocals))\n\nTake a WithTaskLocals closure, grab the TaskLocalValues, and passs them to the closure. That is, it turns a WithTaskLocals closure from the equivalent of\n\nTLV{T} = TaskLocalValue{T}\nlet x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)\n z -> let x = x[], y=y[]\n (x + y)/z\n end\nend\n\ninto the equivalent of\n\nlet x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)\n let x = x[], y = y[]\n z -> (x + y)/z\n end\nend\n\nwhich doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course will lose the safety advantages of TaskLocalValue, so you should never do f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local on a new task, you'll hit a race condition. 
```\n\n\n\n\n\n","category":"function"},{"location":"basics/#Basics","page":"Basics","title":"Basics","text":"","category":"section"},{"location":"basics/","page":"Basics","title":"Basics","text":"This section is still in preparation. For now, you might want to take a look at the translation guide and the examples.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"EditURL = \"integration.jl\"","category":"page"},{"location":"literate/integration/integration/#Trapezoidal-Integration","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"section"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"In this example, we want to parallelize the computation of a simple numerical integral via the trapezoidal rule. The latter is given by","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"int_a^bf(x)dx approx h sum_i=1^Nfracf(x_i-1)+f(x_i)2","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The function to be integrated is the following.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f(x) = 4 * √(1 - x^2)","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f (generic function with 1 method)","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The analytic result of the definite integral (from 0 to 1) is known to be pi.","category":"page"},{"location":"literate/integration/integration/#Sequential","page":"Trapezoidal Integration","title":"Sequential","text":"","category":"section"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Naturally, we implement the trapezoidal rule as a straightforward, sequential for loop.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"function trapezoidal(a, b, n; h = (b - a) / n)\n y = (f(a) + f(b)) / 2.0\n for i in 1:(n - 1)\n x = a + i * h\n y = y + f(x)\n end\n return y * h\nend","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal (generic function with 1 method)","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Let's compute the integral of f above and see if we get the expected result. 
For simplicity, we choose N, the number of panels used to discretize the integration interval, as a multiple of the number of available Julia threads.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using Base.Threads: nthreads\n\nN = nthreads() * 1_000_000","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"10000000","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Calling trapezoidal we do indeed find the (approximate) value of pi.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal(0, 1, N) ≈ π","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"literate/integration/integration/#Parallel","page":"Trapezoidal Integration","title":"Parallel","text":"","category":"section"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Our strategy is the following: Divide the integration interval among the available Julia threads. On each thread, use the sequential trapezoidal rule to compute the partial integral. It is straightforward to implement this strategy with tmapreduce. The map part is, essentially, the application of trapezoidal and the reduction operator is chosen to be + to sum up the local integrals.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using OhMyThreads\n\nfunction trapezoidal_parallel(a, b, N)\n n = N ÷ nthreads()\n h = (b - a) / N\n return tmapreduce(+, 1:nthreads()) do i\n local α = a + (i - 1) * n * h # the local keywords aren't necessary but good practice\n local β = α + n * h\n trapezoidal(α, β, n; h)\n end\nend\n\n# or equivalently\n#\n# function trapezoidal_parallel(a, b, N)\n# n = N ÷ nthreads()\n# h = (b - a) / N\n# @tasks for i in 1:nthreads()\n# @set reducer=+\n# local α = a + (i - 1) * n * h\n# local β = α + n * h\n# trapezoidal(α, β, n; h)\n# end\n# end","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel (generic function with 1 method)","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"First, we check the correctness of our parallel implementation.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel(0, 1, N) ≈ π","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Then, we benchmark and compare the performance of the sequential and parallel versions.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using BenchmarkTools\n@btime trapezoidal(0, 1, $N);\n@btime 
trapezoidal_parallel(0, 1, $N);","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":" 24.348 ms (0 allocations: 0 bytes)\n 2.457 ms (69 allocations: 6.05 KiB)\n","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Because the problem is trivially parallel - all threads do the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"nthreads()","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"10","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"page"},{"location":"literate/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"This page was generated using Literate.jl.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"EditURL = \"falsesharing.jl\"","category":"page"},{"location":"literate/falsesharing/falsesharing/#FalseSharing","page":"False Sharing","title":"False Sharing","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"False Sharing is a very common but subtle performance issue that comes up again and again when writing parallel code manually. For this reason, we shall discuss what it is about and how to avoid it.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"For simplicity, let's focus on a specific example: parallel summation.","category":"page"},{"location":"literate/falsesharing/falsesharing/#Baseline:-sequential-summation","page":"False Sharing","title":"Baseline: sequential summation","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"To establish a baseline that we can later compare against, we define some fake data, which we'll sum up, and benchmark Julia's built-in, non-parallel sum function.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"using Base.Threads: nthreads\nusing BenchmarkTools\n\ndata = rand(1_000_000 * nthreads());\n@btime sum($data);","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 2.327 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/#The-problematic-parallel-implementation","page":"False Sharing","title":"The problematic parallel implementation","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"A conceptually simple (and valid) approach to parallelizing the summation is to divide the full computation into parts. Specifically, the idea is to divide the data into chunks, compute the partial sums of these chunks in parallel, and finally sum up the partial results. 
(Note that we will not concern ourselves with potential minor or catastrophic numerical errors due to potential rearrangements of terms in the summation here.)","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"A common, manual implementation of this idea might look like this:","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"using OhMyThreads: @spawn, index_chunks\n\nfunction parallel_sum_falsesharing(data; nchunks = nthreads())\n psums = zeros(eltype(data), nchunks)\n @sync for (c, idcs) in enumerate(index_chunks(data; n = nchunks))\n @spawn begin\n for i in idcs\n psums[c] += data[i]\n end\n end\n end\n return sum(psums)\nend","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"parallel_sum_falsesharing (generic function with 1 method)","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"The code is pretty straightforward: We allocate space for the results of the partial sums (psums) and, on nchunks many tasks, add up the data elements of each partial sum in parallel. More importantly, and in this context perhaps surprisingly, the code is also correct in the sense that it produces the desired result.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"using Test\n@test sum(data) ≈ parallel_sum_falsesharing(data)","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Test Passed","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"This is just a reflection of the fact that there is no logical sharing of data - because each parallel tasks modifies a different element of psums - implying the absence of race conditions.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"What's the issue then?! Well, the sole purpose of parallelization is to reduce runtime. So let's see how well we're doing in this respect.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"nthreads()","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"10","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"@btime parallel_sum_falsesharing($data);","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 52.919 ms (221 allocations: 18.47 KiB)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"A (huge) slowdown?! Clearly, that's the opposite of what we tried to achieve!","category":"page"},{"location":"literate/falsesharing/falsesharing/#The-issue:-False-sharing","page":"False Sharing","title":"The issue: False sharing","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Although our parallel summation above is semantically correct, it has a big performance issue: False sharing. 
To understand false sharing, we have to think a little bit about how computers work. Specifically, we need to realize that processors cache memory in lines (rather than individual elements) and that caches of different processors are kept coherent. When two (or more) different CPU cores operate on independent data elements that fall into the same cache line (i.e. they are part of the same memory address region) the cache coherency mechanism leads to costly synchronization between cores.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"In our case, this happens despite the fact that different parallel tasks (on different CPU cores) logically don't care about the rest of the data in the cache line at all.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"(Image: )","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Given these insights, we can come up with a few workarounds that mitigate the issue. The most prominent is probably padding, where one simply adds sufficiently many unused zeros to psums such that different partial sum counters don't fall into the same cache line. However, let's discuss a more fundamental, more efficient, and more elegant solution.","category":"page"},{"location":"literate/falsesharing/falsesharing/#Task-local-parallel-summation","page":"False Sharing","title":"Task-local parallel summation","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"The key mistake in parallel_sum_falsesharing above is the non-local modification of (implicitly) shared state (cache lines of psums) very frequently (in the innermost loop). We can simply avoid this by making the code more task-local. To this end, we introduce a task-local accumulator variable, which we use to perform the task-local partial sums. Only at the very end do we communicate the result to the main thread, e.g. by writing it into psums (once!).","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"function parallel_sum_tasklocal(data; nchunks = nthreads())\n psums = zeros(eltype(data), nchunks)\n @sync for (c, idcs) in enumerate(index_chunks(data; n = nchunks))\n @spawn begin\n local s = zero(eltype(data))\n for i in idcs\n s += data[i]\n end\n psums[c] = s\n end\n end\n return sum(psums)\nend\n\n@test sum(data) ≈ parallel_sum_tasklocal(data)\n@btime parallel_sum_tasklocal($data);","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 1.120 ms (221 allocations: 18.55 KiB)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Finally, there is a speed up! 🎉","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Two comments are in order.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"First, we note that the only role that psums plays is as a temporary storage for the results from the parallel tasks to be able to sum them up eventually. We could get rid of it entirely by using a Threads.Atomic instead which would get updated via Threads.atomic_add! 
from each task directly. However, for our discussion, this is a detail and we won't discuss it further.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Secondly, while keeping the general idea, we can drastically simplify the above code by using map and reusing the built-in (sequential) sum function on each parallel task:","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"function parallel_sum_map(data; nchunks = nthreads())\n ts = map(index_chunks(data, n = nchunks)) do idcs\n @spawn @views sum(data[idcs])\n end\n return sum(fetch.(ts))\nend\n\n@test sum(data) ≈ parallel_sum_map(data)\n@btime parallel_sum_map($data);","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 893.396 μs (64 allocations: 5.72 KiB)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"This implementation is conceptually clearer in that there is no explicit modification of shared state, i.e. no psums[c] = s, anywhere at all. We can't run into false sharing if we don't modify shared state 😉.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Note that since we use the built-in sum function, which is highly optimized, we might see better runtimes due to other effects - like SIMD and the absence of bounds checks - compared to the simple for-loop accumulation in parallel_sum_tasklocal above.","category":"page"},{"location":"literate/falsesharing/falsesharing/#Parallel-summation-with-OhMyThreads","page":"False Sharing","title":"Parallel summation with OhMyThreads","text":"","category":"section"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"Finally, all of the above is abstracted away for you if you simply use treduce to implement the parallel summation. It also only takes a single line and function call.","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"using OhMyThreads: treduce\n\n@test sum(data) ≈ treduce(+, data; ntasks = nthreads())\n@btime treduce($+, $data; ntasks = $nthreads());","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":" 899.097 μs (68 allocations: 5.92 KiB)\n","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"","category":"page"},{"location":"literate/falsesharing/falsesharing/","page":"False Sharing","title":"False Sharing","text":"This page was generated using Literate.jl.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"EditURL = \"tls.jl\"","category":"page"},{"location":"literate/tls/tls/#TSS","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"For some programs, it can be useful or even necessary to allocate and (re-)use memory in your parallel code (e.g. your computation might require temporary buffers). 
The following section demonstrates common issues that can arise in such a scenario and, by means of a simple example, explains techniques to handle such cases safely. Specifically, we'll discuss (1) how task-local storage (TLS) can be used efficiently and (2) how channels can be used to organize per-task buffer allocation in a thread-safe manner.","category":"page"},{"location":"literate/tls/tls/#Test-case-(sequential)","page":"Thread-Safe Storage","title":"Test case (sequential)","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Let's say that we are given two arrays of matrices, As and Bs, and let's further assume that our goal is to compute the total sum of all pairwise matrix products. We can readily implement a (sequential) function that performs the necessary computations.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using LinearAlgebra: mul!, BLAS\nBLAS.set_num_threads(1) # for simplicity, we turn off OpenBLAS multithreading\n\nfunction matmulsums(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n map(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"matmulsums (generic function with 1 method)","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Here, we use map to perform the desired operation for each pair of matrices, A and B. However, the crucial point for our discussion is that we want to use the in-place matrix multiplication LinearAlgebra.mul! in conjunction with a pre-allocated temporary buffer, the output matrix C. This is to avoid the temporary allocation per \"iteration\" (i.e. per matrix pair) that we would get with C = A*B.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"For later comparison, we generate some random input data and store the result.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"As = [rand(256, 16) for _ in 1:768]\nBs = [rand(16, 256) for _ in 1:768]\n\nres = matmulsums(As, Bs);","category":"page"},{"location":"literate/tls/tls/#How-to-not-parallelize","page":"Thread-Safe Storage","title":"How to not parallelize","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"The key idea for creating a parallel version of matmulsums is to replace the map by OhMyThreads' parallel tmap function. However, because we re-use C, this isn't entirely trivial. 
Someone new to parallel computing might be tempted to parallelize matmulsums like this:","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: tmap\n\nfunction matmulsums_race(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n tmap(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"matmulsums_race (generic function with 1 method)","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Unfortunately, this doesn't produce the correct result.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"res_race = matmulsums_race(As, Bs)\nres ≈ res_race","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"false","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"In fact, it doesn't even always produce the same result (check for yourself)! The reason is that there is a race condition: different parallel tasks are trying to use the shared variable C simultaneously leading to non-deterministic behavior. Let's see how we can fix this.","category":"page"},{"location":"literate/tls/tls/#The-naive-(and-inefficient)-fix","page":"Thread-Safe Storage","title":"The naive (and inefficient) fix","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"A simple solution for the race condition issue above is to move the allocation of C into the body of the parallel tmap:","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"function matmulsums_naive(As, Bs)\n N = size(first(As), 1)\n tmap(As, Bs) do A, B\n C = Matrix{Float64}(undef, N, N)\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"matmulsums_naive (generic function with 1 method)","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"In this case, a separate C will be allocated for each iteration such that parallel tasks no longer mutate shared state. Hence, we'll get the desired result.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"res_naive = matmulsums_naive(As, Bs)\nres ≈ res_naive","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"However, this variant is obviously inefficient because it is no better than just writing C = A*B and thus leads to one allocation per matrix pair. 
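To put a rough number on this: for the input sizes used here, each temporary C is a 256×256 Matrix{Float64}, i.e. 256*256*8 bytes ≈ 0.5 MiB, so 768 matrix pairs amount to roughly 384 MiB of allocations, which is consistent with what the benchmark section further below reports for matmulsums_naive. 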
We need a different way of allocating and re-using C for an efficient parallel version.","category":"page"},{"location":"literate/tls/tls/#TLS","page":"Thread-Safe Storage","title":"Task-local storage","text":"","category":"section"},{"location":"literate/tls/tls/#The-manual-(and-cumbersome)-way","page":"Thread-Safe Storage","title":"The manual (and cumbersome) way","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"We've seen that we can't allocate C once up-front (→ race condition) and also shouldn't allocate it within the tmap (→ one allocation per iteration). Instead, we can assign a separate \"C\" on each parallel task once and then use this task-local \"C\" for all iterations (i.e. matrix pairs) for which this task is responsible. Before we learn how to do this more conveniently, let's implement this idea of a task-local temporary buffer (for each parallel task) manually.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: index_chunks, @spawn\nusing Base.Threads: nthreads\n\nfunction matmulsums_manual(As, Bs)\n N = size(first(As), 1)\n tasks = map(index_chunks(As; n = 2 * nthreads())) do idcs\n @spawn begin\n local C = Matrix{Float64}(undef, N, N)\n map(idcs) do i\n A = As[i]\n B = Bs[i]\n\n mul!(C, A, B)\n sum(C)\n end\n end\n end\n mapreduce(fetch, vcat, tasks)\nend\n\nres_manual = matmulsums_manual(As, Bs)\nres ≈ res_manual","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"We note that this is rather cumbersome and you might not want to write it (repeatedly). But let's take a closer look and see what's happening here. First, we divide the number of matrix pairs into 2 * nthreads() chunks. Then, for each of those chunks, we spawn a parallel task that (1) allocates a task-local C matrix (and a results vector) and (2) performs the actual computations using these pre-allocated buffers. Finally, we fetch the results of the tasks and combine them. This variant works just fine and the good news is that we can get the same behavior with less manual work.","category":"page"},{"location":"literate/tls/tls/#TLV","page":"Thread-Safe Storage","title":"The shortcut: TaskLocalValue","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"The desire for task-local storage is quite natural with task-based multithreading. For this reason, Julia supports this out of the box with Base.task_local_storage. But instead of using this directly (which you could), we will use a convenience wrapper around it called TaskLocalValue. This allows us to express the idea from above in a few lines of code:","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: TaskLocalValue\n\nfunction matmulsums_tlv(As, Bs; kwargs...)\n N = size(first(As), 1)\n tlv = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))\n tmap(As, Bs; kwargs...) 
do A, B\n C = tlv[]\n mul!(C, A, B)\n sum(C)\n end\nend\n\nres_tlv = matmulsums_tlv(As, Bs)\nres ≈ res_tlv","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Here, TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) creates a task-local value - essentially a reference to a value in the task-local storage - that behaves like this: The first time the task-local value is accessed from a task (tlv[]) it is initialized according to the provided anonymous function. Afterwards, every following query (from the same task!) will simply look up and return the task-local value. This solves our issues above and leads to O(parallel tasks) (instead of O(iterations)) allocations.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Note that if you use our @tasks macro API, there is built-in support for task-local values via @local.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: @tasks\n\nfunction matmulsums_tlv_macro(As, Bs; kwargs...)\n N = size(first(As), 1)\n @tasks for i in eachindex(As, Bs)\n @set collect = true\n @local C = Matrix{Float64}(undef, N, N)\n mul!(C, As[i], Bs[i])\n sum(C)\n end\nend\n\nres_tlv_macro = matmulsums_tlv_macro(As, Bs)\nres ≈ res_tlv_macro","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Here, @local expands to a pattern similar to the TaskLocalValue one above, although it automatically infers that the object's type is Matrix{Float64}, and it carries some optimizations (see OhMyThreads.WithTaskLocals) which can make accessing task-local values more efficient in loops which take on the order of 100ns to complete.","category":"page"},{"location":"literate/tls/tls/#Benchmark","page":"Thread-Safe Storage","title":"Benchmark","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"The whole point of parallelization is increasing performance, so let's benchmark and compare the performance of the variants that we've discussed so far.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using BenchmarkTools\n\n@show nthreads()\n\n@btime matmulsums($As, $Bs);\n@btime matmulsums_naive($As, $Bs);\n@btime matmulsums_manual($As, $Bs);\n@btime matmulsums_tlv($As, $Bs);\n@btime matmulsums_tlv_macro($As, $Bs);","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"nthreads() = 10\n 61.314 ms (3 allocations: 518.17 KiB)\n 22.122 ms (1621 allocations: 384.06 MiB)\n 7.620 ms (204 allocations: 10.08 MiB)\n 7.702 ms (126 allocations: 5.03 MiB)\n 7.600 ms (127 allocations: 5.03 MiB)\n","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"As we can see, matmulsums_tlv (and matmulsums_tlv_macro) isn't only convenient but also efficient: It allocates much less memory than matmulsums_naive and is about on par with the manual 
implementation.","category":"page"},{"location":"literate/tls/tls/#Per-thread-allocation","page":"Thread-Safe Storage","title":"Per-thread allocation","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"The task-local solution above has one potential caveat: If we spawn many parallel tasks (e.g. for load-balancing reasons) we need just as many task-local buffers. This can clearly be suboptimal because only nthreads() tasks can run simultaneously. Hence, one buffer per thread should actually suffice. Of course, this raises the question of how to organize a pool of \"per-thread\" buffers such that each running task always has exclusive (temporary) access to a buffer (we need to make sure to avoid races).","category":"page"},{"location":"literate/tls/tls/#The-naive-(and-incorrect)-approach","page":"Thread-Safe Storage","title":"The naive (and incorrect) approach","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"A naive approach to implementing this idea is to pre-allocate an array of buffers and then to use the threadid() to select a buffer for a running task.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using Base.Threads: threadid\n\nfunction matmulsums_perthread_naive(As, Bs)\n N = size(first(As), 1)\n Cs = [Matrix{Float64}(undef, N, N) for _ in 1:nthreads()]\n tmap(As, Bs) do A, B\n C = Cs[threadid()]\n mul!(C, A, B)\n sum(C)\n end\nend\n\n# non-uniform workload\nAs_nu = [rand(256, isqrt(i)^2) for i in 1:768];\nBs_nu = [rand(isqrt(i)^2, 256) for i in 1:768];\nres_nu = matmulsums(As_nu, Bs_nu);\n\nres_pt_naive = matmulsums_perthread_naive(As_nu, Bs_nu)\nres_nu ≈ res_pt_naive","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Unfortunately, this approach is generally wrong. The first issue is that threadid() doesn't necessarily start at 1 (and thus might return a value > nthreads()), in which case Cs[threadid()] would be an out-of-bounds access attempt. This might be surprising but is a simple consequence of the ordering of different kinds of Julia threads: If Julia is started with a non-zero number of interactive threads, e.g. --threads 5,2, the interactive threads come first (look at Threads.threadpool.(1:Threads.maxthreadid())).","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"But even if we account for this offset there is another, more fundamental problem, namely task migration. By default, all spawned parallel tasks are \"non-sticky\" and can dynamically migrate between different Julia threads (loosely speaking, at any point in time). This simply means that threadid() is not necessarily constant for a task! For example, imagine that task A starts on thread 4, loads the buffer Cs[4], but then gets paused, migrated, and continues execution on, say, thread 5. Afterwards, while task A is performing mul!(Cs[4], ...), a different task B might start on (the now available) thread 4 and also read and use Cs[4]. This would lead to a race condition because both tasks are mutating the same buffer. 
(Note that, in practice, this - most likely 😉 - doesn't happen for the very simple example above, but you can't rely on it!)","category":"page"},{"location":"literate/tls/tls/#The-quick-fix-(with-caveats)","page":"Thread-Safe Storage","title":"The quick fix (with caveats)","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"A simple solution for the task-migration issue is to opt-out of dynamic scheduling with scheduler=:static (or scheduler=StaticScheduler()). This scheduler statically assigns tasks to threads upfront without any dynamic rescheduling (the tasks are sticky and won't migrate).","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"function matmulsums_perthread_static(As, Bs)\n N = size(first(As), 1)\n Cs = [Matrix{Float64}(undef, N, N) for _ in 1:nthreads()]\n tmap(As, Bs; scheduler = :static) do A, B\n C = Cs[threadid()]\n mul!(C, A, B)\n sum(C)\n end\nend\n\nres_pt_static = matmulsums_perthread_static(As_nu, Bs_nu)\nres_nu ≈ res_pt_static","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"However, this approach doesn't solve the offset issue and, even worse, makes the parallel code non-composable: If we call other multithreaded functions within the tmap or if our parallel matmulsums_perthread_static itself gets called from another parallel region we will likely oversubscribe the Julia threads and get subpar performance. Given these caveats, we should therefore generally take a different approach.","category":"page"},{"location":"literate/tls/tls/#The-safe-way:-Channel","page":"Thread-Safe Storage","title":"The safe way: Channel","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Instead of storing the pre-allocated buffers in an array, we can put them into a Channel which internally ensures that parallel access is safe. In this scenario, we simply take! a buffer from the channel whenever we need it and put! it back after our computation is done.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"function matmulsums_perthread_channel(As, Bs; nbuffers = nthreads(), kwargs...)\n N = size(first(As), 1)\n chnl = Channel{Matrix{Float64}}(nbuffers)\n foreach(1:nbuffers) do _\n put!(chnl, Matrix{Float64}(undef, N, N))\n end\n tmap(As, Bs; kwargs...) do A, B\n C = take!(chnl)\n mul!(C, A, B)\n result = sum(C)\n put!(chnl, C)\n result\n end\nend\n\nres_pt_channel = matmulsums_perthread_channel(As_nu, Bs_nu)\nres_nu ≈ res_pt_channel","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/#Benchmark-2","page":"Thread-Safe Storage","title":"Benchmark","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Let's benchmark the variants above and compare them to the task-local implementation. 
We want to look at both ntasks = nthreads() and ntasks > nthreads(), the latter of which gives us dynamic load balancing.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"# no load balancing because ntasks == nthreads()\n@btime matmulsums_tlv($As_nu, $Bs_nu);\n@btime matmulsums_perthread_static($As_nu, $Bs_nu);\n@btime matmulsums_perthread_channel($As_nu, $Bs_nu);\n\n# load balancing because ntasks > nthreads()\n@btime matmulsums_tlv($As_nu, $Bs_nu; ntasks = 2 * nthreads());\n@btime matmulsums_perthread_channel($As_nu, $Bs_nu; ntasks = 2 * nthreads());\n\n@btime matmulsums_tlv($As_nu, $Bs_nu; ntasks = 10 * nthreads());\n@btime matmulsums_perthread_channel($As_nu, $Bs_nu; ntasks = 10 * nthreads());","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":" 170.563 ms (126 allocations: 5.03 MiB)\n 165.647 ms (108 allocations: 5.02 MiB)\n 172.216 ms (114 allocations: 5.02 MiB)\n 108.662 ms (237 allocations: 10.05 MiB)\n 114.673 ms (185 allocations: 5.04 MiB)\n 97.933 ms (1118 allocations: 50.13 MiB)\n 96.868 ms (746 allocations: 5.10 MiB)\n","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Note that the runtime of matmulsums_perthread_channel improves with increasing number of chunks/tasks (due to load balancing) while the amount of allocated memory doesn't increase much. Contrast this with the drastic memory increase with matmulsums_tlv.","category":"page"},{"location":"literate/tls/tls/#Another-safe-way-based-on-Channel","page":"Thread-Safe Storage","title":"Another safe way based on Channel","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Above, we chose to put a limited number of buffers (e.g. nthreads()) into the channel and then spawn many tasks (one per input element). Sometimes it can make sense to flip things around and put the (many) input elements into a channel and only spawn a limited number of tasks (e.g. nthreads()) with task-local buffers.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using OhMyThreads: tmapreduce\n\nfunction matmulsums_perthread_channel_flipped(As, Bs; ntasks = nthreads())\n N = size(first(As), 1)\n chnl = Channel{Int}(length(As); spawn = true) do chnl\n for i in 1:length(As)\n put!(chnl, i)\n end\n end\n tmapreduce(vcat, 1:ntasks; chunking=false) do _ # we turn chunking off\n local C = Matrix{Float64}(undef, N, N)\n map(chnl) do i # implicitly takes the values from the channel (parallel safe)\n A = As[i]\n B = Bs[i]\n mul!(C, A, B)\n sum(C)\n end\n end\nend;","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Note that one caveat of this approach is that the input → task assignment, and thus the order of the output, is non-deterministic. 
For this reason, we sort the output to check for correctness.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"res_channel_flipped = matmulsums_perthread_channel_flipped(As_nu, Bs_nu)\nsort(res_nu) ≈ sort(res_channel_flipped)","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"true","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Quick benchmark:","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"@btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu);\n@btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu; ntasks = 2 * nthreads());\n@btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu; ntasks = 10 * nthreads());","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":" 94.389 ms (170 allocations: 5.07 MiB)\n 94.580 ms (271 allocations: 10.10 MiB)\n 94.768 ms (1071 allocations: 50.41 MiB)\n","category":"page"},{"location":"literate/tls/tls/#Bumper.jl-(only-for-the-brave)","page":"Thread-Safe Storage","title":"Bumper.jl (only for the brave)","text":"","category":"section"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"If you are bold and want to cut down temporary allocations even more, you can give Bumper.jl a try. Essentially, it allows you to bring your own stacks, that is, task-local bump allocators which you can dynamically allocate memory to, and reset them at the end of a code block, just like Julia's stack. Be warned though that Bumper.jl (1) is a rather young package with (likely) some bugs and (2) can easily lead to segfaults when used incorrectly. If you can live with the risk, Bumper.jl is especially useful for cases where the size of the preallocated matrix isn't known ahead of time, and even more useful if we want to do many intermediate allocations on the task, not just one. 
For our example, this isn't the case, but let's nonetheless see how one would use Bumper.jl here.","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"using Bumper\n\nfunction matmulsums_bumper(As, Bs)\n tmap(As, Bs) do A, B\n @no_escape begin # promising that no memory will escape\n N = size(A, 1)\n C = @alloc(Float64, N, N) # from bump allocator (fake \"stack\")\n mul!(C, A, B)\n sum(C)\n end\n end\nend\n\nres_bumper = matmulsums_bumper(As, Bs);\nsort(res) ≈ sort(res_bumper)\n\n@btime matmulsums_bumper($As, $Bs);","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":" 7.814 ms (134 allocations: 27.92 KiB)\n","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"Note that the benchmark is lying here about the total memory allocation, because it doesn't show the allocation of the task-local bump allocators themselves (the reason is that SlabBuffer uses malloc directly).","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"","category":"page"},{"location":"literate/tls/tls/","page":"Thread-Safe Storage","title":"Thread-Safe Storage","text":"This page was generated using Literate.jl.","category":"page"},{"location":"translation/#TG","page":"Translation Guide","title":"Translation Guide","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"This page tries to give a general overview of how to translate patterns written with the built-in tools of Base.Threads using the OhMyThreads.jl API. Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction to OhMyThreads.jl.","category":"page"},{"location":"translation/#Basics","page":"Translation Guide","title":"Basics","text":"","category":"section"},{"location":"translation/#@threads","page":"Translation Guide","title":"@threads","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @threads\n\n@threads for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n println(i)\nend\n\n# or\nusing OhMyThreads: tforeach\n\ntforeach(1:10) do i\n println(i)\nend","category":"page"},{"location":"translation/#:static-scheduling","page":"Translation Guide","title":":static scheduling","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @threads\n\n@threads :static for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n @set scheduler=:static\n println(i)\nend\n\n# or\nusing OhMyThreads: tforeach\n\ntforeach(1:10; scheduler=:static) do i\n println(i)\nend","category":"page"},{"location":"translation/#@spawn","page":"Translation Guide","title":"@spawn","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @spawn\n\n@sync for i in 1:10\n @spawn 
println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\n\n@tasks for i in 1:10\n @set chunking=false\n println(i)\nend\n\n# or\nusing OhMyThreads: tforeach\n\ntforeach(1:10; chunking=false) do i\n println(i)\nend\n\n# or\nusing OhMyThreads: @spawn\n\n@sync for i in 1:10\n @spawn println(i)\nend","category":"page"},{"location":"translation/#Reduction","page":"Translation Guide","title":"Reduction","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"No built-in feature in Base.Threads.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads: basic manual implementation\nusing Base.Threads: @spawn\n\ndata = rand(10)\nchunks_itr = Iterators.partition(data, length(data) ÷ nthreads())\ntasks = map(chunks_itr) do chunk\n @spawn reduce(+, chunk)\nend\nreduce(+, fetch.(tasks))","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\ndata = rand(10)\n\n@tasks for x in data\n @set reducer=+\nend\n\n# or\nusing OhMyThreads: treduce\n\ntreduce(+, data)","category":"page"},{"location":"translation/#Mutation","page":"Translation Guide","title":"Mutation","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). You should carefully consider whether this is necessary or whether the use of thread-safe storage is the better option. We don't recommend using the examples in this section for anything serious!","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @threads\ndata = rand(10)\n\n@threads for i in eachindex(data)\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\ndata = rand(10)\n\n@tasks for i in eachindex(data)\n data[i] = calc(i)\nend\n\n# or\nusing OhMyThreads: tforeach\n\ntforeach(eachindex(data)) do i\n data[i] = calc(i)\nend\n\n# or\nusing OhMyThreads: tmap!\n\ntmap!(data, eachindex(data)) do i\n calc(i)\nend","category":"page"},{"location":"translation/#Parallel-initialization","page":"Translation Guide","title":"Parallel initialization","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). You should carefully consider whether this is necessary or whether the use of thread-safe storage is the better option. 
We don't recommend using the examples in this section for anything serious!","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\nusing Base.Threads: @threads\n\ndata = Vector{Float64}(undef, 10)\n@threads for i in eachindex(data)\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\nusing OhMyThreads: @tasks\n\ndata = @tasks for i in 1:10\n @set collect=true\n calc(i)\nend\n\n# or\nusing OhMyThreads: tmap\n\ndata = tmap(i->calc(i), 1:10)\n\n# or\nusing OhMyThreads: tcollect\n\ndata = tcollect(calc(i) for i in 1:10)","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"literate/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi/4. This means that, if you throw N darts randomly at the square, approximately M = N*pi/4 of those darts will land inside the unit circle.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at the square and count how many of them (M) landed inside the unit circle. Approximate pi ≈ 4*M/N.","category":"page"},{"location":"literate/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14171236","category":"page"},{"location":"literate/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N; kwargs...)\n M = tmapreduce(+, 1:N; kwargs...) 
do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\n# or alternatively\n#\n# function mc_parallel(N)\n# M = @tasks for _ in 1:N\n# @set reducer = +\n# rand()^2 + rand()^2 < 1.0\n# end\n# pi = 4 * M / N\n# return pi\n# end\n\nmc_parallel(N)","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14156496","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 10\n 301.636 ms (0 allocations: 0 bytes)\n 41.864 ms (68 allocations: 5.81 KiB)\n","category":"page"},{"location":"literate/mc/mc/#Static-scheduling","page":"Parallel Monte Carlo","title":"Static scheduling","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Because the workload is highly uniform, it makes sense to also try the StaticScheduler and compare the performance of static and dynamic scheduling (with default parameters).","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: StaticScheduler\n\n@btime mc_parallel($N; scheduler=:dynamic) samples=10 evals=3; # default\n@btime mc_parallel($N; scheduler=:static) samples=10 evals=3;","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 41.839 ms (68 allocations: 5.81 KiB)\n 41.838 ms (68 allocations: 5.81 KiB)\n","category":"page"},{"location":"literate/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the index_chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. 
Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn, index_chunks\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(index_chunks(1:N; n = nchunks)) do idcs\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14180504","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 30.224 ms (65 allocations: 5.70 KiB)\n","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(index_chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 41.750 ms (0 allocations: 0 bytes)\n 30.148 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"literate/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/experimental/","page":"Experimental","title":"Experimental","text":"CollapsedDocStrings = true","category":"page"},{"location":"refs/experimental/#Experimental","page":"Experimental","title":"Experimental","text":"","category":"section"},{"location":"refs/experimental/","page":"Experimental","title":"Experimental","text":"warning: Warning\nEverything on this page is experimental and might be changed or dropped at any point!","category":"page"},{"location":"refs/experimental/#References","page":"Experimental","title":"References","text":"","category":"section"},{"location":"refs/experimental/","page":"Experimental","title":"Experimental","text":"Modules = [OhMyThreads, OhMyThreads.Experimental]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"experimental.jl\"]","category":"page"},{"location":"refs/experimental/#OhMyThreads.Experimental.@barrier-Tuple","page":"Experimental","title":"OhMyThreads.Experimental.@barrier","text":"@barrier\n\nThis can be used inside a @tasks for ... end to synchronize n parallel tasks. Specifically, a task can only pass the @barrier if n-1 other tasks have reached it as well. The value of n is determined from @set ntasks=..., which is required if one wants to use @barrier.\n\nBecause this feature is experimental, it is required to load @barrier explicitly, e.g. 
via using OhMyThreads.Experimental: @barrier.\n\nWARNING: It is the responsibility of the user to ensure that the right number of tasks actually reach the barrier. Otherwise, a deadlock can occur. In particular, if the number of iterations is not a multiple of n, the last few iterations (remainder) will be run by fewer than n tasks, which will never be able to pass a @barrier.\n\nExample\n\nusing OhMyThreads: @tasks\n\n# works\n@tasks for i in 1:20\n @set ntasks = 20\n\n sleep(i * 0.2)\n println(i, \": before\")\n @barrier\n println(i, \": after\")\nend\n\n# wrong - deadlock!\n@tasks for i in 1:22 # niterations % ntasks != 0\n @set ntasks = 20\n\n println(i, \": before\")\n @barrier\n println(i, \": after\")\nend\n\n\n\n\n\n","category":"macro"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-based multithreaded calculations in Julia. Most importantly, with a focus on data parallelism, it provides an API of higher-order functions (e.g. tmapreduce) as well as a macro API @tasks for ... end (conceptually similar to @threads).","category":"page"},{"location":"#Quick-Start","page":"OhMyThreads","title":"Quick Start","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Basic-example","page":"OhMyThreads","title":"Basic example","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"using OhMyThreads: tmapreduce, @tasks\nusing BenchmarkTools: @btime\nusing Base.Threads: nthreads\n\n# Variant 1: function API\nfunction mc_parallel(N; ntasks=nthreads())\n M = tmapreduce(+, 1:N; ntasks) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\n# Variant 2: macro API\nfunction mc_parallel_macro(N; ntasks=nthreads())\n M = @tasks for i in 1:N\n @set begin\n reducer=+\n ntasks=ntasks\n end\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\nmc_parallel(N) # gives, e.g., 3.14159924\n\n@btime mc_parallel($N; ntasks=1) # use a single task (and hence a single thread)\n@btime mc_parallel($N) # using all threads\n@btime mc_parallel_macro($N) # using all threads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"With 5 threads, timings might be something like this:","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"417.282 ms (14 allocations: 912 bytes)\n83.578 ms (38 allocations: 3.08 KiB)\n83.573 ms (38 allocations: 3.08 KiB)","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"(Check out the full Parallel Monte Carlo example if you like.)","category":"page"},{"location":"#No-Transducers","page":"OhMyThreads","title":"No Transducers","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. 
Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.","category":"page"}] } diff --git a/dev/translation/index.html b/dev/translation/index.html index c53f1a7..5cd6e4d 100644 --- a/dev/translation/index.html +++ b/dev/translation/index.html @@ -126,4 +126,4 @@ # or using OhMyThreads: tcollect -data = tcollect(calc(i) for i in 1:10) +data = tcollect(calc(i) for i in 1:10)