From 9f44df26ea42af6037a849529ead6f0985002c84 Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Tue, 6 Feb 2024 16:18:08 +0000 Subject: [PATCH] build based on c05ffa9 --- previews/PR38/.documenter-siteinfo.json | 2 +- .../integration/integration/index.html | 2 +- .../examples/juliaset/juliaset/index.html | 2 +- previews/PR38/examples/mc/mc/index.html | 2 +- previews/PR38/examples/tls/tls.jl | 125 +++++++++++++++--- previews/PR38/examples/tls/tls/index.html | 55 ++++---- previews/PR38/index.html | 2 +- previews/PR38/refs/api/index.html | 14 +- previews/PR38/refs/internal/index.html | 2 +- previews/PR38/search_index.js | 2 +- previews/PR38/translation/index.html | 2 +- 11 files changed, 145 insertions(+), 65 deletions(-) diff --git a/previews/PR38/.documenter-siteinfo.json b/previews/PR38/.documenter-siteinfo.json index 5e928f92..0b780938 100644 --- a/previews/PR38/.documenter-siteinfo.json +++ b/previews/PR38/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-05T17:58:45","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-06T16:18:05","documenter_version":"1.2.1"}} \ No newline at end of file diff --git a/previews/PR38/examples/integration/integration/index.html b/previews/PR38/examples/integration/integration/index.html index f78abf7e..8f554fca 100644 --- a/previews/PR38/examples/integration/integration/index.html +++ b/previews/PR38/examples/integration/integration/index.html @@ -22,4 +22,4 @@ @btime trapezoidal(0, 1, $N); @btime trapezoidal_parallel(0, 1, $N);
  13.871 ms (0 allocations: 0 bytes)
   2.781 ms (38 allocations: 3.19 KiB)
-

Because the problem is trivially parallel - all threads do the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.


This page was generated using Literate.jl.

+

Because the problem is trivially parallel - all threads do the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.
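If you want to experiment with this yourself, the following is a minimal, self-contained sketch of such a trivially parallel integration based on OhMyThreads' tmapreduce. The trapezoidal_sketch helper below is an illustrative stand-in and not necessarily identical to the implementation benchmarked above.

using OhMyThreads: tmapreduce
using Base.Threads: nthreads

# illustrative composite trapezoidal rule for ∫ f(x) dx on [a, b] with n intervals
function trapezoidal_sketch(f, a, b, n)
    h = (b - a) / n
    acc = (f(a) + f(b)) / 2
    for i in 1:(n - 1)
        acc += f(a + i * h)
    end
    return acc * h
end

# parallel version: one subinterval per task, partial integrals summed with tmapreduce
function trapezoidal_parallel_sketch(f, a, b, N; ntasks = nthreads())
    n = N ÷ ntasks                  # intervals per task (assumes ntasks divides N)
    h = (b - a) / N
    return tmapreduce(+, 1:ntasks) do i
        α = a + (i - 1) * n * h     # left endpoint of this task's subinterval
        trapezoidal_sketch(f, α, α + n * h, n)
    end
end

trapezoidal_parallel_sketch(x -> 4 * sqrt(1 - x^2), 0, 1, 10_000_000)  # ≈ π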


This page was generated using Literate.jl.

diff --git a/previews/PR38/examples/juliaset/juliaset/index.html b/previews/PR38/examples/juliaset/juliaset/index.html index acf270d2..e335dd45 100644 --- a/previews/PR38/examples/juliaset/juliaset/index.html +++ b/previews/PR38/examples/juliaset/juliaset/index.html @@ -52,4 +52,4 @@ 63.707 ms (39 allocations: 3.30 KiB)

As hoped, the parallel implementation is faster. But can we improve the performance further?

Tuning nchunks

As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks that can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.

@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;
  32.000 ms (12013 allocations: 1.14 MiB)
 

Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).

@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;
  63.439 ms (42 allocations: 3.37 KiB)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/previews/PR38/examples/mc/mc/index.html b/previews/PR38/examples/mc/mc/index.html index 795fd192..7c5fa89f 100644 --- a/previews/PR38/examples/mc/mc/index.html +++ b/previews/PR38/examples/mc/mc/index.html @@ -51,4 +51,4 @@ @btime mc($(length(idcs))) samples=10 evals=3;
  87.617 ms (0 allocations: 0 bytes)
   63.398 ms (0 allocations: 0 bytes)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/previews/PR38/examples/tls/tls.jl b/previews/PR38/examples/tls/tls.jl index d3f88c8b..f0de5de6 100644 --- a/previews/PR38/examples/tls/tls.jl +++ b/previews/PR38/examples/tls/tls.jl @@ -1,6 +1,16 @@ -using OhMyThreads: TaskLocalValue, tmap, chunks +# # Task-Local Storage +# +# For some programs, it can be useful or even necessary to allocate and (re-)use memory in +# your parallel code. The following section uses a simple example to explain how task-local +# values can be efficiently created and (re-)used. +# +# ## Sequential +# +# Let's say that we are given two arrays of (square) matrices, `As` and `Bs`, and let's +# further assume that our goal is to compute the total sum of all pairwise matrix products. +# We can readily implement a (sequential) function that performs the necessary computations. using LinearAlgebra: mul!, BLAS -using Base.Threads: nthreads, @spawn +BLAS.set_num_threads(1) # for simplicity, we turn of OpenBLAS multithreading function matmulsums(As, Bs) N = size(first(As), 1) @@ -11,6 +21,30 @@ function matmulsums(As, Bs) end end +# Here, we use `map` to perform the desired operation for each pair of matrices, +# `A` and `B`. However, the crucial point for our discussion is that we use the in-place +# matrix multiplication `LinearAlgebra.mul!` in conjunction with a pre-allocated output +# matrix `C`. This is to avoid the temporary allocation per "iteration" (i.e. per matrix +# pair) that we would get with `C = A*B`. +# +# For later comparison, we generate some random input data and store the result. + +As = [rand(1024, 1024) for _ in 1:64] +Bs = [rand(1024, 1024) for _ in 1:64] + +res = matmulsums(As, Bs); + +# ## Parallelization +# +# The key idea for creating a parallel version of `matmulsums` is to replace the `map` by +# OhMyThreads' parallel [`tmap`](@ref) function. However, because we re-use `C`, this isn't +# entirely trivial. +# +# ### The wrong way +# +# Someone new to parallel computing might be tempted to parallelize `matmulsums` like so: +using OhMyThreads: tmap + function matmulsums_race(As, Bs) N = size(first(As), 1) C = Matrix{Float64}(undef, N, N) @@ -20,6 +54,20 @@ function matmulsums_race(As, Bs) end end +# Unfortunately, this doesn't produce the correct result. + +res_race = matmulsums_race(As, Bs) +res ≈ res_race + +# In fact, It doesn't even always produce the same result (check for yourself)! +# The reason is that there is a race condition: different parallel +# tasks are trying to use the shared variable `C` simultaneously leading to +# non-deterministic behavior. Let's see how we can fix this. +# +# ### The naive (and inefficient) way +# +# A simple solution for the race condition issue above is to move the allocation of `C` +# into the body of the parallel `tmap`: function matmulsums_naive(As, Bs) N = size(first(As), 1) tmap(As, Bs) do A, B @@ -29,23 +77,61 @@ function matmulsums_naive(As, Bs) end end +# In this case, a separate `C` will be allocated for each iteration such that parallel tasks +# don't modify shared state anymore. Hence, we'll get the desired result. + +res_naive = matmulsums_naive(As, Bs) +res ≈ res_naive + +# However, this variant is obviously inefficient because it is no better than just writing +# `C = A*B` and thus leads to one allocation per matrix pair. We need a different way of +# allocating and re-using `C` for an efficient parallel version. 
+# +# ## The right way: `TaskLocalValue` +# +# We've seen that we can't allocate `C` once up-front (→ race condition) and also shouldn't +# allocate it within the `tmap` (→ one allocation per iteration). What we actually want is +# to once allocate a separate `C` on each parallel task and then re-use this **task-local** +# `C` for all iterations (i.e. matrix pairs) that said task is responsible for. +# +# The way to express this idea is `TaskLocalValue` and looks like this: +using OhMyThreads: TaskLocalValue + function matmulsums_tls(As, Bs) N = size(first(As), 1) - storage = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) + tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) tmap(As, Bs) do A, B - C = storage[] + C = tls[] mul!(C, A, B) sum(C) end end +res_tls = matmulsums_tls(As, Bs) +res ≈ res_tls + +# Here, `TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))` defines a +# task-local storage `tls` that behaves like this: The first time the storage is accessed +# (`tls[]`) from a task a task-local value is created according to the anonymous function +# (here, the task-local value will be a matrix) and stored in the storage. Afterwards, +# every other storage query from the same task(!) will simply return the task-local value. +# Hence, this is precisely what we need and will only lead to O(# parallel tasks) +# allocations. +# +# ## The performant but cumbersome way +# +# Before we benchmark and compare the performance of all discussed variants, let's implement +# the idea of a task-local `C` for each parallel task manually. +using OhMyThreads: chunks, @spawn +using Base.Threads: nthreads + function matmulsums_manual(As, Bs) N = size(first(As), 1) tasks = map(chunks(As; n = nthreads())) do idcs @spawn begin local C = Matrix{Float64}(undef, N, N) local results = Vector{Float64}(undef, length(idcs)) - @inbounds for (i, idx) in enumerate(idcs) + for (i, idx) in enumerate(idcs) mul!(C, As[idx], Bs[idx]) results[i] = sum(C) end @@ -55,25 +141,28 @@ function matmulsums_manual(As, Bs) reduce(vcat, fetch.(tasks)) end -BLAS.set_num_threads(1) # to avoid potential oversubscription - -As = [rand(1024, 1024) for _ in 1:64] -Bs = [rand(1024, 1024) for _ in 1:64] - -res = matmulsums(As, Bs) -res_race = matmulsums_race(As, Bs) -res_naive = matmulsums_naive(As, Bs) -res_tls = matmulsums_tls(As, Bs) res_manual = matmulsums_manual(As, Bs) - -res ≈ res_race -res ≈ res_naive -res ≈ res_tls res ≈ res_manual +# The first thing to note is pretty obvious: This is very cumbersome and you probably don't +# want to write it. But let's take a closer look and see what's happening here. +# First, we divide the number of matrix pairs into `nthreads()` chunks. Then, for each of +# those chunks, we spawn a parallel task that (1) allocates a task-local `C` matrix (and a +# `results` vector) and (2) performs the actual computations using these pre-allocated +# values. Finally, we `fetch` the results of the tasks and combine them. +# +# ## Benchmark +# +# The whole point of parallelization is increasing performance, so let's benchmark and +# compare the performance of the variants discussed above. + using BenchmarkTools @btime matmulsums($As, $Bs); @btime matmulsums_naive($As, $Bs); @btime matmulsums_tls($As, $Bs); @btime matmulsums_manual($As, $Bs); + +# As we see, the recommened version `matmulsums_tls` is both convenient as well as +# efficient: It allocates much less memory than `matmulsums_naive` and only slightly +# more than the manual implementation. 
diff --git a/previews/PR38/examples/tls/tls/index.html b/previews/PR38/examples/tls/tls/index.html index 26935816..29cc0311 100644 --- a/previews/PR38/examples/tls/tls/index.html +++ b/previews/PR38/examples/tls/tls/index.html @@ -1,7 +1,6 @@ -Task-Local Storage · OhMyThreads.jl
using OhMyThreads: TaskLocalValue, tmap, chunks
-using LinearAlgebra: mul!, BLAS
-using Base.Threads: nthreads, @spawn
+Task-Local Storage · OhMyThreads.jl

Task-Local Storage

For some programs, it can be useful or even necessary to allocate and (re-)use memory in your parallel code. The following section uses a simple example to explain how task-local values can be efficiently created and (re-)used.

Sequential

Let's say that we are given two arrays of (square) matrices, As and Bs, and let's further assume that our goal is to compute the total sum of all pairwise matrix products. We can readily implement a (sequential) function that performs the necessary computations.

using LinearAlgebra: mul!, BLAS
+BLAS.set_num_threads(1) # for simplicity, we turn off OpenBLAS multithreading
 
 function matmulsums(As, Bs)
     N = size(first(As), 1)
@@ -10,7 +9,10 @@
         mul!(C, A, B)
         sum(C)
     end
-end
+end
matmulsums (generic function with 1 method)

Here, we use map to perform the desired operation for each pair of matrices, A and B. However, the crucial point for our discussion is that we use the in-place matrix multiplication LinearAlgebra.mul! in conjunction with a pre-allocated output matrix C. This is to avoid the temporary allocation per "iteration" (i.e. per matrix pair) that we would get with C = A*B.
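To make the allocation argument concrete, here is a small illustrative check (not part of the original example) that compares the out-of-place and in-place products with @allocated. The exact numbers depend on the matrix size and BLAS, but once compiled the in-place call should report (close to) zero bytes.

using LinearAlgebra: mul!

A = rand(256, 256); B = rand(256, 256); C = similar(A)
A * B; mul!(C, A, B)                           # warm up / compile both paths
@allocated(A * B), @allocated(mul!(C, A, B))   # e.g. (524336, 0): fresh result matrix vs. none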

For later comparison, we generate some random input data and store the result.

As = [rand(1024, 1024) for _ in 1:64]
+Bs = [rand(1024, 1024) for _ in 1:64]
+
+res = matmulsums(As, Bs);

Parallelization

The key idea for creating a parallel version of matmulsums is to replace the map by OhMyThreads' parallel tmap function. However, because we re-use C, this isn't entirely trivial.

The wrong way

Someone new to parallel computing might be tempted to parallelize matmulsums like so:

using OhMyThreads: tmap
 
 function matmulsums_race(As, Bs)
     N = size(first(As), 1)
@@ -19,34 +21,38 @@
         mul!(C, A, B)
         sum(C)
     end
-end
-
-function matmulsums_naive(As, Bs)
+end
matmulsums_race (generic function with 1 method)

Unfortunately, this doesn't produce the correct result.

res_race = matmulsums_race(As, Bs)
+res ≈ res_race
false

In fact, it doesn't even always produce the same result (check for yourself)! The reason is that there is a race condition: different parallel tasks are trying to use the shared variable C simultaneously, leading to non-deterministic behavior. Let's see how we can fix this.
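If you want to see the non-determinism for yourself, a quick illustrative check (using the As, Bs, and matmulsums_race from above) is to run the racy version a few times and compare the outcomes; with more than one Julia thread they will typically differ.

racy_runs = [matmulsums_race(As, Bs) for _ in 1:5]   # repeat the racy computation
all(==(first(racy_runs)), racy_runs)                 # typically false when nthreads() > 1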

The naive (and inefficient) way

A simple solution for the race condition issue above is to move the allocation of C into the body of the parallel tmap:

function matmulsums_naive(As, Bs)
     N = size(first(As), 1)
     tmap(As, Bs) do A, B
         C = Matrix{Float64}(undef, N, N)
         mul!(C, A, B)
         sum(C)
     end
-end
+end
matmulsums_naive (generic function with 1 method)

In this case, a separate C will be allocated for each iteration such that parallel tasks don't modify shared state anymore. Hence, we'll get the desired result.

res_naive = matmulsums_naive(As, Bs)
+res ≈ res_naive
true

However, this variant is obviously inefficient because it is no better than just writing C = A*B and thus leads to one allocation per matrix pair. We need a different way of allocating and re-using C for an efficient parallel version.

The right way: TaskLocalValue

We've seen that we can't allocate C once up-front (→ race condition) and also shouldn't allocate it within the tmap (→ one allocation per iteration). What we actually want is to allocate a separate C once on each parallel task and then re-use this task-local C for all iterations (i.e. matrix pairs) that said task is responsible for.

The way to express this idea is TaskLocalValue and looks like this:

using OhMyThreads: TaskLocalValue
 
 function matmulsums_tls(As, Bs)
     N = size(first(As), 1)
-    storage = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
+    tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
     tmap(As, Bs) do A, B
-        C = storage[]
+        C = tls[]
         mul!(C, A, B)
         sum(C)
     end
 end
 
+res_tls = matmulsums_tls(As, Bs)
+res ≈ res_tls
true

Here, TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) defines a task-local storage tls that behaves like this: The first time the storage is accessed (tls[]) from a task, a task-local value is created according to the anonymous function (here, the task-local value will be a matrix) and stored in the storage. Afterwards, every subsequent storage query from the same task(!) will simply return the task-local value. Hence, this is precisely what we need and will only lead to O(# parallel tasks) allocations.
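To see this behavior in isolation from the matrix example, here is a small, self-contained sketch (illustrative, not from the original text): the init function runs once per task that touches the storage, no matter how many iterations that task processes.

using OhMyThreads: TaskLocalValue, tmap
using Base.Threads: Atomic, atomic_add!

ninit = Atomic{Int}(0)                        # counts how often the init function runs
tls_demo = TaskLocalValue{Vector{Float64}}(() -> begin
    atomic_add!(ninit, 1)                     # one increment per task that uses the storage
    zeros(3)                                  # the task-local buffer
end)

tmap(1:100) do i
    buf = tls_demo[]   # first access on a task creates the buffer, later accesses reuse it
    buf .= i
    sum(buf)
end

ninit[]   # at most the number of parallel tasks (nthreads() by default), not 100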

The performant but cumbersome way

Before we benchmark and compare the performance of all discussed variants, let's implement the idea of a task-local C for each parallel task manually.

using OhMyThreads: chunks, @spawn
+using Base.Threads: nthreads
+
 function matmulsums_manual(As, Bs)
     N = size(first(As), 1)
     tasks = map(chunks(As; n = nthreads())) do idcs
         @spawn begin
             local C = Matrix{Float64}(undef, N, N)
             local results = Vector{Float64}(undef, length(idcs))
-            @inbounds for (i, idx) in enumerate(idcs)
+            for (i, idx) in enumerate(idcs)
                 mul!(C, As[idx], Bs[idx])
                 results[i] = sum(C)
             end
@@ -56,29 +62,14 @@
     reduce(vcat, fetch.(tasks))
 end
 
-BLAS.set_num_threads(1) # to avoid potential oversubscription
-
-As = [rand(1024, 1024) for _ in 1:64]
-Bs = [rand(1024, 1024) for _ in 1:64]
-
-res = matmulsums(As, Bs)
-res_race = matmulsums_race(As, Bs)
-res_naive = matmulsums_naive(As, Bs)
-res_tls = matmulsums_tls(As, Bs)
 res_manual = matmulsums_manual(As, Bs)
-
-res ≈ res_race
-res ≈ res_naive
-res ≈ res_tls
-res ≈ res_manual
-
-using BenchmarkTools
+res ≈ res_manual
true

The first thing to note is pretty obvious: This is very cumbersome and you probably don't want to write it. But let's take a closer look and see what's happening here. First, we divide the number of matrix pairs into nthreads() chunks. Then, for each of those chunks, we spawn a parallel task that (1) allocates a task-local C matrix (and a results vector) and (2) performs the actual computations using these pre-allocated values. Finally, we fetch the results of the tasks and combine them.

Benchmark

The whole point of parallelization is increasing performance, so let's benchmark and compare the performance of the variants discussed above.

using BenchmarkTools
 
 @btime matmulsums($As, $Bs);
 @btime matmulsums_naive($As, $Bs);
 @btime matmulsums_tls($As, $Bs);
-@btime matmulsums_manual($As, $Bs);
  3.107 s (3 allocations: 8.00 MiB)
-  686.432 ms (174 allocations: 512.01 MiB)
-  792.403 ms (67 allocations: 40.01 MiB)
-  684.626 ms (51 allocations: 40.00 MiB)
-

This page was generated using Literate.jl.

+@btime matmulsums_manual($As, $Bs);
  2.916 s (3 allocations: 8.00 MiB)
+  597.915 ms (174 allocations: 512.01 MiB)
+  575.507 ms (67 allocations: 40.01 MiB)
+  572.501 ms (49 allocations: 40.00 MiB)
+

As we see, the recommended version matmulsums_tls is both convenient and efficient: It allocates much less memory than matmulsums_naive and only slightly more than the manual implementation.


This page was generated using Literate.jl.

diff --git a/previews/PR38/index.html b/previews/PR38/index.html index d66aa83d..50a3973f 100644 --- a/previews/PR38/index.html +++ b/previews/PR38/index.html @@ -18,4 +18,4 @@ @btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread @btime mc_parallel($N) # running with all 5 Julia threads

Timings might be something like this:

438.394 ms (7 allocations: 624 bytes)
-88.050 ms (37 allocations: 3.02 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

+88.050 ms (37 allocations: 3.02 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

diff --git a/previews/PR38/refs/api/index.html b/previews/PR38/refs/api/index.html index c2f0308d..d15af908 100644 --- a/previews/PR38/refs/api/index.html +++ b/previews/PR38/refs/api/index.html @@ -4,29 +4,29 @@ nchunks::Int = nthreads(), split::Symbol = :batch, schedule::Symbol =:dynamic, - outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmapreduce.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
+           outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmapreduce.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.
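For illustration (values chosen arbitrarily, not part of the original docstring), a call that uses some of these keyword arguments could look as follows:

using OhMyThreads: tmapreduce
using Base.Threads: nthreads

# more chunks than threads for load balancing; explicit init for the sequential parts
tmapreduce(√, +, 1:10_000; init = 0.0, nchunks = 4 * nthreads(), schedule = :dynamic)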

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
         [init],
         nchunks::Int = nthreads(),
         split::Symbol = :batch,
         schedule::Symbol =:dynamic,
-        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treduce.

Example:

    treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of sum([1, 2, 3, 4, 5]) in the form

    (1 + 2) + (3 + 4) + 5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...;
+        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treduce.

Example:

    treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of sum([1, 2, 3, 4, 5]) in the form

    (1 + 2) + (3 + 4) + 5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.
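As an illustrative example (not part of the original docstring), a uniform workload like a plain sum can use the static schedule:

using OhMyThreads: treduce

treduce(+, 1:100_000; schedule = :static)   # parallel version of sum(1:100_000)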

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...;
      nchunks::Int = nthreads(),
      split::Symbol = :batch,
-     schedule::Symbol =:dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tmap.

Example:

    tmap(sin, 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
+     schedule::Symbol =:dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tmap.

Example:

    tmap(sin, 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
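For instance (illustrative values, not part of the original docstring), providing the output element type and tuning nchunks could look like this:

using OhMyThreads: tmap
using Base.Threads: nthreads

tmap(sin, Float64, 1:1000; nchunks = 2 * nthreads())   # element type given explicitly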
source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
       nchunks::Int = nthreads(),
       split::Symbol = :batch,
-      schedule::Symbol =:dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmap!.

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
+      schedule::Symbol =:dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmap!.

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
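A small illustrative call (not part of the original docstring), writing into a pre-allocated output array:

using OhMyThreads: tmap!

out = zeros(10)
tmap!(sin, out, 1:10; schedule = :static)   # fills out[i] = sin(i) in parallel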
source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
          nchunks::Int = nthreads(),
          split::Symbol = :batch,
          schedule::Symbol =:dynamic) :: Nothing

A multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of

for x in A
     f(x)
 end

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tforeach.

Example:

    tforeach(1:10) do i
         println(i^2)
-    end

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
+    end

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
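As an illustrative use (not part of the original docstring), writing to disjoint indices of a pre-allocated array is safe because each index is touched by exactly one task:

using OhMyThreads: tforeach

squares = zeros(Int, 100)
tforeach(1:100; nchunks = 10) do i
    squares[i] = i^2   # disjoint indices, so no race between tasks
end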
source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
          nchunks::Int = nthreads(),
-         schedule::Symbol =:dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tcollect.

Example:

    tcollect(sin(i) for i in 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
+         schedule::Symbol =:dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tcollect.

Example:

    tcollect(sin(i) for i in 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
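An illustrative call (not part of the original docstring) that also provides the element type for the collected result:

using OhMyThreads: tcollect

tcollect(Float64, (sin(i) for i in 1:10); nchunks = 2)   # generator over an AbstractArray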
source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
            [init],
            nchunks::Int = nthreads(),
            split::Symbol = :batch,
            schedule::Symbol =:dynamic,
-           outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treducemap.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source

as well as the following re-exported functions:

chunks: see ChunkSplitters.jl

Non-Exported

OhMyThreads.@spawn: see StableTasks.jl
OhMyThreads.@spawnat: see StableTasks.jl
+ outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treducemap.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.
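The do-block convenience mentioned above means the reducing operator can be written as the do-block while the mapping function is passed as the second argument; an illustrative sketch (not part of the original docstring):

using OhMyThreads: treducemap

# op is supplied by the do-block, f (here abs2) is the second positional argument
treducemap(abs2, 1:1000) do a, b
    a + b   # must be (approximately) associative
end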

source

as well as the following re-exported functions:

chunks: see ChunkSplitters.jl

Non-Exported

OhMyThreads.@spawn: see StableTasks.jl
OhMyThreads.@spawnat: see StableTasks.jl
diff --git a/previews/PR38/refs/internal/index.html b/previews/PR38/refs/internal/index.html index 2bb6fe03..c28ab351 100644 --- a/previews/PR38/refs/internal/index.html +++ b/previews/PR38/refs/internal/index.html @@ -1,2 +1,2 @@ -Internal · OhMyThreads.jl
+Internal · OhMyThreads.jl
diff --git a/previews/PR38/search_index.js b/previews/PR38/search_index.js index 42994573..cbb8e39e 100644 --- a/previews/PR38/search_index.js +++ b/previews/PR38/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"examples/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. Hence, the computational workload is non-uniform.","category":"page"},{"location":"examples/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. 
We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 138.377 ms (0 allocations: 0 bytes)\n 63.707 ms (39 allocations: 3.30 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is faster. But can we improve the performance further?","category":"page"},{"location":"examples/juliaset/juliaset/#Tuning-nchunks","page":"Julia Set","title":"Tuning nchunks","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. 
This divides the overall workload into smaller tasks than can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 32.000 ms (12013 allocations: 1.14 MiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 63.439 ms (42 allocations: 3.37 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"examples/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at a unit square and count how many of them (M) landed inside of a unit circle. Approximate pi approx 4MN.","category":"page"},{"location":"examples/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.141517","category":"page"},{"location":"examples/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. 
For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N)\n M = tmapreduce(+, 1:N) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nmc_parallel(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14159924","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 5\n 318.467 ms (0 allocations: 0 bytes)\n 88.553 ms (37 allocations: 3.02 KiB)\n","category":"page"},{"location":"examples/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.1415844","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 63.825 ms (31 allocations: 2.80 KiB)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 
87.617 ms (0 allocations: 0 bytes)\n 63.398 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"The following is internal, i.e. not public API, and might change at any point.","category":"page"},{"location":"refs/internal/#Index","page":"Internal","title":"Index","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Pages = [\"internal.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"Returns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. This identifier will be unique so long as references to the task it came from still exist. \n\n\n\n\n\n","category":"method"},{"location":"refs/api/#API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Index","page":"Public API","title":"Index","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Pages = [\"api.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). 
If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tmapreduce.\n\nExample:\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n (√1 + √2) + (√3 + √4) + √5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). 
If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??treduce.\n\nExample:\n\n treduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum([1, 2, 3, 4, 5]) in the form\n\n (1 + 2) + (3 + 4) + 5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). 
This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nTo see the keyword argument options, check out ??tmap.\n\nExample:\n\n tmap(sin, 1:10)\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tmap!.\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. 
Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tforeach.\n\nExample:\n\n tforeach(1:10) do i\n println(i^2)\n end\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n nchunks::Int = nthreads(),\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.collect. 
Essentially just calls tmap on the generator function and inputs.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nTo see the keyword argument options, check out ??tcollect.\n\nExample:\n\n tcollect(sin(i) for i in 1:10)\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??treducemap.\n\nExample:\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n (√1 + √2) + (√3 + √4) + √5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. 
Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"as well as the following re-exported functions:","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nchunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl","category":"page"},{"location":"translation/#Translation-Guide","page":"Translation Guide","title":"Translation Guide","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"This page tries to give a general overview of how to translate patterns written with the built-in tools of Base.Threads using the OhMyThreads.jl API. 
Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into OhMyThreads.jl.","category":"page"},{"location":"translation/#Basics","page":"Translation Guide","title":"Basics","text":"","category":"section"},{"location":"translation/#@threads","page":"Translation Guide","title":"@threads","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10) do i\n println(i)\nend","category":"page"},{"location":"translation/#:static-scheduling","page":"Translation Guide","title":":static scheduling","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads :static for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; schedule=:static) do i\n println(i)\nend","category":"page"},{"location":"translation/#@spawn","page":"Translation Guide","title":"@spawn","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@sync for i in 1:10\n @spawn println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; nchunks=10) do i\n println(i)\nend","category":"page"},{"location":"translation/#Reduction","page":"Translation Guide","title":"Reduction","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"No built-in feature in Base.Threads.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads: basic manual implementation\ndata = rand(10)\nchunks_itr = Iterators.partition(data, length(data) ÷ nthreads())\ntasks = map(chunks_itr) do chunk\n @spawn reduce(+, chunk)\nend\nreduce(+, fetch.(tasks))","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ndata = rand(10)\ntreduce(+, data)","category":"page"},{"location":"translation/#Mutation","page":"Translation Guide","title":"Mutation","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). 
You should carefully consider whether this is necessary or whether the use of task-local storage is the better option.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = rand(10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = rand(10)\ntforeach(data) do i\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = rand(10)\ntmap!(data, data) do i # this kind of aliasing is fine\n calc(i)\nend","category":"page"},{"location":"translation/#Parallel-initialization","page":"Translation Guide","title":"Parallel initialization","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = Vector{Float64}(undef, 10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = tmap(i->calc(i), 1:10)","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = tcollect(calc(i) for i in 1:10)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"EditURL = \"tls.jl\"","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using OhMyThreads: TaskLocalValue, tmap, chunks\nusing LinearAlgebra: mul!, BLAS\nusing Base.Threads: nthreads, @spawn\n\nfunction matmulsums(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n map(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend\n\nfunction matmulsums_race(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n tmap(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend\n\nfunction matmulsums_naive(As, Bs)\n N = size(first(As), 1)\n tmap(As, Bs) do A, B\n C = Matrix{Float64}(undef, N, N)\n mul!(C, A, B)\n sum(C)\n end\nend\n\nfunction matmulsums_tls(As, Bs)\n N = size(first(As), 1)\n storage = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))\n tmap(As, Bs) do A, B\n C = storage[]\n mul!(C, A, B)\n sum(C)\n end\nend\n\nfunction matmulsums_manual(As, Bs)\n N = size(first(As), 1)\n tasks = map(chunks(As; n = nthreads())) do idcs\n @spawn begin\n local C = Matrix{Float64}(undef, N, N)\n local results = Vector{Float64}(undef, length(idcs))\n @inbounds for (i, idx) in enumerate(idcs)\n mul!(C, As[idx], Bs[idx])\n results[i] = sum(C)\n end\n results\n end\n end\n reduce(vcat, fetch.(tasks))\nend\n\nBLAS.set_num_threads(1) # to avoid potential oversubscription\n\nAs = [rand(1024, 1024) for _ in 1:64]\nBs = [rand(1024, 1024) for _ in 1:64]\n\nres = matmulsums(As, Bs)\nres_race = matmulsums_race(As, Bs)\nres_naive = matmulsums_naive(As, Bs)\nres_tls = matmulsums_tls(As, Bs)\nres_manual = matmulsums_manual(As, Bs)\n\nres ≈ res_race\nres ≈ res_naive\nres ≈ res_tls\nres ≈ res_manual\n\nusing BenchmarkTools\n\n@btime matmulsums($As, $Bs);\n@btime matmulsums_naive($As, $Bs);\n@btime matmulsums_tls($As, $Bs);\n@btime matmulsums_manual($As, $Bs);","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":" 3.107 s (3 allocations: 8.00 
MiB)\n 686.432 ms (174 allocations: 512.01 MiB)\n 792.403 ms (67 allocations: 40.01 MiB)\n 684.626 ms (51 allocations: 40.00 MiB)\n","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"This page was generated using Literate.jl.","category":"page"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.","category":"page"},{"location":"#Quick-Start","page":"OhMyThreads","title":"Quick Start","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Basic-example","page":"OhMyThreads","title":"Basic example","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"using OhMyThreads\n\nfunction mc_parallel(N; kw...)\n M = tmapreduce(+, 1:N; kw...) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\nmc_parallel(N) # gives, e.g., 3.14159924\n\nusing BenchmarkTools\n\n@show Threads.nthreads() # 5 in this example\n\n@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread\n@btime mc_parallel($N) # running with all 5 Julia threads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Timings might be something like this:","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"438.394 ms (7 allocations: 624 bytes)\n88.050 ms (37 allocations: 3.02 KiB)","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"(Check out the full Parallel Monte Carlo example if you like.)","category":"page"},{"location":"#No-Transducers","page":"OhMyThreads","title":"No Transducers","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. 
Check out the list of contributors for more information.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"EditURL = \"integration.jl\"","category":"page"},{"location":"examples/integration/integration/#Trapezoidal-Integration","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"In this example, we want to parallelize the computation of a simple numerical integral via the trapezoidal rule. The latter is given by","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"int_a^bf(x)dx approx h sum_i=1^Nfracf(x_i-1)+f(x_i)2","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The function to be integrated is the following.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f(x) = 4 * √(1 - x^2)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The analytic result of the definite integral (from 0 to 1) is known to be pi.","category":"page"},{"location":"examples/integration/integration/#Sequential","page":"Trapezoidal Integration","title":"Sequential","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Naturally, we implement the trapezoidal rule as a straightforward, sequential for loop.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"function trapezoidal(a, b, n; h = (b - a) / n)\n y = (f(a) + f(b)) / 2.0\n for i in 1:(n - 1)\n x = a + i * h\n y = y + f(x)\n end\n return y * h\nend","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Let's compute the integral of f above and see if we get the expected result. 
For simplicity, we choose N, the number of panels used to discretize the integration interval, as a multiple of the number of available Julia threads.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using Base.Threads: nthreads\n@show nthreads()\nN = nthreads() * 1_000_000","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"5000000","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Calling trapezoidal we do indeed find the (approximate) value of pi.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal(0, 1, N) ≈ π","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"examples/integration/integration/#Parallel","page":"Trapezoidal Integration","title":"Parallel","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Our strategy is the following: Divide the integration interval among the available Julia threads. On each thread, use the sequential trapezoidal rule to compute the partial integral. It is straightforward to implement this strategy with tmapreduce. The map part is, essentially, the application of trapezoidal and the reduction operator is chosen to be + to sum up the local integrals.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using OhMyThreads\n\nfunction trapezoidal_parallel(a, b, N)\n n = N ÷ nthreads()\n h = (b - a) / N\n return tmapreduce(+, 1:nthreads()) do i\n local α = a + (i - 1) * n * h\n local β = α + n * h\n trapezoidal(α, β, n; h)\n end\nend","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"First, we check the correctness of our parallel implementation.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel(0, 1, N) ≈ π","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Then, we benchmark and compare the performance of the sequential and parallel versions.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using BenchmarkTools\n@btime trapezoidal(0, 1, $N);\n@btime trapezoidal_parallel(0, 1, $N);","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":" 13.871 ms (0 allocations: 0 bytes)\n 2.781 ms (38 allocations: 3.19 
KiB)\n","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Because the problem is trivially parallel - all threads to the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"This page was generated using Literate.jl.","category":"page"}] +[{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"examples/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. 
Hence, the computational workload is non-uniform.","category":"page"},{"location":"examples/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 138.377 ms (0 allocations: 0 bytes)\n 63.707 ms (39 allocations: 3.30 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is faster. But can we improve the performance further?","category":"page"},{"location":"examples/juliaset/juliaset/#Tuning-nchunks","page":"Julia Set","title":"Tuning nchunks","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. 
The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks that can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 32.000 ms (12013 allocations: 1.14 MiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 63.439 ms (42 allocations: 3.37 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"examples/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at the square and count how many of them (M) landed inside of the unit circle. 
Approximate pi approx 4MN.","category":"page"},{"location":"examples/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.141517","category":"page"},{"location":"examples/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N)\n M = tmapreduce(+, 1:N) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nmc_parallel(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14159924","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 5\n 318.467 ms (0 allocations: 0 bytes)\n 88.553 ms (37 allocations: 3.02 KiB)\n","category":"page"},{"location":"examples/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. 
Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.1415844","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 63.825 ms (31 allocations: 2.80 KiB)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 87.617 ms (0 allocations: 0 bytes)\n 63.398 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"The following is internal, i.e. not public API, and might change at any point.","category":"page"},{"location":"refs/internal/#Index","page":"Internal","title":"Index","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Pages = [\"internal.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"Returns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. This identifier will be unique so long as references to the task it came from still exist. 
\n\n\n\n\n\n","category":"method"},{"location":"refs/api/#API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Index","page":"Public API","title":"Index","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Pages = [\"api.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tmapreduce.\n\nExample:\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n (√1 + √2) + (√3 + √4) + √5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. 
This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??treduce.\n\nExample:\n\n treduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum([1, 2, 3, 4, 5]) in the form\n\n (1 + 2) + (3 + 4) + 5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. 
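To make the keyword arguments of tmapreduce (and the identical ones of treduce) a bit more concrete, here is a hedged sketch; the data and the numbers chosen for nchunks are purely illustrative.

```julia
using OhMyThreads: tmapreduce, treduce
using Base.Threads: nthreads

data = rand(1_000_000)

# Default: contiguous chunks, one task per chunk, dynamic scheduling.
tmapreduce(abs2, +, data)

# More chunks than threads for load balancing, plus an explicit neutral element.
tmapreduce(abs2, +, data; nchunks = 4 * nthreads(), schedule = :dynamic, init = 0.0)

# treduce is the same machinery without the mapping function.
treduce(+, data; nchunks = 2 * nthreads())
```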
This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nTo see the keyword argument options, check out ??tmap.\n\nExample:\n\n tmap(sin, 1:10)\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tmap!.\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. 
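As a small sketch of the optional OutputElementType argument of tmap described above (the inputs are illustrative):

```julia
using OhMyThreads: tmap

# Element type inferred from the outputs
tmap(i -> i^2, 1:10)

# Element type asserted up front, which generally incurs fewer allocations
tmap(i -> i^2, Int, 1:10)

# The keyword options work just like for tmapreduce
tmap(i -> i^2, Int, 1:10; nchunks = 2, split = :batch)
```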
Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??tforeach.\n\nExample:\n\n tforeach(1:10) do i\n println(i^2)\n end\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. 
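A minimal sketch of tmap! as documented above, writing into a pre-allocated output container:

```julia
using OhMyThreads: tmap!

A = collect(1:10)
out = similar(A)

tmap!(x -> 2x, out, A)  # out[i] = f(A[i]), filled by parallel tasks

out == 2 .* A           # true
```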
This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n nchunks::Int = nthreads(),\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs.\n\nThe optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nTo see the keyword argument options, check out ??tcollect.\n\nExample:\n\n tcollect(sin(i) for i in 1:10)\n\nExtended help\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.\n\nNote that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor parallelization, the data is divided into chunks and a parallel task is created per chunk.\n\nTo see the keyword argument options, check out ??treducemap.\n\nExample:\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form\n\n (√1 + √2) + (√3 + √4) + √5\n\nExtended help\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. 
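A quick sketch of tcollect as documented above, with and without the optional OutputElementType (the generator is illustrative):

```julia
using OhMyThreads: tcollect

tcollect(sin(i) for i in 1:10)                          # element type inferred
tcollect(Float64, (sin(i) for i in 1:10); nchunks = 2)  # element type asserted up front
```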
Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"as well as the following re-exported functions:","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nchunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl","category":"page"},{"location":"translation/#Translation-Guide","page":"Translation Guide","title":"Translation Guide","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"This page tries to give a general overview of how to translate patterns written with the built-in tools of Base.Threads using the OhMyThreads.jl API. 
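To illustrate the switched argument order of treducemap described above: positionally, op comes first and f second, and with do-block notation the block therefore supplies the reducing operator. A small sketch:

```julia
using OhMyThreads: treducemap, tmapreduce

# Positionally: op first, then f — the reverse of tmapreduce
treducemap(+, abs2, [1, 2, 3]) == tmapreduce(abs2, +, [1, 2, 3])  # both give 14

# With do-block notation, the block becomes the reducing operator `op`
treducemap(abs2, [1, 2, 3]) do a, b
    a + b
end  # 14 as well
```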
Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into OhMyThreads.jl.","category":"page"},{"location":"translation/#Basics","page":"Translation Guide","title":"Basics","text":"","category":"section"},{"location":"translation/#@threads","page":"Translation Guide","title":"@threads","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10) do i\n println(i)\nend","category":"page"},{"location":"translation/#:static-scheduling","page":"Translation Guide","title":":static scheduling","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads :static for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; schedule=:static) do i\n println(i)\nend","category":"page"},{"location":"translation/#@spawn","page":"Translation Guide","title":"@spawn","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@sync for i in 1:10\n @spawn println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; nchunks=10) do i\n println(i)\nend","category":"page"},{"location":"translation/#Reduction","page":"Translation Guide","title":"Reduction","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"No built-in feature in Base.Threads.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads: basic manual implementation\ndata = rand(10)\nchunks_itr = Iterators.partition(data, length(data) ÷ nthreads())\ntasks = map(chunks_itr) do chunk\n @spawn reduce(+, chunk)\nend\nreduce(+, fetch.(tasks))","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ndata = rand(10)\ntreduce(+, data)","category":"page"},{"location":"translation/#Mutation","page":"Translation Guide","title":"Mutation","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). 
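The reduction pattern above extends naturally to a map-reduce; a hedged sketch (with abs2 as an illustrative mapped function):

```julia
using Base.Threads: @spawn, nthreads
using OhMyThreads: tmapreduce

# Base.Threads: manual map-reduce over chunks
data = rand(100)
chunks_itr = Iterators.partition(data, cld(length(data), nthreads()))
tasks = map(chunks_itr) do chunk
    @spawn mapreduce(abs2, +, chunk)
end
reduce(+, fetch.(tasks))

# OhMyThreads
tmapreduce(abs2, +, data)
```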
You should carefully consider whether this is necessary or whether the use of task-local storage is the better option.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = rand(10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = rand(10)\ntforeach(data) do i\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = rand(10)\ntmap!(data, data) do i # this kind of aliasing is fine\n calc(i)\nend","category":"page"},{"location":"translation/#Parallel-initialization","page":"Translation Guide","title":"Parallel initialization","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = Vector{Float64}(undef, 10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = tmap(i->calc(i), 1:10)","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = tcollect(calc(i) for i in 1:10)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"EditURL = \"tls.jl\"","category":"page"},{"location":"examples/tls/tls/#Task-Local-Storage","page":"Task-Local Storage","title":"Task-Local Storage","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"For some programs, it can be useful or even necessary to allocate and (re-)use memory in your parallel code. The following section uses a simple example to explain how task-local values can be efficiently created and (re-)used.","category":"page"},{"location":"examples/tls/tls/#Sequential","page":"Task-Local Storage","title":"Sequential","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Let's say that we are given two arrays of (square) matrices, As and Bs, and let's further assume that our goal is to compute the total sum of all pairwise matrix products. We can readily implement a (sequential) function that performs the necessary computations.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using LinearAlgebra: mul!, BLAS\nBLAS.set_num_threads(1) # for simplicity, we turn of OpenBLAS multithreading\n\nfunction matmulsums(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n map(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"matmulsums (generic function with 1 method)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Here, we use map to perform the desired operation for each pair of matrices, A and B. However, the crucial point for our discussion is that we use the in-place matrix multiplication LinearAlgebra.mul! in conjunction with a pre-allocated output matrix C. This is to avoid the temporary allocation per \"iteration\" (i.e. 
per matrix pair) that we would get with C = A*B.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"For later comparison, we generate some random input data and store the result.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"As = [rand(1024, 1024) for _ in 1:64]\nBs = [rand(1024, 1024) for _ in 1:64]\n\nres = matmulsums(As, Bs);","category":"page"},{"location":"examples/tls/tls/#Parallelization","page":"Task-Local Storage","title":"Parallelization","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"The key idea for creating a parallel version of matmulsums is to replace the map by OhMyThreads' parallel tmap function. However, because we re-use C, this isn't entirely trivial.","category":"page"},{"location":"examples/tls/tls/#The-wrong-way","page":"Task-Local Storage","title":"The wrong way","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Someone new to parallel computing might be tempted to parallelize matmulsums like so:","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using OhMyThreads: tmap\n\nfunction matmulsums_race(As, Bs)\n N = size(first(As), 1)\n C = Matrix{Float64}(undef, N, N)\n tmap(As, Bs) do A, B\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"matmulsums_race (generic function with 1 method)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Unfortunately, this doesn't produce the correct result.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"res_race = matmulsums_race(As, Bs)\nres ≈ res_race","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"false","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"In fact, It doesn't even always produce the same result (check for yourself)! The reason is that there is a race condition: different parallel tasks are trying to use the shared variable C simultaneously leading to non-deterministic behavior. 
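One quick way to see the non-determinism for yourself (a sketch assuming As, Bs, and res from above) is to run the racy version several times:

```julia
# Each run races on the shared C, so the sums typically differ between runs
# and from the sequential reference `res`.
runs = [matmulsums_race(As, Bs) for _ in 1:5]

any(r -> r ≈ res, runs)  # usually false
allequal(runs)           # often false as well: the output is non-deterministic
```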
Let's see how we can fix this.","category":"page"},{"location":"examples/tls/tls/#The-naive-(and-inefficient)-way","page":"Task-Local Storage","title":"The naive (and inefficient) way","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"A simple solution for the race condition issue above is to move the allocation of C into the body of the parallel tmap:","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"function matmulsums_naive(As, Bs)\n N = size(first(As), 1)\n tmap(As, Bs) do A, B\n C = Matrix{Float64}(undef, N, N)\n mul!(C, A, B)\n sum(C)\n end\nend","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"matmulsums_naive (generic function with 1 method)","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"In this case, a separate C will be allocated for each iteration such that parallel tasks don't modify shared state anymore. Hence, we'll get the desired result.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"res_naive = matmulsums_naive(As, Bs)\nres ≈ res_naive","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"true","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"However, this variant is obviously inefficient because it is no better than just writing C = A*B and thus leads to one allocation per matrix pair. We need a different way of allocating and re-using C for an efficient parallel version.","category":"page"},{"location":"examples/tls/tls/#The-right-way:-TaskLocalValue","page":"Task-Local Storage","title":"The right way: TaskLocalValue","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"We've seen that we can't allocate C once up-front (→ race condition) and also shouldn't allocate it within the tmap (→ one allocation per iteration). What we actually want is to once allocate a separate C on each parallel task and then re-use this task-local C for all iterations (i.e. 
matrix pairs) that said task is responsible for.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"The way to express this idea is TaskLocalValue and looks like this:","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using OhMyThreads: TaskLocalValue\n\nfunction matmulsums_tls(As, Bs)\n N = size(first(As), 1)\n tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))\n tmap(As, Bs) do A, B\n C = tls[]\n mul!(C, A, B)\n sum(C)\n end\nend\n\nres_tls = matmulsums_tls(As, Bs)\nres ≈ res_tls","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"true","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Here, TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N)) defines a task-local storage tls that behaves like this: The first time the storage is accessed (tls[]) from a task a task-local value is created according to the anonymous function (here, the task-local value will be a matrix) and stored in the storage. Afterwards, every other storage query from the same task(!) will simply return the task-local value. Hence, this is precisely what we need and will only lead to O(# parallel tasks) allocations.","category":"page"},{"location":"examples/tls/tls/#The-performant-but-cumbersome-way","page":"Task-Local Storage","title":"The performant but cumbersome way","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"Before we benchmark and compare the performance of all discussed variants, let's implement the idea of a task-local C for each parallel task manually.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using OhMyThreads: chunks, @spawn\nusing Base.Threads: nthreads\n\nfunction matmulsums_manual(As, Bs)\n N = size(first(As), 1)\n tasks = map(chunks(As; n = nthreads())) do idcs\n @spawn begin\n local C = Matrix{Float64}(undef, N, N)\n local results = Vector{Float64}(undef, length(idcs))\n for (i, idx) in enumerate(idcs)\n mul!(C, As[idx], Bs[idx])\n results[i] = sum(C)\n end\n results\n end\n end\n reduce(vcat, fetch.(tasks))\nend\n\nres_manual = matmulsums_manual(As, Bs)\nres ≈ res_manual","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"true","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"The first thing to note is pretty obvious: This is very cumbersome and you probably don't want to write it. But let's take a closer look and see what's happening here. First, we divide the number of matrix pairs into nthreads() chunks. Then, for each of those chunks, we spawn a parallel task that (1) allocates a task-local C matrix (and a results vector) and (2) performs the actual computations using these pre-allocated values. 
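To make the "initialize once per task, reuse per iteration" behaviour of TaskLocalValue concrete, here is a small standalone sketch (independent of the matmul example) that counts how often the initializer actually runs:

```julia
using OhMyThreads: TaskLocalValue, tmap
using Base.Threads: Atomic, atomic_add!, nthreads

init_count = Atomic{Int}(0)
make_buffer() = (atomic_add!(init_count, 1); zeros(3))  # counts every initialization
tls = TaskLocalValue{Vector{Float64}}(make_buffer)

tmap(1:1_000) do i
    buf = tls[]  # first access per task initializes, later accesses reuse
    sum(buf) + i
end

init_count[]  # number of parallel tasks (default nchunks = nthreads()), not 1_000
```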
Finally, we fetch the results of the tasks and combine them.","category":"page"},{"location":"examples/tls/tls/#Benchmark","page":"Task-Local Storage","title":"Benchmark","text":"","category":"section"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"The whole point of parallelization is increasing performance, so let's benchmark and compare the performance of the variants discussed above.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"using BenchmarkTools\n\n@btime matmulsums($As, $Bs);\n@btime matmulsums_naive($As, $Bs);\n@btime matmulsums_tls($As, $Bs);\n@btime matmulsums_manual($As, $Bs);","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":" 2.916 s (3 allocations: 8.00 MiB)\n 597.915 ms (174 allocations: 512.01 MiB)\n 575.507 ms (67 allocations: 40.01 MiB)\n 572.501 ms (49 allocations: 40.00 MiB)\n","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"As we see, the recommened version matmulsums_tls is both convenient as well as efficient: It allocates much less memory than matmulsums_naive and only slightly more than the manual implementation.","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"","category":"page"},{"location":"examples/tls/tls/","page":"Task-Local Storage","title":"Task-Local Storage","text":"This page was generated using Literate.jl.","category":"page"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.","category":"page"},{"location":"#Quick-Start","page":"OhMyThreads","title":"Quick Start","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Basic-example","page":"OhMyThreads","title":"Basic example","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"using OhMyThreads\n\nfunction mc_parallel(N; kw...)\n M = tmapreduce(+, 1:N; kw...) 
do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\nmc_parallel(N) # gives, e.g., 3.14159924\n\nusing BenchmarkTools\n\n@show Threads.nthreads() # 5 in this example\n\n@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread\n@btime mc_parallel($N) # running with all 5 Julia threads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Timings might be something like this:","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"438.394 ms (7 allocations: 624 bytes)\n88.050 ms (37 allocations: 3.02 KiB)","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"(Check out the full Parallel Monte Carlo example if you like.)","category":"page"},{"location":"#No-Transducers","page":"OhMyThreads","title":"No Transducers","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"EditURL = \"integration.jl\"","category":"page"},{"location":"examples/integration/integration/#Trapezoidal-Integration","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"In this example, we want to parallelize the computation of a simple numerical integral via the trapezoidal rule. 
The latter is given by","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"int_a^bf(x)dx approx h sum_i=1^Nfracf(x_i-1)+f(x_i)2","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The function to be integrated is the following.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f(x) = 4 * √(1 - x^2)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"f (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"The analytic result of the definite integral (from 0 to 1) is known to be pi.","category":"page"},{"location":"examples/integration/integration/#Sequential","page":"Trapezoidal Integration","title":"Sequential","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Naturally, we implement the trapezoidal rule as a straightforward, sequential for loop.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"function trapezoidal(a, b, n; h = (b - a) / n)\n y = (f(a) + f(b)) / 2.0\n for i in 1:(n - 1)\n x = a + i * h\n y = y + f(x)\n end\n return y * h\nend","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Let's compute the integral of f above and see if we get the expected result. For simplicity, we choose N, the number of panels used to discretize the integration interval, as a multiple of the number of available Julia threads.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using Base.Threads: nthreads\n@show nthreads()\nN = nthreads() * 1_000_000","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"5000000","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Calling trapezoidal we do indeed find the (approximate) value of pi.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal(0, 1, N) ≈ π","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"examples/integration/integration/#Parallel","page":"Trapezoidal Integration","title":"Parallel","text":"","category":"section"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Our strategy is the following: Divide the integration interval among the available Julia threads. 
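Before writing the parallel version, we can check the decomposition idea sequentially (a sketch using f, trapezoidal, and N from above): summing the rule over per-thread sub-intervals reproduces the full-interval result up to rounding.

```julia
using Base.Threads: nthreads

n = N ÷ nthreads()
h = 1 / N
partial = sum(1:nthreads()) do i
    α = (i - 1) * n * h
    trapezoidal(α, α + n * h, n; h)
end

partial ≈ trapezoidal(0, 1, N)  # true
```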
On each thread, use the sequential trapezoidal rule to compute the partial integral. It is straightforward to implement this strategy with tmapreduce. The map part is, essentially, the application of trapezoidal and the reduction operator is chosen to be + to sum up the local integrals.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using OhMyThreads\n\nfunction trapezoidal_parallel(a, b, N)\n n = N ÷ nthreads()\n h = (b - a) / N\n return tmapreduce(+, 1:nthreads()) do i\n local α = a + (i - 1) * n * h\n local β = α + n * h\n trapezoidal(α, β, n; h)\n end\nend","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel (generic function with 1 method)","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"First, we check the correctness of our parallel implementation.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"trapezoidal_parallel(0, 1, N) ≈ π","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"true","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Then, we benchmark and compare the performance of the sequential and parallel versions.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"using BenchmarkTools\n@btime trapezoidal(0, 1, $N);\n@btime trapezoidal_parallel(0, 1, $N);","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":" 13.871 ms (0 allocations: 0 bytes)\n 2.781 ms (38 allocations: 3.19 KiB)\n","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"Because the problem is trivially parallel - all threads to the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"","category":"page"},{"location":"examples/integration/integration/","page":"Trapezoidal Integration","title":"Trapezoidal Integration","text":"This page was generated using Literate.jl.","category":"page"}] } diff --git a/previews/PR38/translation/index.html b/previews/PR38/translation/index.html index 7b8604e7..e7f9d12e 100644 --- a/previews/PR38/translation/index.html +++ b/previews/PR38/translation/index.html @@ -43,4 +43,4 @@ data[i] = calc(i) end
# OhMyThreads: Variant 1
 data = tmap(i->calc(i), 1:10)
# OhMyThreads: Variant 2
-data = tcollect(calc(i) for i in 1:10)
+data = tcollect(calc(i) for i in 1:10)