diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 3d552c25..fa73bac4 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-01T19:36:19","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-02T10:33:03","documenter_version":"1.2.1"}} \ No newline at end of file diff --git a/dev/examples/juliaset/juliaset/index.html b/dev/examples/juliaset/juliaset/index.html index caac8b6d..f0142be9 100644 --- a/dev/examples/juliaset/juliaset/index.html +++ b/dev/examples/juliaset/juliaset/index.html @@ -52,4 +52,4 @@ 63.707 ms (39 allocations: 3.30 KiB)

As hoped, the parallel implementation is faster. But can we improve the performance further?

Tuning nchunks

As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks than can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.

@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;
  32.000 ms (12013 allocations: 1.14 MiB)
 

Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).

@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;
  63.439 ms (42 allocations: 3.37 KiB)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/dev/examples/mc/mc.jl b/dev/examples/mc/mc.jl index fe520fc2..d902bfd6 100644 --- a/dev/examples/mc/mc.jl +++ b/dev/examples/mc/mc.jl @@ -65,7 +65,7 @@ using Base.Threads: nthreads using OhMyThreads: @spawn -function mc_parallel_manual(N; nchunks=nthreads()) +function mc_parallel_manual(N; nchunks = nthreads()) tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready @spawn mc(length(idcs)) end @@ -78,3 +78,15 @@ mc_parallel_manual(N) # And this is the performance: @btime mc_parallel_manual($N) samples=10 evals=3; + +# It is faster than `mc_parallel` above because the task-local computation +# `mc(length(idcs))` is faster than the implicit task-local computation within +# `tmapreduce` (which itself is a `mapreduce`). + +idcs = first(chunks(1:N; n = nthreads())) + +@btime mapreduce($+, $idcs) do i + rand()^2 + rand()^2 < 1.0 +end samples=10 evals=3; + +@btime mc($(length(idcs))) samples=10 evals=3; diff --git a/dev/examples/mc/mc/index.html b/dev/examples/mc/mc/index.html index 87e111a0..3c0715e3 100644 --- a/dev/examples/mc/mc/index.html +++ b/dev/examples/mc/mc/index.html @@ -12,7 +12,7 @@ N = 100_000_000 -mc(N)
3.14169568

Parallelization with tmapreduce

To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and "throw one dart" per element.

using OhMyThreads
+mc(N)
3.141517

Parallelization with tmapreduce

To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and "throw one dart" per element.

using OhMyThreads
 
 function mc_parallel(N)
     M = tmapreduce(+, 1:N) do i
@@ -22,7 +22,7 @@
     return pi
 end
 
-mc_parallel(N)
3.14169096

Let's run a quick benchmark.

using BenchmarkTools
+mc_parallel(N)
3.14159924

Let's run a quick benchmark.

using BenchmarkTools
 using Base.Threads: nthreads
 
 @assert nthreads() > 1 # make sure we have multiple Julia threads
@@ -30,11 +30,11 @@
 
 @btime mc($N) samples=10 evals=3;
 @btime mc_parallel($N) samples=10 evals=3;
nthreads() = 5
-  317.421 ms (0 allocations: 0 bytes)
-  88.188 ms (37 allocations: 3.02 KiB)
+  318.467 ms (0 allocations: 0 bytes)
+  88.553 ms (37 allocations: 3.02 KiB)
 

Manual parallelization

First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for $\pi$.

using OhMyThreads: @spawn
 
-function mc_parallel_manual(N; nchunks=nthreads())
+function mc_parallel_manual(N; nchunks = nthreads())
     tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready
         @spawn mc(length(idcs))
     end
@@ -42,5 +42,13 @@
     return pi
 end
 
-mc_parallel_manual(N)
3.1414561999999995

And this is the performance:

@btime mc_parallel_manual($N) samples=10 evals=3;
  63.512 ms (31 allocations: 2.80 KiB)
-

This page was generated using Literate.jl.

+mc_parallel_manual(N)
3.1415844

And this is the performance:

@btime mc_parallel_manual($N) samples=10 evals=3;
  63.825 ms (31 allocations: 2.80 KiB)
+

It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).

idcs = first(chunks(1:N; n = nthreads()))
+
+@btime mapreduce($+, $idcs) do i
+    rand()^2 + rand()^2 < 1.0
+end samples=10 evals=3;
+
+@btime mc($(length(idcs))) samples=10 evals=3;
  87.617 ms (0 allocations: 0 bytes)
+  63.398 ms (0 allocations: 0 bytes)
+

This page was generated using Literate.jl.

diff --git a/dev/examples/tomarkdown.sh b/dev/examples/tomarkdown.sh index dad11c37..b6ee42d5 100755 --- a/dev/examples/tomarkdown.sh +++ b/dev/examples/tomarkdown.sh @@ -10,9 +10,16 @@ const repourl = "https://github.com/JuliaFolds2/OhMyThreads.jl/blob/main/docs" using Literate using Pkg -dirs = filter(isdir, readdir()) -if length(ARGS) > 0 - dirs = ARGS +if length(ARGS) == 0 + println("Error: Please provide the folder names of the examples you want to compile to markdown. " * + "Alternatively, you can pass \"all\" as the first argument to compile them all.") + exit() +else + if first(ARGS) == "all" + dirs = filter(isdir, readdir()) + else + dirs = ARGS + end end @show dirs diff --git a/dev/index.html b/dev/index.html index 69c65d55..adbd7be4 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -OhMyThreads · OhMyThreads.jl

OhMyThreads.jl

OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations via higher-order functions, with a focus on data parallelism without needing to expose julia's Task model to users.

Installation

The package is registered. Hence, you can simply use

] add OhMyThreads

to add the package to your Julia environment.

Noteworthy Alternatives

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

+OhMyThreads · OhMyThreads.jl

OhMyThreads.jl

OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations via higher-order functions, with a focus on data parallelism without needing to expose julia's Task model to users.

Installation

The package is registered. Hence, you can simply use

] add OhMyThreads

to add the package to your Julia environment.

Noteworthy Alternatives

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

diff --git a/dev/refs/api/index.html b/dev/refs/api/index.html index b0cd1cb3..4312d6b5 100644 --- a/dev/refs/api/index.html +++ b/dev/refs/api/index.html @@ -4,27 +4,27 @@ nchunks::Int = nthreads(), split::Symbol = :batch, schedule::Symbol =:dynamic, - outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of

 (√1 + √2) + (√3 + √4) + √5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
+           outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of

 (√1 + √2) + (√3 + √4) + √5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
         [init],
         nchunks::Int = nthreads(),
         split::Symbol = :batch,
         schedule::Symbol =:dynamic,
-        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of reduce, sum(A) is equivalent to reduce(+, A). Doing

 treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of

 (1 + 2) + (3 + 4) + 5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...; 
+        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of reduce, sum(A) is equivalent to reduce(+, A). Doing

 treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of

 (1 + 2) + (3 + 4) + 5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...; 
      nchunks::Int = nthreads(),
      split::Symbol = :batch,
-     schedule::Symbol =:dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel on multiple tasks. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
+     schedule::Symbol =:dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel on multiple tasks. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
       nchunks::Int = nthreads(),
       split::Symbol = :batch,
-      schedule::Symbol =:dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
+      schedule::Symbol =:dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
          nchunks::Int = nthreads(),
          split::Symbol = :batch,
          schedule::Symbol =:dynamic) :: Nothing

A multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing, i.e. it is the parallel equivalent of

for x in A
     f(x)
-end

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
+end

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
          nchunks::Int = nthreads(),
-         schedule::Symbol =:dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
+         schedule::Symbol =:dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
            [init],
            nchunks::Int = nthreads(),
            split::Symbol = :batch,
            schedule::Symbol =:dynamic,
-           outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing

 treducemap(+, √, [1, 2, 3, 4, 5])

is the parallelized version of

 (√1 + √2) + (√3 + √4) + √5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source

as well as the following re-exported functions:

chunkssee ChunkSplitters.jl

Non-Exported

OhMyThreads.@spawnsee StableTasks.jl
OhMyThreads.@spawnatsee StableTasks.jl
+ outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing

 treducemap(+, √, [1, 2, 3, 4, 5])

is the parallelized version of

 (√1 + √2) + (√3 + √4) + √5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source

as well as the following re-exported functions:

chunkssee ChunkSplitters.jl

Non-Exported

OhMyThreads.@spawnsee StableTasks.jl
OhMyThreads.@spawnatsee StableTasks.jl
diff --git a/dev/refs/internal/index.html b/dev/refs/internal/index.html index 3d0486a6..9cbfe512 100644 --- a/dev/refs/internal/index.html +++ b/dev/refs/internal/index.html @@ -1,2 +1,2 @@ -Internal · OhMyThreads.jl
+Internal · OhMyThreads.jl
diff --git a/dev/search_index.js b/dev/search_index.js index 53e83e1e..3222671a 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"examples/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. Hence, the computational workload is non-uniform.","category":"page"},{"location":"examples/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 138.377 ms (0 allocations: 0 bytes)\n 63.707 ms (39 allocations: 3.30 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is faster. But can we improve the performance further?","category":"page"},{"location":"examples/juliaset/juliaset/#Tuning-nchunks","page":"Julia Set","title":"Tuning nchunks","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks than can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 32.000 ms (12013 allocations: 1.14 MiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 63.439 ms (42 allocations: 3.37 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"examples/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at a unit square and count how many of them (M) landed inside of a unit circle. Approximate pi approx 4MN.","category":"page"},{"location":"examples/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14169568","category":"page"},{"location":"examples/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N)\n M = tmapreduce(+, 1:N) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nmc_parallel(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14169096","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 5\n 317.421 ms (0 allocations: 0 bytes)\n 88.188 ms (37 allocations: 3.02 KiB)\n","category":"page"},{"location":"examples/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn\n\nfunction mc_parallel_manual(N; nchunks=nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.1414561999999995","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 63.512 ms (31 allocations: 2.80 KiB)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"The following is internal, i.e. not public API, and might change at any point.","category":"page"},{"location":"refs/internal/#Index","page":"Internal","title":"Index","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Pages = [\"internal.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"Returns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. This identifier will be unique so long as references to the task it came from still exist. \n\n\n\n\n\n","category":"method"},{"location":"refs/api/#Public-API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Index","page":"Public API","title":"Index","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Pages = [\"api.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (√1 + √2) + (√3 + √4) + √5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of reduce, sum(A) is equivalent to reduce(+, A). Doing\n\n treduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (1 + 2) + (3 + 4) + 5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...; \n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel on multiple tasks. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing, i.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n nchunks::Int = nthreads(),\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing\n\n treducemap(+, √, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (√1 + √2) + (√3 + √4) + √5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"as well as the following re-exported functions:","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nchunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl","category":"page"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations via higher-order functions, with a focus on data parallelism without needing to expose julia's Task model to users.","category":"page"},{"location":"#Installation","page":"OhMyThreads","title":"Installation","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Noteworthy-Alternatives","page":"OhMyThreads","title":"Noteworthy Alternatives","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"ThreadsX.jl\nFolds.jl","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.","category":"page"}] +[{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"examples/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. Hence, the computational workload is non-uniform.","category":"page"},{"location":"examples/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 138.377 ms (0 allocations: 0 bytes)\n 63.707 ms (39 allocations: 3.30 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is faster. But can we improve the performance further?","category":"page"},{"location":"examples/juliaset/juliaset/#Tuning-nchunks","page":"Julia Set","title":"Tuning nchunks","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks than can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 32.000 ms (12013 allocations: 1.14 MiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 63.439 ms (42 allocations: 3.37 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"examples/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at a unit square and count how many of them (M) landed inside of a unit circle. Approximate pi approx 4MN.","category":"page"},{"location":"examples/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.141517","category":"page"},{"location":"examples/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N)\n M = tmapreduce(+, 1:N) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nmc_parallel(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14159924","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 5\n 318.467 ms (0 allocations: 0 bytes)\n 88.553 ms (37 allocations: 3.02 KiB)\n","category":"page"},{"location":"examples/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.1415844","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 63.825 ms (31 allocations: 2.80 KiB)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 87.617 ms (0 allocations: 0 bytes)\n 63.398 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"The following is internal, i.e. not public API, and might change at any point.","category":"page"},{"location":"refs/internal/#Index","page":"Internal","title":"Index","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Pages = [\"internal.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"Returns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. This identifier will be unique so long as references to the task it came from still exist. \n\n\n\n\n\n","category":"method"},{"location":"refs/api/#Public-API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Index","page":"Public API","title":"Index","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Pages = [\"api.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (√1 + √2) + (√3 + √4) + √5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of reduce, sum(A) is equivalent to reduce(+, A). Doing\n\n treduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (1 + 2) + (3 + 4) + 5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...; \n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel on multiple tasks. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing, i.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n nchunks::Int = nthreads(),\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing\n\n treducemap(+, √, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (√1 + √2) + (√3 + √4) + √5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"as well as the following re-exported functions:","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nchunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl","category":"page"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations via higher-order functions, with a focus on data parallelism without needing to expose julia's Task model to users.","category":"page"},{"location":"#Installation","page":"OhMyThreads","title":"Installation","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Noteworthy-Alternatives","page":"OhMyThreads","title":"Noteworthy Alternatives","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"ThreadsX.jl\nFolds.jl","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.","category":"page"}] }