From 2146403e340a3af8655974093bbcf8a55830a67d Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Fri, 2 Feb 2024 13:38:18 +0000 Subject: [PATCH] build based on b1e25f8 --- dev/.documenter-siteinfo.json | 2 +- dev/examples/juliaset/juliaset/index.html | 4 +- dev/examples/mc/mc/index.html | 4 +- dev/index.html | 4 +- dev/refs/api/index.html | 16 ++++---- dev/refs/internal/index.html | 2 +- dev/search_index.js | 2 +- dev/translation/index.html | 46 +++++++++++++++++++++++ 8 files changed, 63 insertions(+), 17 deletions(-) create mode 100644 dev/translation/index.html diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 41d8f1d5..70f942e7 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-02T12:29:18","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-02-02T13:38:15","documenter_version":"1.2.1"}} \ No newline at end of file diff --git a/dev/examples/juliaset/juliaset/index.html b/dev/examples/juliaset/juliaset/index.html index e837f182..cc02e1a3 100644 --- a/dev/examples/juliaset/juliaset/index.html +++ b/dev/examples/juliaset/juliaset/index.html @@ -1,5 +1,5 @@ -Julia Set · OhMyThreads.jl

Julia Set

In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.

The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.

function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)
+Julia Set · OhMyThreads.jl

Julia Set

In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.

The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.

function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)
     x = -2.0 + (j - 1) * 4.0 / (n - 1)
     y = -2.0 + (i - 1) * 4.0 / (n - 1)
 
@@ -52,4 +52,4 @@
   63.707 ms (39 allocations: 3.30 KiB)
 

As hoped, the parallel implementation is faster. But can we improve the performance further?

Tuning nchunks

As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks than can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.

@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;
  32.000 ms (12013 allocations: 1.14 MiB)
 

Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).

@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;
  63.439 ms (42 allocations: 3.37 KiB)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/dev/examples/mc/mc/index.html b/dev/examples/mc/mc/index.html index 0e6c72c6..f38c3770 100644 --- a/dev/examples/mc/mc/index.html +++ b/dev/examples/mc/mc/index.html @@ -1,5 +1,5 @@ -Parallel Monte Carlo · OhMyThreads.jl

Parallel Monte Carlo

Calculate the value of $\pi$ through parallel direct Monte Carlo.

A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is $\pi$, the area of the square is 4, and the ratio is $\pi/4$. This means that, if you throw $N$ darts randomly at the square, approximately $M=N\pi/4$ of those darts will land inside the unit circle.

Throw darts randomly at a unit square and count how many of them ($M$) landed inside of a unit circle. Approximate $\pi \approx 4M/N$.

Sequential implementation:

function mc(N)
+Parallel Monte Carlo · OhMyThreads.jl

Parallel Monte Carlo

Calculate the value of $\pi$ through parallel direct Monte Carlo.

A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is $\pi$, the area of the square is 4, and the ratio is $\pi/4$. This means that, if you throw $N$ darts randomly at the square, approximately $M=N\pi/4$ of those darts will land inside the unit circle.

Throw darts randomly at a unit square and count how many of them ($M$) landed inside of a unit circle. Approximate $\pi \approx 4M/N$.

Sequential implementation:

function mc(N)
     M = 0 # number of darts that landed in the circle
     for i in 1:N
         if rand()^2 + rand()^2 < 1.0
@@ -51,4 +51,4 @@
 
 @btime mc($(length(idcs))) samples=10 evals=3;
  87.617 ms (0 allocations: 0 bytes)
   63.398 ms (0 allocations: 0 bytes)
-

This page was generated using Literate.jl.

+

This page was generated using Literate.jl.

diff --git a/dev/index.html b/dev/index.html index b7f67133..2b71bf2d 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,5 +1,5 @@ -OhMyThreads · OhMyThreads.jl

OhMyThreads.jl

OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.

Quick Start

The package is registered. Hence, you can simply use

] add OhMyThreads

to add the package to your Julia environment.

Basic example

using OhMyThreads
+OhMyThreads · OhMyThreads.jl

OhMyThreads.jl

OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.

Quick Start

The package is registered. Hence, you can simply use

] add OhMyThreads

to add the package to your Julia environment.

Basic example

using OhMyThreads
 
 function mc_parallel(N; kw...)
     M = tmapreduce(+, 1:N; kw...) do i
@@ -18,4 +18,4 @@
 
 @btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread
 @btime mc_parallel($N)            # running with all 5 Julia threads

Timings might be something like this:

438.394 ms (7 allocations: 624 bytes)
-88.050 ms (37 allocations: 3.02 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

+88.050 ms (37 allocations: 3.02 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.

diff --git a/dev/refs/api/index.html b/dev/refs/api/index.html index 41843d87..2e9ac863 100644 --- a/dev/refs/api/index.html +++ b/dev/refs/api/index.html @@ -1,30 +1,30 @@ -Public API · OhMyThreads.jl

Public API

Index

Exported

OhMyThreads.tmapreduceFunction
tmapreduce(f, op, A::AbstractArray...;
+Public API · OhMyThreads.jl

Public API

Index

Exported

OhMyThreads.tmapreduceFunction
tmapreduce(f, op, A::AbstractArray...;
            [init],
            nchunks::Int = nthreads(),
            split::Symbol = :batch,
            schedule::Symbol =:dynamic,
-           outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of

 (√1 + √2) + (√3 + √4) + √5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
+           outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of

 (√1 + √2) + (√3 + √4) + √5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.treduceFunction
treduce(op, A::AbstractArray...;
         [init],
         nchunks::Int = nthreads(),
         split::Symbol = :batch,
         schedule::Symbol =:dynamic,
-        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of reduce, sum(A) is equivalent to reduce(+, A). Doing

 treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of

 (1 + 2) + (3 + 4) + 5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...; 
+        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of reduce, sum(A) is equivalent to reduce(+, A). Doing

 treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of

 (1 + 2) + (3 + 4) + 5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.tmapFunction
tmap(f, [OutputElementType], A::AbstractArray...; 
      nchunks::Int = nthreads(),
      split::Symbol = :batch,
-     schedule::Symbol =:dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel on multiple tasks. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
+     schedule::Symbol =:dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel on multiple tasks. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tmap!Function
tmap!(f, out, A::AbstractArray...;
       nchunks::Int = nthreads(),
       split::Symbol = :batch,
-      schedule::Symbol =:dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
+      schedule::Symbol =:dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tforeachFunction
tforeach(f, A::AbstractArray...;
          nchunks::Int = nthreads(),
          split::Symbol = :batch,
          schedule::Symbol =:dynamic) :: Nothing

A multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing, i.e. it is the parallel equivalent of

for x in A
     f(x)
-end

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
+end

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tcollectFunction
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
          nchunks::Int = nthreads(),
-         schedule::Symbol =:dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
+         schedule::Symbol =:dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.treducemapFunction
treducemap(op, f, A::AbstractArray...;
            [init],
            nchunks::Int = nthreads(),
            split::Symbol = :batch,
            schedule::Symbol =:dynamic,
-           outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing

 treducemap(+, √, [1, 2, 3, 4, 5])

is the parallelized version of

 (√1 + √2) + (√3 + √4) + √5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source

as well as the following re-exported functions:

chunkssee ChunkSplitters.jl

Non-Exported

OhMyThreads.@spawnsee StableTasks.jl
OhMyThreads.@spawnatsee StableTasks.jl
+ outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing

 treducemap(+, √, [1, 2, 3, 4, 5])

is the parallelized version of

 (√1 + √2) + (√3 + √4) + √5

This data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only

needed if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source

as well as the following re-exported functions:

chunkssee ChunkSplitters.jl

Non-Exported

OhMyThreads.@spawnsee StableTasks.jl
OhMyThreads.@spawnatsee StableTasks.jl
diff --git a/dev/refs/internal/index.html b/dev/refs/internal/index.html index 929d7879..638b51af 100644 --- a/dev/refs/internal/index.html +++ b/dev/refs/internal/index.html @@ -1,2 +1,2 @@ -Internal · OhMyThreads.jl
+Internal · OhMyThreads.jl
diff --git a/dev/search_index.js b/dev/search_index.js index c5c73e4c..75cf80b2 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"examples/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. Hence, the computational workload is non-uniform.","category":"page"},{"location":"examples/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 138.377 ms (0 allocations: 0 bytes)\n 63.707 ms (39 allocations: 3.30 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is faster. But can we improve the performance further?","category":"page"},{"location":"examples/juliaset/juliaset/#Tuning-nchunks","page":"Julia Set","title":"Tuning nchunks","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks than can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 32.000 ms (12013 allocations: 1.14 MiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 63.439 ms (42 allocations: 3.37 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"examples/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at a unit square and count how many of them (M) landed inside of a unit circle. Approximate pi approx 4MN.","category":"page"},{"location":"examples/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.141517","category":"page"},{"location":"examples/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N)\n M = tmapreduce(+, 1:N) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nmc_parallel(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14159924","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 5\n 318.467 ms (0 allocations: 0 bytes)\n 88.553 ms (37 allocations: 3.02 KiB)\n","category":"page"},{"location":"examples/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.1415844","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 63.825 ms (31 allocations: 2.80 KiB)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 87.617 ms (0 allocations: 0 bytes)\n 63.398 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"The following is internal, i.e. not public API, and might change at any point.","category":"page"},{"location":"refs/internal/#Index","page":"Internal","title":"Index","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Pages = [\"internal.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"Returns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. This identifier will be unique so long as references to the task it came from still exist. \n\n\n\n\n\n","category":"method"},{"location":"refs/api/#Public-API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Index","page":"Public API","title":"Index","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Pages = [\"api.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (√1 + √2) + (√3 + √4) + √5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of reduce, sum(A) is equivalent to reduce(+, A). Doing\n\n treduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (1 + 2) + (3 + 4) + 5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...; \n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel on multiple tasks. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing, i.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n nchunks::Int = nthreads(),\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing\n\n treducemap(+, √, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (√1 + √2) + (√3 + √4) + √5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"as well as the following re-exported functions:","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nchunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl","category":"page"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.","category":"page"},{"location":"#Quick-Start","page":"OhMyThreads","title":"Quick Start","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Basic-example","page":"OhMyThreads","title":"Basic example","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"using OhMyThreads\n\nfunction mc_parallel(N; kw...)\n M = tmapreduce(+, 1:N; kw...) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\nmc_parallel(N) # gives, e.g., 3.14159924\n\nusing BenchmarkTools\n\n@show Threads.nthreads() # 5 in this example\n\n@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread\n@btime mc_parallel($N) # running with all 5 Julia threads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Timings might be something like this:","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"438.394 ms (7 allocations: 624 bytes)\n88.050 ms (37 allocations: 3.02 KiB)","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"(Check out the full Parallel Monte Carlo example if you like.)","category":"page"},{"location":"#No-Transducers","page":"OhMyThreads","title":"No Transducers","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.","category":"page"}] +[{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"EditURL = \"juliaset.jl\"","category":"page"},{"location":"examples/juliaset/juliaset/#Julia-Set","page":"Julia Set","title":"Julia Set","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The value of a single pixel of the Julia set, which corresponds to a point in the complex number plane, can be computed by the following iteration procedure.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)\n x = -2.0 + (j - 1) * 4.0 / (n - 1)\n y = -2.0 + (i - 1) * 4.0 / (n - 1)\n\n z = x + y * im\n iter = max_iter\n for k in 1:max_iter\n if abs2(z) > 4.0\n iter = k - 1\n break\n end\n z = z^2 + c\n end\n return iter\nend","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"_compute_pixel (generic function with 1 method)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that the value of the pixel is the number of performed iterations for the corresponding complex input number. Hence, the computational workload is non-uniform.","category":"page"},{"location":"examples/juliaset/juliaset/#Sequential-computation","page":"Julia Set","title":"Sequential computation","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"function compute_juliaset_sequential!(img)\n N = size(img, 1)\n for j in 1:N\n for i in 1:N\n img[i, j] = _compute_pixel(i, j, N)\n end\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_sequential!(img);","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's look at the result","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using Plots\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Parallelization","page":"Julia Set","title":"Parallelization","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"The Julia set computation above is a map! operation: We apply some function to each element of the array. Hence, we can use tmap! for parallelization. We use CartesianIndices to map between linear and two-dimensional cartesian indices.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using OhMyThreads: tmap!\n\nfunction compute_juliaset_parallel!(img; kwargs...)\n N = size(img, 1)\n cart = CartesianIndices(img)\n tmap!(img, eachindex(img); kwargs...) do idx\n c = cart[idx]\n _compute_pixel(c[1], c[2], N)\n end\n return img\nend\n\nN = 2000\nimg = zeros(Int, N, N)\ncompute_juliaset_parallel!(img);\np = heatmap(img)","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"(Image: )","category":"page"},{"location":"examples/juliaset/juliaset/#Benchmark","page":"Julia Set","title":"Benchmark","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Let's benchmark the variants above.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\nN = 2000\nimg = zeros(Int, N, N)\n\n@btime compute_juliaset_sequential!($img) samples=10 evals=3;\n@btime compute_juliaset_parallel!($img) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 138.377 ms (0 allocations: 0 bytes)\n 63.707 ms (39 allocations: 3.30 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As hoped, the parallel implementation is faster. But can we improve the performance further?","category":"page"},{"location":"examples/juliaset/juliaset/#Tuning-nchunks","page":"Julia Set","title":"Tuning nchunks","text":"","category":"section"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into smaller tasks than can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 32.000 ms (12013 allocations: 1.14 MiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":" 63.439 ms (42 allocations: 3.37 KiB)\n","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"","category":"page"},{"location":"examples/juliaset/juliaset/","page":"Julia Set","title":"Julia Set","text":"This page was generated using Literate.jl.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"EditURL = \"mc.jl\"","category":"page"},{"location":"examples/mc/mc/#Parallel-Monte-Carlo","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Calculate the value of pi through parallel direct Monte Carlo.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is pi, the area of the square is 4, and the ratio is pi4. This means that, if you throw N darts randomly at the square, approximately M=Npi4 of those darts will land inside the unit circle.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Throw darts randomly at a unit square and count how many of them (M) landed inside of a unit circle. Approximate pi approx 4MN.","category":"page"},{"location":"examples/mc/mc/#Sequential-implementation:","page":"Parallel Monte Carlo","title":"Sequential implementation:","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"function mc(N)\n M = 0 # number of darts that landed in the circle\n for i in 1:N\n if rand()^2 + rand()^2 < 1.0\n M += 1\n end\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\n\nmc(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.141517","category":"page"},{"location":"examples/mc/mc/#Parallelization-with-tmapreduce","page":"Parallel Monte Carlo","title":"Parallelization with tmapreduce","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and \"throw one dart\" per element.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads\n\nfunction mc_parallel(N)\n M = tmapreduce(+, 1:N) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nmc_parallel(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.14159924","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"Let's run a quick benchmark.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using BenchmarkTools\nusing Base.Threads: nthreads\n\n@assert nthreads() > 1 # make sure we have multiple Julia threads\n@show nthreads() # print out the number of threads\n\n@btime mc($N) samples=10 evals=3;\n@btime mc_parallel($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"nthreads() = 5\n 318.467 ms (0 allocations: 0 bytes)\n 88.553 ms (37 allocations: 3.02 KiB)\n","category":"page"},{"location":"examples/mc/mc/#Manual-parallelization","page":"Parallel Monte Carlo","title":"Manual parallelization","text":"","category":"section"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"First, using the chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for pi.","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"using OhMyThreads: @spawn\n\nfunction mc_parallel_manual(N; nchunks = nthreads())\n tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready\n @spawn mc(length(idcs))\n end\n pi = sum(fetch, tasks) / nchunks\n return pi\nend\n\nmc_parallel_manual(N)","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"3.1415844","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"And this is the performance:","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"@btime mc_parallel_manual($N) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 63.825 ms (31 allocations: 2.80 KiB)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce).","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"idcs = first(chunks(1:N; n = nthreads()))\n\n@btime mapreduce($+, $idcs) do i\n rand()^2 + rand()^2 < 1.0\nend samples=10 evals=3;\n\n@btime mc($(length(idcs))) samples=10 evals=3;","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":" 87.617 ms (0 allocations: 0 bytes)\n 63.398 ms (0 allocations: 0 bytes)\n","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"","category":"page"},{"location":"examples/mc/mc/","page":"Parallel Monte Carlo","title":"Parallel Monte Carlo","text":"This page was generated using Literate.jl.","category":"page"},{"location":"refs/internal/#Internal","page":"Internal","title":"Internal","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"The following is internal, i.e. not public API, and might change at any point.","category":"page"},{"location":"refs/internal/#Index","page":"Internal","title":"Index","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Pages = [\"internal.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/internal/#References","page":"Internal","title":"References","text":"","category":"section"},{"location":"refs/internal/","page":"Internal","title":"Internal","text":"Modules = [OhMyThreads, OhMyThreads.Tools]\nPublic = false\nPages = [\"OhMyThreads.jl\", \"tools.jl\"]","category":"page"},{"location":"refs/internal/#OhMyThreads.Tools.nthtid-Tuple{Any}","page":"Internal","title":"OhMyThreads.Tools.nthtid","text":"Returns the thread id of the nth Julia thread in the :default threadpool.\n\n\n\n\n\n","category":"method"},{"location":"refs/internal/#OhMyThreads.Tools.taskid-Tuple{}","page":"Internal","title":"OhMyThreads.Tools.taskid","text":"taskid() :: UInt\n\nReturn a UInt identifier for the current running Task. This identifier will be unique so long as references to the task it came from still exist. \n\n\n\n\n\n","category":"method"},{"location":"refs/api/#API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"refs/api/#Index","page":"Public API","title":"Index","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"Pages = [\"api.md\"]\nOrder = [:function, :macro]","category":"page"},{"location":"refs/api/#Exported","page":"Public API","title":"Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"tmapreduce\ntreduce\ntmap\ntmap!\ntforeach\ntcollect\ntreducemap","category":"page"},{"location":"refs/api/#OhMyThreads.tmapreduce","page":"Public API","title":"OhMyThreads.tmapreduce","text":"tmapreduce(f, op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing\n\n tmapreduce(√, +, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (√1 + √2) + (√3 + √4) + √5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treduce","page":"Public API","title":"OhMyThreads.treduce","text":"treduce(op, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nA multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of reduce, sum(A) is equivalent to reduce(+, A). Doing\n\n treduce(+, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (1 + 2) + (3 + 4) + 5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap","page":"Public API","title":"OhMyThreads.tmap","text":"tmap(f, [OutputElementType], A::AbstractArray...; \n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel on multiple tasks. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tmap!","page":"Public API","title":"OhMyThreads.tmap!","text":"tmap!(f, out, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tforeach","page":"Public API","title":"OhMyThreads.tforeach","text":"tforeach(f, A::AbstractArray...;\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic) :: Nothing\n\nA multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing, i.e. it is the parallel equivalent of\n\nfor x in A\n f(x)\nend\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.tcollect","page":"Public API","title":"OhMyThreads.tcollect","text":"tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};\n nchunks::Int = nthreads(),\n schedule::Symbol =:dynamic)\n\nA multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs. The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.\n\nKeyword arguments:\n\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/#OhMyThreads.treducemap","page":"Public API","title":"OhMyThreads.treducemap","text":"treducemap(op, f, A::AbstractArray...;\n [init],\n nchunks::Int = nthreads(),\n split::Symbol = :batch,\n schedule::Symbol =:dynamic,\n outputtype::Type = Any)\n\nLike tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op. op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.\n\nFor a very well known example of mapreduce, sum(f, A) is equivalent to mapreduce(f, +, A). Doing\n\n treducemap(+, √, [1, 2, 3, 4, 5])\n\nis the parallelized version of\n\n (√1 + √2) + (√3 + √4) + √5\n\nThis data is divided into chunks to be worked on in parallel using ChunkSplitters.jl.\n\nKeyword arguments:\n\ninit optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.\nnchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.\nsplit::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!\nschedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of\n:dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.\n:static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.\n:greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.\n:interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.\noutputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only\n\nneeded if you are using a :static schedule, since the :dynamic schedule is uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.\n\n\n\n\n\n","category":"function"},{"location":"refs/api/","page":"Public API","title":"Public API","text":"as well as the following re-exported functions:","category":"page"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nchunks see ChunkSplitters.jl","category":"page"},{"location":"refs/api/#Non-Exported","page":"Public API","title":"Non-Exported","text":"","category":"section"},{"location":"refs/api/","page":"Public API","title":"Public API","text":" \nOhMyThreads.@spawn see StableTasks.jl\nOhMyThreads.@spawnat see StableTasks.jl","category":"page"},{"location":"translation/#Translation-Guide","page":"Translation Guide","title":"Translation Guide","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"This page tries to give a general overview of how to translate patterns written with the built-in tools of Base.Threads using the OhMyThreads.jl API. Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into OhMyThreads.jl.","category":"page"},{"location":"translation/#Basics","page":"Translation Guide","title":"Basics","text":"","category":"section"},{"location":"translation/#@threads","page":"Translation Guide","title":"@threads","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10) do i\n println(i)\nend","category":"page"},{"location":"translation/#:static-scheduling","page":"Translation Guide","title":":static scheduling","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@threads :static for i in 1:10\n println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; schedule=:static) do i\n println(i)\nend","category":"page"},{"location":"translation/#@spawn","page":"Translation Guide","title":"@spawn","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\n@sync for i in 1:10\n @spawn println(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ntforeach(1:10; nchunks=10) do i\n println(i)\nend","category":"page"},{"location":"translation/#Reduction","page":"Translation Guide","title":"Reduction","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"No built-in feature in Base.Threads.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads: basic manual implementation\ndata = rand(10)\nchunks_itr = Iterators.partition(data, length(data) ÷ nthreads())\ntasks = map(chunks_itr) do chunk\n @spawn reduce(+, chunk)\nend\nreduce(+, fetch.(tasks))","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads\ndata = rand(10)\ntreduce(+, data)","category":"page"},{"location":"translation/#Mutation","page":"Translation Guide","title":"Mutation","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"warning: Warning\nParallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). You should carefully consider whether this is necessary or whether the use of task-local storage is the better option.","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = rand(10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = rand(10)\ntforeach(data) do i\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = rand(10)\ntmap!(data, data) do i # this kind of aliasing is fine\n calc(i)\nend","category":"page"},{"location":"translation/#Parallel-initialization","page":"Translation Guide","title":"Parallel initialization","text":"","category":"section"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# Base.Threads\ndata = Vector{Float64}(undef, 10)\n@threads for i in 1:10\n data[i] = calc(i)\nend","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 1\ndata = tmap(i->calc(i), 1:10)","category":"page"},{"location":"translation/","page":"Translation Guide","title":"Translation Guide","text":"# OhMyThreads: Variant 2\ndata = tcollect(calc(i) for i in 1:10)","category":"page"},{"location":"#OhMyThreads.jl","page":"OhMyThreads","title":"OhMyThreads.jl","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.","category":"page"},{"location":"#Quick-Start","page":"OhMyThreads","title":"Quick Start","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The package is registered. Hence, you can simply use","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"] add OhMyThreads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"to add the package to your Julia environment.","category":"page"},{"location":"#Basic-example","page":"OhMyThreads","title":"Basic example","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"using OhMyThreads\n\nfunction mc_parallel(N; kw...)\n M = tmapreduce(+, 1:N; kw...) do i\n rand()^2 + rand()^2 < 1.0\n end\n pi = 4 * M / N\n return pi\nend\n\nN = 100_000_000\nmc_parallel(N) # gives, e.g., 3.14159924\n\nusing BenchmarkTools\n\n@show Threads.nthreads() # 5 in this example\n\n@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread\n@btime mc_parallel($N) # running with all 5 Julia threads","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Timings might be something like this:","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"438.394 ms (7 allocations: 624 bytes)\n88.050 ms (37 allocations: 3.02 KiB)","category":"page"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"(Check out the full Parallel Monte Carlo example if you like.)","category":"page"},{"location":"#No-Transducers","page":"OhMyThreads","title":"No Transducers","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built off of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., ThreadsX.jl or Folds.jl.","category":"page"},{"location":"#Acknowledgements","page":"OhMyThreads","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"OhMyThreads","title":"OhMyThreads","text":"The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.","category":"page"}] } diff --git a/dev/translation/index.html b/dev/translation/index.html new file mode 100644 index 00000000..d2d2bf17 --- /dev/null +++ b/dev/translation/index.html @@ -0,0 +1,46 @@ + +Translation Guide · OhMyThreads.jl

Translation Guide

This page tries to give a general overview of how to translate patterns written with the built-in tools of Base.Threads using the OhMyThreads.jl API. Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into OhMyThreads.jl.

Basics

@threads

# Base.Threads
+@threads for i in 1:10
+    println(i)
+end
# OhMyThreads
+tforeach(1:10) do i
+    println(i)
+end

:static scheduling

# Base.Threads
+@threads :static for i in 1:10
+    println(i)
+end
# OhMyThreads
+tforeach(1:10; schedule=:static) do i
+    println(i)
+end

@spawn

# Base.Threads
+@sync for i in 1:10
+    @spawn println(i)
+end
# OhMyThreads
+tforeach(1:10; nchunks=10) do i
+    println(i)
+end

Reduction

No built-in feature in Base.Threads.

# Base.Threads: basic manual implementation
+data = rand(10)
+chunks_itr = Iterators.partition(data, length(data) ÷ nthreads())
+tasks = map(chunks_itr) do chunk
+    @spawn reduce(+, chunk)
+end
+reduce(+, fetch.(tasks))
# OhMyThreads
+data = rand(10)
+treduce(+, data)

Mutation

Warning

Parallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. false sharing). You should carefully consider whether this is necessary or whether the use of task-local storage is the better option.

# Base.Threads
+data = rand(10)
+@threads for i in 1:10
+    data[i] = calc(i)
+end
# OhMyThreads: Variant 1
+data = rand(10)
+tforeach(data) do i
+    data[i] = calc(i)
+end
# OhMyThreads: Variant 2
+data = rand(10)
+tmap!(data, data) do i # this kind of aliasing is fine
+    calc(i)
+end

Parallel initialization

# Base.Threads
+data = Vector{Float64}(undef, 10)
+@threads for i in 1:10
+    data[i] = calc(i)
+end
# OhMyThreads: Variant 1
+data = tmap(i->calc(i), 1:10)
# OhMyThreads: Variant 2
+data = tcollect(calc(i) for i in 1:10)