
Trapezoidal Integration

In this example, we want to parallelize the computation of a simple numerical integral via the trapezoidal rule. The latter is given by

\[\int_{a}^{b}f(x)\,dx \approx h \sum_{i=1}^{N}\frac{f(x_{i-1})+f(x_{i})}{2}.\]

The function to be integrated is the following.

f(x) = 4 * √(1 - x^2)
f (generic function with 1 method)

The analytic result of the definite integral (from 0 to 1) is known to be $\pi$.

Sequential

Naturally, we implement the trapezoidal rule as a straightforward, sequential for loop.

function trapezoidal(a, b, n; h = (b - a) / n)
    y = (f(a) + f(b)) / 2.0
    for i in 1:(n - 1)
        x = a + i * h
        y += f(x)
    end
    return h * y
end
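The parallel version benchmarked below is elided by the diff. As a sketch of what trapezoidal_parallel could look like (an illustrative reconstruction, not necessarily the exact code of the original example), we can split the domain into one subdomain per thread and reuse the sequential kernel:

using OhMyThreads: tmapreduce
using Base.Threads: nthreads

function trapezoidal_parallel(a, b, N)
    n = N ÷ nthreads()       # subintervals per task; assumes nthreads() divides N
    h = (b - a) / N
    return tmapreduce(+, 1:nthreads()) do i
        local α = a + (i - 1) * n * h  # left endpoint of this task's subdomain
        local β = α + n * h            # right endpoint
        trapezoidal(α, β, n; h)        # sequential rule on the local subdomain
    end
end

Adjacent subdomains share their boundary points, and each shared point receives two half-weight contributions, so the partial results sum to the full trapezoidal rule. With N subintervals (defined in the elided setup), the benchmarks give: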
@btime trapezoidal(0, 1, $N);
@btime trapezoidal_parallel(0, 1, $N);
  13.871 ms (0 allocations: 0 bytes)
  2.781 ms (38 allocations: 3.19 KiB)

Because the problem is trivially parallel - all threads do the same thing and don't need to communicate - we expect an ideal speedup of (close to) the number of available threads.


This page was generated using Literate.jl.


Julia Set

In this example, we will compute an image of the Julia set in parallel. We will explore the schedule and nchunks options that can be used to get load balancing.

The value of a single pixel of the Julia set, which corresponds to a point in the complex plane, can be computed with the following iterative procedure.

function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)
    x = -2.0 + (j - 1) * 4.0 / (n - 1)
    y = -2.0 + (i - 1) * 4.0 / (n - 1)

    z = x + y * im
    iter = 0
    while abs2(z) ≤ 4.0 && iter < max_iter
        z = z^2 + c
        iter += 1
    end
    return iter
end
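The parallel routine benchmarked below is elided by the diff. A minimal sketch, assuming img is a square matrix and forwarding the scheduling keywords explored later to tforeach (the elided original may differ), could read:

using OhMyThreads: tforeach

function compute_juliaset_parallel!(img; kwargs...)
    N = size(img, 1)
    tforeach(1:N; kwargs...) do j  # parallelize over columns
        for i in 1:N
            img[i, j] = _compute_pixel(i, j, N)
        end
    end
    return img
end

The timing below is the original example's benchmark of its parallel version: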
   63.707 ms (39 allocations: 3.30 KiB)
 

As hoped, the parallel implementation is faster. But can we improve the performance further?

Tuning nchunks

As stated above, the per-pixel computation is non-uniform. Hence, we might benefit from load balancing. The simplest way to get it is to increase nchunks to a value larger than nthreads. This divides the overall workload into more, smaller tasks that can be dynamically distributed among threads (by Julia's scheduler) to balance the per-thread load.

@btime compute_juliaset_parallel!($img; schedule=:dynamic, nchunks=N) samples=10 evals=3;
  32.000 ms (12013 allocations: 1.14 MiB)
 

Note that if we opt out of dynamic scheduling and set schedule=:static, this strategy doesn't help anymore (because chunks are naively distributed up front).

@btime compute_juliaset_parallel!($img; schedule=:static, nchunks=N) samples=10 evals=3;
  63.439 ms (42 allocations: 3.37 KiB)

This page was generated using Literate.jl.


Parallel Monte Carlo

Calculate the value of $\pi$ through parallel direct Monte Carlo.

A unit circle is inscribed inside a square with side length 2 (spanning -1 to 1 in each direction). The area of the circle is $\pi$, the area of the square is 4, and their ratio is $\pi/4$. This means that, if you throw $N$ darts randomly at the square, approximately $M = N\pi/4$ of those darts will land inside the unit circle.

Throw darts randomly at the square, count how many of them ($M$) land inside the unit circle, and approximate $\pi \approx 4M/N$.

Sequential implementation:

function mc(N)
    M = 0 # number of darts that landed in the circle
    for i in 1:N
        if rand()^2 + rand()^2 < 1.0
            M += 1
        end
    end
    pi = 4 * M / N
    return pi
end
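The parallel version is elided by the diff. A sketch of what mc_parallel might look like, mirroring the tmapreduce-based variant from the package's front page (the exact elided code may differ, and the idcs referenced in the benchmark below comes from an elided chunking setup):

using OhMyThreads: tmapreduce

function mc_parallel(N)
    M = tmapreduce(+, 1:N) do i
        rand()^2 + rand()^2 < 1.0  # Bool contributes 0 or 1 to the + reduction
    end
    return 4 * M / N
end

The two timings below compare the sequential and the parallel version: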
 
@btime mc($(length(idcs))) samples=10 evals=3;
  87.617 ms (0 allocations: 0 bytes)
   63.398 ms (0 allocations: 0 bytes)

This page was generated using Literate.jl.


Task-Local Storage

For some programs, it can be useful or even necessary to allocate and (re-)use memory in your parallel code. The following section uses a simple example to explain how task-local values can be efficiently created and (re-)used.

Sequential

Let's say that we are given two arrays of (square) matrices, As and Bs, and let's further assume that our goal is to compute the total sum of all pairwise matrix products. We can readily implement a (sequential) function that performs the necessary computations.

using LinearAlgebra: mul!, BLAS
BLAS.set_num_threads(1) # for simplicity, we turn off OpenBLAS multithreading

function matmulsums(As, Bs)
    N = size(first(As), 1)
    C = Matrix{Float64}(undef, N, N) # a single, reused output buffer
    map(As, Bs) do A, B
        mul!(C, A, B) # in-place matrix multiplication: C = A * B
        sum(C)
    end
end
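The parallel variants benchmarked below (naive, manual, and task-local) are elided by the diff. A sketch of the recommended matmulsums_tls using TaskLocalValue, assuming square Float64 matrices (the exact elided code may differ):

using OhMyThreads: tmap, TaskLocalValue
using LinearAlgebra: mul!

function matmulsums_tls(As, Bs)
    N = size(first(As), 1)
    # One buffer per task, created lazily the first time a task touches it.
    tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
    tmap(As, Bs) do A, B
        C = tls[]     # fetch (or create) this task's private buffer
        mul!(C, A, B)
        sum(C)
    end
end

Each parallel task gets its own buffer, so there is no write race, and the number of allocated buffers scales with the number of tasks rather than with length(As). The timings: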
   603.631 ms (174 allocations: 512.01 MiB)
   578.180 ms (67 allocations: 40.01 MiB)
   578.769 ms (50 allocations: 40.01 MiB)

As we can see, the recommended version matmulsums_tls is both convenient and efficient: it allocates much less memory than matmulsums_naive (5 instead of 64 buffers of 8 MiB each) and is very much comparable to the manual implementation.


This page was generated using Literate.jl.


OhMyThreads.jl

OhMyThreads.jl is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a focus on data parallelism, that can be used without having to worry much about manual Task creation.

Quick Start

The package is registered. Hence, you can simply use

] add OhMyThreads

to add the package to your Julia environment.

Basic example

using OhMyThreads

function mc_parallel(N; kw...)
    M = tmapreduce(+, 1:N; kw...) do i
        rand()^2 + rand()^2 < 1.0
    end
    pi = 4 * M / N
    return pi
end
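The @btime macro below comes from BenchmarkTools.jl, and the timings assume Julia was started with five threads (the definition of N is elided by the diff). A quick sanity check:

using BenchmarkTools  # provides @btime
using Base.Threads: nthreads
nthreads()  # returns 5 when Julia was started with, e.g., `julia --threads 5`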
 
@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread
@btime mc_parallel($N)            # running with all 5 Julia threads

Timings might be something like this:

438.394 ms (7 allocations: 624 bytes)
 88.050 ms (37 allocations: 3.02 KiB)

(Check out the full Parallel Monte Carlo example if you like.)

No Transducers

Unlike most JuliaFolds2 packages, OhMyThreads.jl is not built on top of Transducers.jl, nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages such as ThreadsX.jl or Folds.jl.

Acknowledgements

The idea for this package came from Carsten Bauer and Mason Protter. Check out the list of contributors for more information.


Public API

Index

Exported

OhMyThreads.tmapreduce (Function)
tmapreduce(f, op, A::AbstractArray...;
            [init],
            nchunks::Int = nthreads(),
            split::Symbol = :batch,
           schedule::Symbol = :dynamic,
           outputtype::Type = Any)

A multithreaded function like Base.mapreduce. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmapreduce.

Example:

 tmapreduce(√, +, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
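A usage sketch of the keyword arguments described above (the concrete values are illustrative, not recommendations from the docstring):

using OhMyThreads: tmapreduce
using Base.Threads: nthreads

# More chunks than threads helps load balancing; an explicit init is
# forwarded to the sequential parts of the reduction.
tmapreduce(√, +, 1:1000; init = 0.0, nchunks = 4 * nthreads(), schedule = :dynamic)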
OhMyThreads.treduce (Function)
treduce(op, A::AbstractArray...;
         [init],
         nchunks::Int = nthreads(),
         split::Symbol = :batch,
        schedule::Symbol = :dynamic,
        outputtype::Type = Any)

A multithreaded function like Base.reduce. Perform a reduction over A using the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treduce.

Example:

    treduce(+, [1, 2, 3, 4, 5])

is the parallelized version of sum([1, 2, 3, 4, 5]) in the form

    (1 + 2) + (3 + 4) + 5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source
OhMyThreads.tmap (Function)
tmap(f, [OutputElementType], A::AbstractArray...;
      nchunks::Int = nthreads(),
      split::Symbol = :batch,
     schedule::Symbol = :dynamic)

A multithreaded function like Base.map. Create a new container similar to A whose ith element is equal to f(A[i]). This container is filled in parallel: the data is divided into chunks and a parallel task is created per chunk.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tmap.

Example:

    tmap(sin, 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
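A short sketch of the OutputElementType argument (the calls are illustrative):

using OhMyThreads: tmap

ys = tmap(sin, Float64, 1:10)                 # element type fixed up front; typically fewer allocations
tmap(sin, Float64, 1:10; schedule = :greedy)  # :greedy requires the OutputElementType argument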
OhMyThreads.tmap! (Function)
tmap!(f, out, A::AbstractArray...;
       nchunks::Int = nthreads(),
       split::Symbol = :batch,
      schedule::Symbol = :dynamic)

A multithreaded function like Base.map!. In parallel on multiple tasks, this function assigns each element of out[i] = f(A[i]) for each index i of A and out.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tmap!.

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
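The docstring above carries no example, so here is a minimal usage sketch (names are illustrative):

using OhMyThreads: tmap!

A = collect(1:10)
out = similar(A, Float64)
tmap!(x -> x / 2, out, A)  # fills out[i] = f(A[i]) in parallel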
OhMyThreads.tforeach (Function)
tforeach(f, A::AbstractArray...;
          nchunks::Int = nthreads(),
          split::Symbol = :batch,
         schedule::Symbol = :dynamic)::Nothing

A multithreaded function like Base.foreach. Apply f to each element of A on multiple parallel tasks, and return nothing. I.e. it is the parallel equivalent of

for x in A
     f(x)
 end

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??tforeach.

Example:

    tforeach(1:10) do i
         println(i^2)
    end

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
OhMyThreads.tcollect (Function)
tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
          nchunks::Int = nthreads(),
         schedule::Symbol = :dynamic)

A multithreaded function like Base.collect. Essentially just calls tmap on the generator function and inputs.

The optional argument OutputElementType will select a specific element type for the returned container, and will generally incur fewer allocations than the version where OutputElementType is not specified.

To see the keyword argument options, check out ??tcollect.

Example:

    tcollect(sin(i) for i in 1:10)

Extended help

Keyword arguments:

  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the OutputElementType argument is provided.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
source
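A short sketch combining tcollect with OutputElementType (illustrative):

using OhMyThreads: tcollect

xs = tcollect(Float64, sin(i) for i in 1:10)  # element type fixed up front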
OhMyThreads.treducemap (Function)
treducemap(op, f, A::AbstractArray...;
            [init],
            nchunks::Int = nthreads(),
            split::Symbol = :batch,
           schedule::Symbol = :dynamic,
           outputtype::Type = Any)

Like tmapreduce except the order of the f and op arguments are switched. This is sometimes convenient with do-block notation. Perform a reduction over A, applying a single-argument function f to each element, and then combining them with the two-argument function op.

Note that op must be an associative function, in the sense that op(a, op(b, c)) ≈ op(op(a, b), c). If op is not (approximately) associative, you will get undefined results.

For parallelization, the data is divided into chunks and a parallel task is created per chunk.

To see the keyword argument options, check out ??treducemap.

Example:

 treducemap(+, √, [1, 2, 3, 4, 5])

is the parallelized version of sum(√, [1, 2, 3, 4, 5]) in the form

 (√1 + √2) + (√3 + √4) + √5

Extended help

Keyword arguments:

  • init optional keyword argument forwarded to mapreduce for the sequential parts of the calculation.
  • nchunks::Int (default nthreads()) is passed to ChunkSplitters.chunks to inform it how many pieces of data should be worked on in parallel. Greater nchunks typically helps with load balancing, but at the expense of creating more overhead.
  • split::Symbol (default :batch) is passed to ChunkSplitters.chunks to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If scatter is chosen, then your reducing operator op must be commutative in addition to being associative, or you could get incorrect results!
  • schedule::Symbol (default :dynamic), determines how the parallel portions of the calculation are scheduled. Options are one of
    • :dynamic: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
    • :static: can sometimes be more performant than :dynamic when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
    • :greedy: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of A in a non-deterministic order, and thus your reducing op must be commutative in addition to being associative, or you could get incorrect results! This schedule will however work with non-AbstractArray iterables. If you use the :greedy scheduler, we strongly recommend you provide an init keyword argument.
    • :interactive: like :dynamic but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without yielding as it can interfere with heartbeat processes running on the interactive threadpool.
  • outputtype::Type (default Any) will work as the asserted output type of parallel calculations. This is typically only needed if you are using a :static schedule, since the :dynamic schedule uses StableTasks.jl, but if you experience problems with type stability, you may be able to recover it with the outputtype keyword argument.

source

Non-Exported

OhMyThreads.@spawn: see StableTasks.jl
OhMyThreads.@spawnat: see StableTasks.jl
OhMyThreads.chunks: see ChunkSplitters.jl
OhMyThreads.TaskLocalValue: see TaskLocalValues.jl

Translation Guide

This page gives a general overview of how to translate patterns written with the built-in tools of Base.Threads to the OhMyThreads.jl API. Note that this is a rough guide and (intentionally) isn't meant to replace a systematic introduction to OhMyThreads.jl.

Basics

@threads

# Base.Threads
@threads for i in 1:10
    println(i)
end
# OhMyThreads
tforeach(1:10) do i
    println(i)
end
# Base.Threads
data = Vector{Int}(undef, 10) # preallocated output; the element type depends on calc (assumed here)
@threads for i in 1:10
    data[i] = calc(i)
end
# OhMyThreads: Variant 1
 data = tmap(i->calc(i), 1:10)
# OhMyThreads: Variant 2
data = tcollect(calc(i) for i in 1:10)