From aeb9e6465d922119eafa1c456c02eb40425675aa Mon Sep 17 00:00:00 2001 From: Carsten Bauer Date: Fri, 2 Feb 2024 12:23:26 +0100 Subject: [PATCH 01/12] readme + index.md update --- README.md | 288 ++++------------------------------------------ docs/src/index.md | 45 ++++++-- 2 files changed, 63 insertions(+), 270 deletions(-) diff --git a/README.md b/README.md index 1a936e69..e5f9a1c0 100644 --- a/README.md +++ b/README.md @@ -32,283 +32,45 @@ |:-------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------:| | [![][docs-stable-img]][docs-stable-url] [![][docs-dev-img]][docs-dev-url] | [![][ci-img]][ci-url] [![][cov-img]][cov-url] | ![][lifecycle-img] | -This is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel -multithreaded calculations via higher-order functions, with a focus on -[data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) without needing to expose julia's -[Task](https://docs.julialang.org/en/v1/base/parallel/) model to users. +[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a +focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism), that can be used without having to worry much about manual [Task](https://docs.julialang.org/en/v1/base/parallel/) creation. -Unlike most JuliaFolds2 packages, it is not built off of -[Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl. -Rather, OhMyThreads is meant to be a simpler, more maintainable, and more accessible alternative to packages -like [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl). +Unlike most [JuliaFolds2](https://github.com/JuliaFolds2) packages, OhMyThreads.jl is not built off of [Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl). -OhMyThreads.jl re-exports the function `chunks` from -[ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl), and provides the following functions: +## Example -
tmapreduce -

+```julia +using OhMyThreads -``` -tmapreduce(f, op, A::AbstractArray...; - [init], - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic, - outputtype::Type = Any) -``` - -A multithreaded function like `Base.mapreduce`. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results. - -For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing - -``` - tmapreduce(√, +, [1, 2, 3, 4, 5]) -``` - -is the parallelized version of - -``` - (√1 + √2) + (√3 + √4) + √5 -``` - -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). - -## Keyword arguments: - - * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation. - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. 
This is typically only needed if you are using a `:static` schedule, since the `:dynamic` schedule uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument.
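For illustration, here is a small usage sketch of the `tmapreduce` keyword options described above (the workload is made up for the example):

```julia
using OhMyThreads
using Base.Threads: nthreads

# Hypothetical uneven workload: the cost of processing element i grows
# with i, so using more chunks than threads helps with load balancing.
tmapreduce(+, 1:1_000; nchunks = 4 * nthreads()) do i
    sum(abs2, 1:i)
end

# With the :greedy schedule, providing `init` is strongly recommended.
tmapreduce(sin, +, 1:1_000; schedule = :greedy, init = 0.0)
```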

-

- -____________________________ - -
treducemap -

- -``` -treducemap(op, f, A::AbstractArray...; - [init], - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic, - outputtype::Type = Any) -``` - -Like `tmapreduce` except the order of the `f` and `op` arguments are switched. This is sometimes convenient with `do`-block notation. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results. - -For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing - -``` - treducemap(+, √, [1, 2, 3, 4, 5]) -``` - -is the parallelized version of - -``` - (√1 + √2) + (√3 + √4) + √5 -``` - -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). - -## Keyword arguments: - - * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation. - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. 
This is typically only needed if you are using a `:static` schedule, since the `:dynamic` schedule uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument.
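To see why the `op`-first argument order of `treducemap` pairs well with `do`-blocks, here is a minimal sketch (assuming the signature shown above):

```julia
using OhMyThreads

# The do-block supplies the two-argument reducing operator `op`,
# while the mapping function `f` (here √) is passed positionally.
treducemap(√, [1, 2, 3, 4, 5]) do a, b
    a + b  # the reducing operator must be associative
end
```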

-

- -____________________________ - -
treduce -

- -``` -treduce(op, A::AbstractArray...; - [init], - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic, - outputtype::Type = Any) -``` - -Like `tmapreduce` except the order of the `f` and `op` arguments are switched. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results. - -For a very well known example of `reduce`, `sum(A)` is equivalent to `reduce(+, A)`. Doing - -``` - treduce(+, [1, 2, 3, 4, 5]) -``` - -is the parallelized version of - -``` - (1 + 2) + (3 + 4) + 5 -``` - -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). - -## Keyword arguments: - - * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation. - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. 
This is typically only needed if you are using a `:static` schedule, since the `:dynamic` schedule uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument.
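A minimal `treduce` sketch (any associative two-argument function can serve as `op`):

```julia
using OhMyThreads

data = rand(1_000)
treduce(+, data)    # parallel sum
treduce(max, data)  # parallel maximum; max is associative
```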

-

- -____________________________ - -
tmap -

- -``` -tmap(f, [OutputElementType], A::AbstractArray...; - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic) -``` - -A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose `i`th element is equal to `f(A[i])`. This container is filled in parallel on multiple tasks. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified. - -## Keyword arguments: - - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the `OutputElementType` argument is provided. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - - -
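A small sketch of the two `tmap` call forms described above:

```julia
using OhMyThreads

# With an explicit output element type (generally fewer allocations).
squares = tmap(x -> x^2, Int, 1:10)

# Without it, the element type is determined from the results.
doubled = tmap(x -> 2x, [1, 2, 3])
```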

-

- -____________________________ - -
tmap! -

- -``` -tmap!(f, out, A::AbstractArray...; - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic) -``` - -A multithreaded function like `Base.map!`. In parallel on multiple tasks, this function assigns each element of `out[i] = f(A[i])` for each index `i` of `A` and `out`. - -## Keyword arguments: - - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - - -
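A minimal `tmap!` sketch, filling a preallocated output in parallel:

```julia
using OhMyThreads

A = collect(1:10)
out = similar(A)
tmap!(x -> x^2, out, A)  # afterwards, out[i] == A[i]^2
```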

-

- -____________________________ - -
tforeach -

- -``` -tforeach(f, A::AbstractArray...; - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic) :: Nothing -``` - -A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on multiple parallel tasks, and return `nothing`, i.e. it is the parallel equivalent of - -``` -for x in A - f(x) +function mc_parallel(N; kw...) + M = tmapreduce(+, 1:N; kw...) do i + rand()^2 + rand()^2 < 1.0 + end + pi = 4 * M / N + return pi end -``` - -## Keyword arguments: - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of +N = 100_000_000 +mc_parallel(N) # gives, e.g., 3.14159924 - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. +using BenchmarkTools +@assert Threads.nthreads() == 5 -
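A minimal `tforeach` sketch, mirroring the sequential loop in the docstring above:

```julia
using OhMyThreads

tforeach(1:10) do i
    println(i^2)  # side effects only; tforeach returns nothing
end
```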

-

- -____________________________ +@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread +@btime mc_parallel($N) # running with all 5 Julia threads +``` -
tcollect -

+Timings might be something like this: ``` -tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}}; - nchunks::Int = nthreads(), - schedule::Symbol =:dynamic) +438.394 ms (7 allocations: 624 bytes) +88.050 ms (37 allocations: 3.02 KiB) ``` -A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the generator function and inputs. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified. - -## Keyword arguments: - - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the `OutputElementType` argument is provided. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - +(Check out the full [Parallel Monte Carlo](https://juliafolds2.github.io/OhMyThreads.jl/stable/examples/mc/mc/) example if you like.) -
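A small sketch of both `tcollect` call forms:

```julia
using OhMyThreads

tcollect(sin(i) for i in 1:10)           # parallel collect of a generator
tcollect(Float64, sin(i) for i in 1:10)  # with an explicit element type
```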

-

+## Documentation -____________________________ +For more information, please check out the [documentation](https://JuliaFolds2.github.io/OhMyThreads.jl/stable) of the latest release (or the [development version](https://JuliaFolds2.github.io/OhMyThreads.jl/dev) if you're curious). diff --git a/docs/src/index.md b/docs/src/index.md index a302095d..52ab6ebc 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -1,10 +1,9 @@ # OhMyThreads.jl -[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations via higher-order functions, with a -focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) without needing to expose julia's -[Task](https://docs.julialang.org/en/v1/base/parallel/) model to users. +[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a +focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism), that can be used without having to worry much about manual [Task](https://docs.julialang.org/en/v1/base/parallel/) creation. -## Installation +## Quick Start The package is registered. Hence, you can simply use ``` @@ -12,10 +11,42 @@ The package is registered. Hence, you can simply use ``` to add the package to your Julia environment. -## Noteworthy Alternatives +### Basic example -* [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) -* [Folds.jl](https://github.com/JuliaFolds/Folds.jl) +```julia +using OhMyThreads + +function mc_parallel(N; kw...) + M = tmapreduce(+, 1:N; kw...) do i + rand()^2 + rand()^2 < 1.0 + end + pi = 4 * M / N + return pi +end + +N = 100_000_000 +mc_parallel(N) # gives, e.g., 3.14159924 + +using BenchmarkTools + +@assert Threads.nthreads() == 5 + +@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread +@btime mc_parallel($N) # running with all 5 Julia threads +``` + +Timings might be something like this: + +``` +438.394 ms (7 allocations: 624 bytes) +88.050 ms (37 allocations: 3.02 KiB) +``` + +(Check out the full [Parallel Monte Carlo](@ref) example if you like.) + +## No Transducers + +Unlike most [JuliaFolds2](https://github.com/JuliaFolds2) packages, OhMyThreads.jl is not built off of [Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl). 
## Acknowledgements From f6d17f5ca92ca97f1d596d4502171f88fb76c2c9 Mon Sep 17 00:00:00 2001 From: Carsten Bauer Date: Fri, 2 Feb 2024 13:27:50 +0100 Subject: [PATCH 02/12] assert -> show --- README.md | 2 +- docs/src/index.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index e5f9a1c0..a00c6227 100644 --- a/README.md +++ b/README.md @@ -55,7 +55,7 @@ mc_parallel(N) # gives, e.g., 3.14159924 using BenchmarkTools -@assert Threads.nthreads() == 5 +@show Threads.nthreads() # 5 in this example @btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread @btime mc_parallel($N) # running with all 5 Julia threads diff --git a/docs/src/index.md b/docs/src/index.md index 52ab6ebc..10ea194b 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -29,7 +29,7 @@ mc_parallel(N) # gives, e.g., 3.14159924 using BenchmarkTools -@assert Threads.nthreads() == 5 +@show Threads.nthreads() # 5 in this example @btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread @btime mc_parallel($N) # running with all 5 Julia threads From e0a4cff1115967964809bd4a528d7c4eb2c7109d Mon Sep 17 00:00:00 2001 From: Carsten Bauer Date: Fri, 2 Feb 2024 13:54:14 +0100 Subject: [PATCH 03/12] translation page for docs --- docs/make.jl | 2 + docs/src/translation.md | 119 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 121 insertions(+) create mode 100644 docs/src/translation.md diff --git a/docs/make.jl b/docs/make.jl index 12c8760e..94024cfb 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -12,6 +12,8 @@ makedocs(; doctest = false, pages = [ "OhMyThreads" => "index.md", + # "Getting Started" => "examples/getting_started.md", + "Translation" => "translation.md", "Examples" => [ "Parallel Monte Carlo" => "examples/mc/mc.md", "Julia Set" => "examples/juliaset/juliaset.md", diff --git a/docs/src/translation.md b/docs/src/translation.md new file mode 100644 index 00000000..67a30364 --- /dev/null +++ b/docs/src/translation.md @@ -0,0 +1,119 @@ +# Translation + +## Basic + +### `@threads` + +```julia +# Base.Threads +@threads for i in 1:10 + println(i) +end +``` + +```julia +# OhMyThreads +tforeach(1:10) do i + println(i) +end +``` + +#### `:static` scheduling + +```julia +# Base.Threads +@threads :static for i in 1:10 + println(i) +end +``` + +```julia +# OhMyThreads +tforeach(1:10; schedule=:static) do i + println(i) +end +``` + +### `@spawn` + +```julia +# Base.Threads +@sync for i in 1:10 + @spawn println(i) +end +``` + +```julia +# OhMyThreads +tforeach(1:10; nchunks=10) do i + println(i) +end +``` + +## Reduction + +No built-in feature in Base.Threads. + +```julia +# Base.Threads: basic manual implementation +data = rand(10) +chunks_itr = Iterators.partition(data, length(data) ÷ nthreads()) +tasks = map(chunks_itr) do chunk + @spawn reduce(+, chunk) +end +reduce(+, fetch.(tasks)) +``` + +```julia +# OhMyThreads +data = rand(10) +treduce(+, data) +``` + +## Mutation + +TODO: Remark why one has to be careful here (e.g. false sharing). 
+
+```julia
+# Base.Threads
+data = rand(10)
+@threads for i in 1:10
+    data[i] = calc(i)
+end
+```
+
+```julia
+# OhMyThreads: Variant 1
+data = rand(10)
+tforeach(1:10) do i
+    data[i] = calc(i)
+end
+```
+
+```julia
+# OhMyThreads: Variant 2
+data = rand(10)
+tmap!(data, data) do i # TODO: comment on aliasing
+    calc(i)
+end
+```
+
+## Parallel initialization
+
+```julia
+# Base.Threads
+data = Vector{Float64}(undef, 10)
+@threads for i in 1:10
+    data[i] = calc(i)
+end
+```
+
+```julia
+# OhMyThreads: Variant 1
+data = tmap(i->calc(i), 1:10)
+```
+
+```julia
+# OhMyThreads: Variant 2
+data = tcollect(calc(i) for i in 1:10)
+```
\ No newline at end of file

From cf0b57865b5a0250f53dd85584a5d3678ebd21a3 Mon Sep 17 00:00:00 2001
From: Carsten Bauer
Date: Fri, 2 Feb 2024 14:15:47 +0100
Subject: [PATCH 04/12] fix codecov badge

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a00c6227..cf7cff12 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@
 [ci-img]: https://github.com/JuliaFolds2/OhMyThreads.jl/actions/workflows/ci.yml/badge.svg
 [ci-url]: https://github.com/JuliaFolds2/OhMyThreads.jl/actions/workflows/ci.yml
 
-[cov-img]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl/branch/main/graph/badge.svg?token=Ze61CbGoO5
+[cov-img]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl/branch/master/graph/badge.svg
 [cov-url]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl
 
 [lifecycle-img]: https://img.shields.io/badge/lifecycle-experimental-red.svg

From e31e11b9b408c8bd00b4fbd49f15252deccc9d86 Mon Sep 17 00:00:00 2001
From: Carsten Bauer
Date: Fri, 2 Feb 2024 14:24:30 +0100
Subject: [PATCH 05/12] minor improvements

---
 docs/make.jl            | 2 +-
 docs/src/refs/api.md    | 2 +-
 docs/src/translation.md | 8 ++++++--
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/docs/make.jl b/docs/make.jl
index 94024cfb..91f343b9 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -13,7 +13,6 @@ makedocs(;
     pages = [
         "OhMyThreads" => "index.md",
         # "Getting Started" => "examples/getting_started.md",
-        "Translation" => "translation.md",
         "Examples" => [
             "Parallel Monte Carlo" => "examples/mc/mc.md",
             "Julia Set" => "examples/juliaset/juliaset.md",
@@ -21,6 +20,7 @@ makedocs(;
         # "Explanations" => [
         #     "B" => "explanations/B.md",
         # ],
+        "Translation Guide" => "translation.md",
         "References" => [
             "Public API" => "refs/api.md",
             "Internal" => "refs/internal.md",
diff --git a/docs/src/refs/api.md b/docs/src/refs/api.md
index 1d0fafce..01aeb64a 100644
--- a/docs/src/refs/api.md
+++ b/docs/src/refs/api.md
@@ -1,4 +1,4 @@
-# Public API
+# [Public API](@id API)
 
 ## Index
 
diff --git a/docs/src/translation.md b/docs/src/translation.md
index 67a30364..881230dd 100644
--- a/docs/src/translation.md
+++ b/docs/src/translation.md
@@ -1,6 +1,10 @@
-# Translation
+# Translation Guide
 
-## Basic
+This page tries to give a general overview of how to translate patterns written with the built-in tools of [Base.Threads](https://docs.julialang.org/en/v1/base/multi-threading/) using the [OhMyThreads.jl API](@ref API).
+
+Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into the [OhMyThreads.jl API](@ref API). 
+ +## Basics ### `@threads` From e8a6907f415218b78ee92efc11332992855d1871 Mon Sep 17 00:00:00 2001 From: Carsten Bauer Date: Fri, 2 Feb 2024 14:33:11 +0100 Subject: [PATCH 06/12] warning about mutation --- docs/src/translation.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/src/translation.md b/docs/src/translation.md index 881230dd..54a926c9 100644 --- a/docs/src/translation.md +++ b/docs/src/translation.md @@ -76,7 +76,8 @@ treduce(+, data) ## Mutation -TODO: Remark why one has to be careful here (e.g. false sharing). +!!! warning + Parallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. [false sharing](https://en.wikipedia.org/wiki/False_sharing#:~:text=False%20sharing%20is%20an%20inherent,is%20limited%20to%20RAM%20caches.)). You should carefully consider whether this is necessary or whether the use of task-local storage is the better option. ```julia # Base.Threads @@ -97,7 +98,7 @@ end ```julia # OhMyThreads: Variant 2 data = rand(10) -tmap!(data, data) do i # TODO: comment on aliasing +tmap!(data, data) do i # this kind of aliasing is fine calc(i) end ``` From 0a8455956e4f75e24620fea74ac15ed3b759ae47 Mon Sep 17 00:00:00 2001 From: Carsten Bauer Date: Fri, 2 Feb 2024 14:36:22 +0100 Subject: [PATCH 07/12] final stroke --- docs/src/translation.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/src/translation.md b/docs/src/translation.md index 54a926c9..09da1c5e 100644 --- a/docs/src/translation.md +++ b/docs/src/translation.md @@ -1,8 +1,6 @@ # Translation Guide -This page tries to give a general overview of how to translate patterns written with the built-in tools of [Base.Threads](https://docs.julialang.org/en/v1/base/multi-threading/) using the [OhMyThreads.jl API](@ref API). - -Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into the [OhMyThreads.jl API](@ref API). +This page tries to give a general overview of how to translate patterns written with the built-in tools of [Base.Threads](https://docs.julialang.org/en/v1/base/multi-threading/) using the [OhMyThreads.jl API](@ref API). Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction into OhMyThreads.jl. ## Basics From d03954c78b9c9430a693f2b9b59f6085a5a12eac Mon Sep 17 00:00:00 2001 From: Carsten Bauer Date: Fri, 2 Feb 2024 15:37:11 +0100 Subject: [PATCH 08/12] improve docstrings and close #27 --- src/OhMyThreads.jl | 140 +++++++++++++++++++++++++++++++++------------ 1 file changed, 103 insertions(+), 37 deletions(-) diff --git a/src/OhMyThreads.jl b/src/OhMyThreads.jl index cd10771f..c0cb8a10 100644 --- a/src/OhMyThreads.jl +++ b/src/OhMyThreads.jl @@ -14,21 +14,29 @@ export chunks, treduce, tmapreduce, treducemap, tmap, tmap!, tforeach, tcollect schedule::Symbol =:dynamic, outputtype::Type = Any) -A multithreaded function like `Base.mapreduce`. Perform a reduction over `A`, applying a single-argument -function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an -[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that -`op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined -results. +A multithreaded function like `Base.mapreduce`. 
Perform a reduction over `A`, applying a
+single-argument function `f` to each element, and then combining them with the two-argument
+function `op`.
 
-For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing
+Note that `op` **must** be an
+[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense
+that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you
+will get undefined results.
+
+For parallelization, the data is divided into chunks and a parallel task is created per
+chunk.
+
+To see the keyword argument options, check out `??tmapreduce`.
+
+## Example:
 
     tmapreduce(√, +, [1, 2, 3, 4, 5])
 
-is the parallelized version of
+is the parallelized version of `sum(√, [1, 2, 3, 4, 5])` in the form
 
     (√1 + √2) + (√3 + √4) + √5
 
-This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl).
+# Extended help
 
 ## Keyword arguments:
 
@@ -53,22 +61,30 @@ function tmapreduce end
         schedule::Symbol =:dynamic,
         outputtype::Type = Any)
 
-Like `tmapreduce` except the order of the `f` and `op` arguments are switched. This is sometimes convenient with `do`-block notation.
-Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument
-function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function,
-in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will
-get undefined results.
+Like `tmapreduce` except the order of the `f` and `op` arguments is switched. This is
+sometimes convenient with `do`-block notation. Perform a reduction over `A`, applying a
+single-argument function `f` to each element, and then combining them with the two-argument
+function `op`.
+
+Note that `op` **must** be an
+[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense
+that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you
+will get undefined results.
 
-For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing
+For parallelization, the data is divided into chunks and a parallel task is created per
+chunk.
 
-    treducemap(+, √, [1, 2, 3, 4, 5])
+To see the keyword argument options, check out `??treducemap`.
 
-is the parallelized version of
+## Example:
+
+    treducemap(+, √, [1, 2, 3, 4, 5])
+
+is the parallelized version of `sum(√, [1, 2, 3, 4, 5])` in the form
 
-    (√1 + √2) + (√3 + √4) + √5
+    (√1 + √2) + (√3 + √4) + √5
+
+# Extended help
 
-This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl).
 
 ## Keyword arguments:
 
@@ -94,21 +110,28 @@ function treducemap end
         schedule::Symbol =:dynamic,
         outputtype::Type = Any)
 
-A multithreaded function like `Base.reduce`. Perform a reduction over `A` using the two-argument
-function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function,
-in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will
-get undefined results.
+A multithreaded function like `Base.reduce`. Perform a reduction over `A` using the
+two-argument function `op`. 
+ +Note that `op` **must** be an +[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense +that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you +will get undefined results. -For a very well known example of `reduce`, `sum(A)` is equivalent to `reduce(+, A)`. Doing +For parallelization, the data is divided into chunks and a parallel task is created per +chunk. - treduce(+, [1, 2, 3, 4, 5]) +To see the keyword argument options, check out `??treduce`. -is the parallelized version of +## Example: - (1 + 2) + (3 + 4) + 5 + treduce(+, [1, 2, 3, 4, 5]) +is the parallelized version of `sum([1, 2, 3, 4, 5])` in the form -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). + (1 + 2) + (3 + 4) + 5 + +# Extended help ## Keyword arguments: @@ -131,12 +154,26 @@ function treduce end split::Symbol = :batch, schedule::Symbol =:dynamic) :: Nothing -A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on multiple parallel tasks, and return `nothing`, i.e. it is the parallel equivalent of +A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on +multiple parallel tasks, and return `nothing`. I.e. it is the parallel equivalent of for x in A f(x) end +For parallelization, the data is divided into chunks and a parallel task is created per +chunk. + +To see the keyword argument options, check out `??tforeach`. + +## Example: + + tforeach(1:10) do i + println(i^2) + end + +# Extended help + ## Keyword arguments: - `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. @@ -150,16 +187,26 @@ A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` o function tforeach end """ - tmap(f, [OutputElementType], A::AbstractArray...; + tmap(f, [OutputElementType], A::AbstractArray...; nchunks::Int = nthreads(), split::Symbol = :batch, schedule::Symbol =:dynamic) -A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose `i`th element is -equal to `f(A[i])`. This container is filled in parallel on multiple tasks. The optional argument -`OutputElementType` will select a specific element type for the returned container, and will generally incur -fewer allocations than the version where `OutputElementType` is not specified. +A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose +`i`th element is equal to `f(A[i])`. This container is filled in parallel: the data is +divided into chunks and a parallel task is created per chunk. + +The optional argument `OutputElementType` will select a specific element type for the +returned container, and will generally incur fewer allocations than the version where +`OutputElementType` is not specified. + +To see the keyword argument options, check out `??tmap`. + +## Example: + tmap(sin, 1:10) + +# Extended help ## Keyword arguments: @@ -179,8 +226,15 @@ function tmap end split::Symbol = :batch, schedule::Symbol =:dynamic) -A multithreaded function like `Base.map!`. In parallel on multiple tasks, this function assigns each element -of `out[i] = f(A[i])` for each index `i` of `A` and `out`. +A multithreaded function like `Base.map!`. 
In parallel on multiple tasks, this function +assigns each element of `out[i] = f(A[i])` for each index `i` of `A` and `out`. + +For parallelization, the data is divided into chunks and a parallel task is created per +chunk. + +To see the keyword argument options, check out `??tmap!`. + +# Extended help ## Keyword arguments: @@ -199,8 +253,20 @@ function tmap! end nchunks::Int = nthreads(), schedule::Symbol =:dynamic) -A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the generator function and -inputs. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified. +A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the +generator function and inputs. + +The optional argument `OutputElementType` will select a specific element type for the +returned container, and will generally incur fewer allocations than the version where +`OutputElementType` is not specified. + +To see the keyword argument options, check out `??tcollect`. + +## Example: + + tcollect(sin(i) for i in 1:10) + +# Extended help ## Keyword arguments: From 441e0f88dcd97aafb480c1616cd0312f4d51c2de Mon Sep 17 00:00:00 2001 From: Hendrik Ranocha Date: Fri, 2 Feb 2024 15:55:38 +0100 Subject: [PATCH 09/12] use julia-actions/cache instead of actions/cache --- .github/workflows/ci.yml | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index d04c5626..e9e79b90 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -23,16 +23,7 @@ jobs: with: version: ${{ matrix.version }} arch: ${{ matrix.arch }} - - uses: actions/cache@v1 - env: - cache-name: cache-artifacts - with: - path: ~/.julia/artifacts - key: ${{ runner.os }}-test-${{ env.cache-name }}-${{ hashFiles('**/Project.toml') }} - restore-keys: | - ${{ runner.os }}-test-${{ env.cache-name }}- - ${{ runner.os }}-test- - ${{ runner.os }}- + - uses: julia-actions/cache@v1 - uses: julia-actions/julia-buildpkg@v1 - uses: julia-actions/julia-runtest@v1 - uses: julia-actions/julia-processcoverage@v1 From 41c2b253acfb09b4566cb843eb204a16148a73dc Mon Sep 17 00:00:00 2001 From: Hendrik Ranocha Date: Fri, 2 Feb 2024 15:57:46 +0100 Subject: [PATCH 10/12] Create dependabot.yml --- .github/dependabot.yml | 7 +++++++ 1 file changed, 7 insertions(+) create mode 100644 .github/dependabot.yml diff --git a/.github/dependabot.yml b/.github/dependabot.yml new file mode 100644 index 00000000..d60f0707 --- /dev/null +++ b/.github/dependabot.yml @@ -0,0 +1,7 @@ +# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates +version: 2 +updates: + - package-ecosystem: "github-actions" + directory: "/" # Location of package manifests + schedule: + interval: "monthly" From d6b646da48f590ee327532179ca09b8c91044e0a Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 2 Feb 2024 16:22:42 +0000 Subject: [PATCH 11/12] Bump actions/checkout from 2 to 4 Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 4. 
- [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v2...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] --- .github/workflows/ci.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index e9e79b90..078199d2 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -18,7 +18,7 @@ jobs: arch: - x64 steps: - - uses: actions/checkout@v2 + - uses: actions/checkout@v4 - uses: julia-actions/setup-julia@v1 with: version: ${{ matrix.version }} From 296d7d85b1271b884bfde6bd15204e4cf1541dc5 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 2 Feb 2024 16:22:45 +0000 Subject: [PATCH 12/12] Bump codecov/codecov-action from 1 to 4 Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 1 to 4. - [Release notes](https://github.com/codecov/codecov-action/releases) - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md) - [Commits](https://github.com/codecov/codecov-action/compare/v1...v4) --- updated-dependencies: - dependency-name: codecov/codecov-action dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] --- .github/workflows/ci.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index e9e79b90..0b696321 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -27,7 +27,7 @@ jobs: - uses: julia-actions/julia-buildpkg@v1 - uses: julia-actions/julia-runtest@v1 - uses: julia-actions/julia-processcoverage@v1 - - uses: codecov/codecov-action@v1 + - uses: codecov/codecov-action@v4 with: file: lcov.info env: