diff --git a/.github/dependabot.yml b/.github/dependabot.yml new file mode 100644 index 00000000..d60f0707 --- /dev/null +++ b/.github/dependabot.yml @@ -0,0 +1,7 @@ +# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates +version: 2 +updates: + - package-ecosystem: "github-actions" + directory: "/" # Location of package manifests + schedule: + interval: "monthly" diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index d04c5626..7f59ab23 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -18,25 +18,16 @@ jobs: arch: - x64 steps: - - uses: actions/checkout@v2 + - uses: actions/checkout@v4 - uses: julia-actions/setup-julia@v1 with: version: ${{ matrix.version }} arch: ${{ matrix.arch }} - - uses: actions/cache@v1 - env: - cache-name: cache-artifacts - with: - path: ~/.julia/artifacts - key: ${{ runner.os }}-test-${{ env.cache-name }}-${{ hashFiles('**/Project.toml') }} - restore-keys: | - ${{ runner.os }}-test-${{ env.cache-name }}- - ${{ runner.os }}-test- - ${{ runner.os }}- + - uses: julia-actions/cache@v1 - uses: julia-actions/julia-buildpkg@v1 - uses: julia-actions/julia-runtest@v1 - uses: julia-actions/julia-processcoverage@v1 - - uses: codecov/codecov-action@v1 + - uses: codecov/codecov-action@v4 with: file: lcov.info env: diff --git a/README.md b/README.md index 1a936e69..cf7cff12 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ [ci-img]: https://github.com/JuliaFolds2/OhMyThreads.jl/actions/workflows/ci.yml/badge.svg [ci-url]: https://github.com/JuliaFolds2/OhMyThreads.jl/actions/workflows/ci.yml -[cov-img]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl/branch/main/graph/badge.svg?token=Ze61CbGoO5 +[cov-img]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl/branch/master/graph/badge.svg [cov-url]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl [lifecycle-img]: https://img.shields.io/badge/lifecycle-experimental-red.svg @@ -32,283 +32,45 @@ 
|:-------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------:| | [![][docs-stable-img]][docs-stable-url] [![][docs-dev-img]][docs-dev-url] | [![][ci-img]][ci-url] [![][cov-img]][cov-url] | ![][lifecycle-img] | -This is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel -multithreaded calculations via higher-order functions, with a focus on -[data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) without needing to expose julia's -[Task](https://docs.julialang.org/en/v1/base/parallel/) model to users. +[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a +focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism), that can be used without having to worry much about manual [Task](https://docs.julialang.org/en/v1/base/parallel/) creation. -Unlike most JuliaFolds2 packages, it is not built off of -[Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl. -Rather, OhMyThreads is meant to be a simpler, more maintainable, and more accessible alternative to packages -like [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl). +Unlike most [JuliaFolds2](https://github.com/JuliaFolds2) packages, OhMyThreads.jl is not built off of [Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl. 
Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl). -OhMyThreads.jl re-exports the function `chunks` from -[ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl), and provides the following functions: +## Example -
tmapreduce -

+```julia +using OhMyThreads -``` -tmapreduce(f, op, A::AbstractArray...; - [init], - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic, - outputtype::Type = Any) -``` - -A multithreaded function like `Base.mapreduce`. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results. - -For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing - -``` - tmapreduce(√, +, [1, 2, 3, 4, 5]) -``` - -is the parallelized version of - -``` - (√1 + √2) + (√3 + √4) + √5 -``` - -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). - -## Keyword arguments: - - * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation. - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. 
Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. This is typically only - -needed if you are using a `:static` schedule, since the `:dynamic` schedule is uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument. - - -

-

- -____________________________ - -
treducemap -

- -``` -treducemap(op, f, A::AbstractArray...; - [init], - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic, - outputtype::Type = Any) -``` - -Like `tmapreduce` except the order of the `f` and `op` arguments are switched. This is sometimes convenient with `do`-block notation. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results. - -For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing - -``` - treducemap(+, √, [1, 2, 3, 4, 5]) -``` - -is the parallelized version of - -``` - (√1 + √2) + (√3 + √4) + √5 -``` - -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). - -## Keyword arguments: - - * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation. - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. 
Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. This is typically only - -needed if you are using a `:static` schedule, since the `:dynamic` schedule is uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument. - - -

-

- -____________________________ - -
treduce -

- -``` -treduce(op, A::AbstractArray...; - [init], - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic, - outputtype::Type = Any) -``` - -Like `tmapreduce` except the order of the `f` and `op` arguments are switched. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results. - -For a very well known example of `reduce`, `sum(A)` is equivalent to `reduce(+, A)`. Doing - -``` - treduce(+, [1, 2, 3, 4, 5]) -``` - -is the parallelized version of - -``` - (1 + 2) + (3 + 4) + 5 -``` - -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). - -## Keyword arguments: - - * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation. - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. 
Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. This is typically only - -needed if you are using a `:static` schedule, since the `:dynamic` schedule is uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument. - - -

-

- -____________________________ - -
tmap -

- -``` -tmap(f, [OutputElementType], A::AbstractArray...; - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic) -``` - -A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose `i`th element is equal to `f(A[i])`. This container is filled in parallel on multiple tasks. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified. - -## Keyword arguments: - - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the `OutputElementType` argument is provided. 
- * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - - -

-

- -____________________________ - -
tmap! -

- -``` -tmap!(f, out, A::AbstractArray...; - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic) -``` - -A multithreaded function like `Base.map!`. In parallel on multiple tasks, this function assigns each element of `out[i] = f(A[i])` for each index `i` of `A` and `out`. - -## Keyword arguments: - - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. 
This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - - -

-

- -____________________________ - -
tforeach -

- -``` -tforeach(f, A::AbstractArray...; - nchunks::Int = nthreads(), - split::Symbol = :batch, - schedule::Symbol =:dynamic) :: Nothing -``` - -A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on multiple parallel tasks, and return `nothing`, i.e. it is the parallel equivalent of - -``` -for x in A - f(x) +function mc_parallel(N; kw...) + M = tmapreduce(+, 1:N; kw...) do i + rand()^2 + rand()^2 < 1.0 + end + pi = 4 * M / N + return pi end -``` - -## Keyword arguments: - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of +N = 100_000_000 +mc_parallel(N) # gives, e.g., 3.14159924 - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. 
This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. +using BenchmarkTools +@show Threads.nthreads() # 5 in this example -

-

- -____________________________ +@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread +@btime mc_parallel($N) # running with all 5 Julia threads +``` -
tcollect -

+Timings might be something like this: ``` -tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}}; - nchunks::Int = nthreads(), - schedule::Symbol =:dynamic) +438.394 ms (7 allocations: 624 bytes) +88.050 ms (37 allocations: 3.02 KiB) ``` -A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the generator function and inputs. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified. - -## Keyword arguments: - - * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. - * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of - - * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system. - * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time. - * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the `OutputElementType` argument is provided. - * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. 
This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool. - +(Check out the full [Parallel Monte Carlo](https://juliafolds2.github.io/OhMyThreads.jl/stable/examples/mc/mc/) example if you like.) -

-

+## Documentation -____________________________ +For more information, please check out the [documentation](https://JuliaFolds2.github.io/OhMyThreads.jl/stable) of the latest release (or the [development version](https://JuliaFolds2.github.io/OhMyThreads.jl/dev) if you're curious). diff --git a/docs/make.jl b/docs/make.jl index 12c8760e..91f343b9 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -12,6 +12,7 @@ makedocs(; doctest = false, pages = [ "OhMyThreads" => "index.md", + # "Getting Started" => "examples/getting_started.md", "Examples" => [ "Parallel Monte Carlo" => "examples/mc/mc.md", "Julia Set" => "examples/juliaset/juliaset.md", @@ -19,6 +20,7 @@ makedocs(; # "Explanations" => [ # "B" => "explanations/B.md", # ], + "Translation Guide" => "translation.md", "References" => [ "Public API" => "refs/api.md", "Internal" => "refs/internal.md", diff --git a/docs/src/index.md b/docs/src/index.md index a302095d..10ea194b 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -1,10 +1,9 @@ # OhMyThreads.jl -[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations via higher-order functions, with a -focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) without needing to expose julia's -[Task](https://docs.julialang.org/en/v1/base/parallel/) model to users. +[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a +focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism), that can be used without having to worry much about manual [Task](https://docs.julialang.org/en/v1/base/parallel/) creation. -## Installation +## Quick Start The package is registered. 
Hence, you can simply use ``` @@ -12,10 +11,42 @@ The package is registered. Hence, you can simply use ``` to add the package to your Julia environment. -## Noteworthy Alternatives +### Basic example -* [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) -* [Folds.jl](https://github.com/JuliaFolds/Folds.jl) +```julia +using OhMyThreads + +function mc_parallel(N; kw...) + M = tmapreduce(+, 1:N; kw...) do i + rand()^2 + rand()^2 < 1.0 + end + pi = 4 * M / N + return pi +end + +N = 100_000_000 +mc_parallel(N) # gives, e.g., 3.14159924 + +using BenchmarkTools + +@show Threads.nthreads() # 5 in this example + +@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread +@btime mc_parallel($N) # running with all 5 Julia threads +``` + +Timings might be something like this: + +``` +438.394 ms (7 allocations: 624 bytes) +88.050 ms (37 allocations: 3.02 KiB) +``` + +(Check out the full [Parallel Monte Carlo](@ref) example if you like.) + +## No Transducers + +Unlike most [JuliaFolds2](https://github.com/JuliaFolds2) packages, OhMyThreads.jl is not built off of [Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages like, e.g., [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl). 
## Acknowledgements diff --git a/docs/src/refs/api.md b/docs/src/refs/api.md index 1d0fafce..01aeb64a 100644 --- a/docs/src/refs/api.md +++ b/docs/src/refs/api.md @@ -1,4 +1,4 @@ -# Public API +# [Public API](@id API) ## Index diff --git a/docs/src/translation.md b/docs/src/translation.md new file mode 100644 index 00000000..09da1c5e --- /dev/null +++ b/docs/src/translation.md @@ -0,0 +1,122 @@ +# Translation Guide + +This page tries to give a general overview of how to translate patterns written with the built-in tools of [Base.Threads](https://docs.julialang.org/en/v1/base/multi-threading/) to the [OhMyThreads.jl API](@ref API). Note that this should be seen as a rough guide and (intentionally) isn't supposed to replace a systematic introduction to OhMyThreads.jl. + +## Basics + +### `@threads` + +```julia +# Base.Threads +@threads for i in 1:10 + println(i) +end +``` + +```julia +# OhMyThreads +tforeach(1:10) do i + println(i) +end +``` + +#### `:static` scheduling + +```julia +# Base.Threads +@threads :static for i in 1:10 + println(i) +end +``` + +```julia +# OhMyThreads +tforeach(1:10; schedule=:static) do i + println(i) +end +``` + +### `@spawn` + +```julia +# Base.Threads +@sync for i in 1:10 + @spawn println(i) +end +``` + +```julia +# OhMyThreads +tforeach(1:10; nchunks=10) do i + println(i) +end +``` + +## Reduction + +Base.Threads has no built-in feature for reductions. + +```julia +# Base.Threads: basic manual implementation +data = rand(10) +chunks_itr = Iterators.partition(data, cld(length(data), nthreads())) +tasks = map(chunks_itr) do chunk + @spawn reduce(+, chunk) +end +reduce(+, fetch.(tasks)) +``` + +```julia +# OhMyThreads +data = rand(10) +treduce(+, data) +``` + +## Mutation + +!!! warning + Parallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. 
[false sharing](https://en.wikipedia.org/wiki/False_sharing)). You should carefully consider whether this is necessary or whether the use of task-local storage is the better option. + +```julia +# Base.Threads +data = rand(10) +@threads for i in 1:10 + data[i] = calc(i) +end +``` + +```julia +# OhMyThreads: Variant 1 +data = rand(10) +tforeach(1:10) do i + data[i] = calc(i) +end +``` + +```julia +# OhMyThreads: Variant 2 +data = rand(10) +tmap!(data, 1:10) do i + calc(i) +end +``` + +## Parallel initialization + +```julia +# Base.Threads +data = Vector{Float64}(undef, 10) +@threads for i in 1:10 + data[i] = calc(i) +end +``` + +```julia +# OhMyThreads: Variant 1 +data = tmap(i->calc(i), 1:10) +``` + +```julia +# OhMyThreads: Variant 2 +data = tcollect(calc(i) for i in 1:10) +``` \ No newline at end of file diff --git a/src/OhMyThreads.jl b/src/OhMyThreads.jl index cd10771f..c0cb8a10 100644 --- a/src/OhMyThreads.jl +++ b/src/OhMyThreads.jl @@ -14,21 +14,29 @@ export chunks, treduce, tmapreduce, treducemap, tmap, tmap!, tforeach, tcollect schedule::Symbol =:dynamic, outputtype::Type = Any) -A multithreaded function like `Base.mapreduce`. Perform a reduction over `A`, applying a single-argument -function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an -[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that -`op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined -results. +A multithreaded function like `Base.mapreduce`. Perform a reduction over `A`, applying a +single-argument function `f` to each element, and then combining them with the two-argument +function `op`. -For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. 
Doing +Note that `op` **must** be an +[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense +that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you +will get undefined results. + +For parallelization, the data is divided into chunks and a parallel task is created per +chunk. + +To see the keyword argument options, check out `??tmapreduce`. + +## Example: tmapreduce(√, +, [1, 2, 3, 4, 5]) -is the parallelized version of +is the parallelized version of `sum(√, [1, 2, 3, 4, 5])` in the form (√1 + √2) + (√3 + √4) + √5 -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). +# Extended help ## Keyword arguments: @@ -53,22 +61,30 @@ function tmapreduce end schedule::Symbol =:dynamic, outputtype::Type = Any) -Like `tmapreduce` except the order of the `f` and `op` arguments are switched. This is sometimes convenient with `do`-block notation. -Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument -function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, -in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will -get undefined results. +Like `tmapreduce` except the order of the `f` and `op` arguments are switched. This is +sometimes convenient with `do`-block notation. Perform a reduction over `A`, applying a +single-argument function `f` to each element, and then combining them with the two-argument +function `op`. -For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing +Note that `op` **must** be an +[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense +that `op(a, op(b, c)) ≈ op(op(a, b), c)`. 
If `op` is not (approximately) associative, you +will get undefined results. - treducemap(+, √, [1, 2, 3, 4, 5]) +For parallelization, the data is divided into chunks and a parallel task is created per +chunk. -is the parallelized version of +To see the keyword argument options, check out `??treducemap`. - (√1 + √2) + (√3 + √4) + √5 +## Example: + + treducemap(+, √, [1, 2, 3, 4, 5]) + +is the parallelized version of `sum(√, [1, 2, 3, 4, 5])` in the form + (√1 + √2) + (√3 + √4) + √5 -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). +# Extended help ## Keyword arguments: @@ -94,21 +110,28 @@ function treducemap end schedule::Symbol =:dynamic, outputtype::Type = Any) -A multithreaded function like `Base.reduce`. Perform a reduction over `A` using the two-argument -function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, -in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will -get undefined results. +A multithreaded function like `Base.reduce`. Perform a reduction over `A` using the +two-argument function `op`. + +Note that `op` **must** be an +[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense +that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you +will get undefined results. -For a very well known example of `reduce`, `sum(A)` is equivalent to `reduce(+, A)`. Doing +For parallelization, the data is divided into chunks and a parallel task is created per +chunk. - treduce(+, [1, 2, 3, 4, 5]) +To see the keyword argument options, check out `??treduce`.
-is the parallelized version of +## Example: - (1 + 2) + (3 + 4) + 5 + treduce(+, [1, 2, 3, 4, 5]) +is the parallelized version of `sum([1, 2, 3, 4, 5])` in the form -This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl). + (1 + 2) + (3 + 4) + 5 + +# Extended help ## Keyword arguments: @@ -131,12 +154,26 @@ function treduce end split::Symbol = :batch, schedule::Symbol =:dynamic) :: Nothing -A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on multiple parallel tasks, and return `nothing`, i.e. it is the parallel equivalent of +A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on +multiple parallel tasks, and return `nothing`. I.e. it is the parallel equivalent of for x in A f(x) end +For parallelization, the data is divided into chunks and a parallel task is created per +chunk. + +To see the keyword argument options, check out `??tforeach`. + +## Example: + + tforeach(1:10) do i + println(i^2) + end + +# Extended help + ## Keyword arguments: - `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead. @@ -150,16 +187,26 @@ A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` o function tforeach end """ - tmap(f, [OutputElementType], A::AbstractArray...; + tmap(f, [OutputElementType], A::AbstractArray...; nchunks::Int = nthreads(), split::Symbol = :batch, schedule::Symbol =:dynamic) -A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose `i`th element is -equal to `f(A[i])`. This container is filled in parallel on multiple tasks. 
The optional argument -`OutputElementType` will select a specific element type for the returned container, and will generally incur -fewer allocations than the version where `OutputElementType` is not specified. +A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose +`i`th element is equal to `f(A[i])`. This container is filled in parallel: the data is +divided into chunks and a parallel task is created per chunk. + +The optional argument `OutputElementType` will select a specific element type for the +returned container, and will generally incur fewer allocations than the version where +`OutputElementType` is not specified. + +To see the keyword argument options, check out `??tmap`. + +## Example: + tmap(sin, 1:10) + +# Extended help ## Keyword arguments: @@ -179,8 +226,15 @@ function tmap end split::Symbol = :batch, schedule::Symbol =:dynamic) -A multithreaded function like `Base.map!`. In parallel on multiple tasks, this function assigns each element -of `out[i] = f(A[i])` for each index `i` of `A` and `out`. +A multithreaded function like `Base.map!`. In parallel on multiple tasks, this function +assigns each element of `out[i] = f(A[i])` for each index `i` of `A` and `out`. + +For parallelization, the data is divided into chunks and a parallel task is created per +chunk. + +To see the keyword argument options, check out `??tmap!`. + +# Extended help ## Keyword arguments: @@ -199,8 +253,20 @@ function tmap! end nchunks::Int = nthreads(), schedule::Symbol =:dynamic) -A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the generator function and -inputs. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified. +A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the +generator function and inputs. 
+ +The optional argument `OutputElementType` will select a specific element type for the +returned container, and will generally incur fewer allocations than the version where +`OutputElementType` is not specified. + +To see the keyword argument options, check out `??tcollect`. + +## Example: + + tcollect(sin(i) for i in 1:10) + +# Extended help ## Keyword arguments: