diff --git a/.github/dependabot.yml b/.github/dependabot.yml
new file mode 100644
index 00000000..d60f0707
--- /dev/null
+++ b/.github/dependabot.yml
@@ -0,0 +1,7 @@
+# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/" # Location of package manifests
+    schedule:
+      interval: "monthly"
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index d04c5626..7f59ab23 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -18,25 +18,16 @@ jobs:
         arch:
           - x64
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v4
       - uses: julia-actions/setup-julia@v1
         with:
           version: ${{ matrix.version }}
           arch: ${{ matrix.arch }}
-      - uses: actions/cache@v1
-        env:
-          cache-name: cache-artifacts
-        with:
-          path: ~/.julia/artifacts
-          key: ${{ runner.os }}-test-${{ env.cache-name }}-${{ hashFiles('**/Project.toml') }}
-          restore-keys: |
-            ${{ runner.os }}-test-${{ env.cache-name }}-
-            ${{ runner.os }}-test-
-            ${{ runner.os }}-
+      - uses: julia-actions/cache@v1
       - uses: julia-actions/julia-buildpkg@v1
       - uses: julia-actions/julia-runtest@v1
       - uses: julia-actions/julia-processcoverage@v1
-      - uses: codecov/codecov-action@v1
+      - uses: codecov/codecov-action@v4
         with:
           file: lcov.info
         env:
diff --git a/README.md b/README.md
index 1a936e69..cf7cff12 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@
 [ci-img]: https://github.com/JuliaFolds2/OhMyThreads.jl/actions/workflows/ci.yml/badge.svg
 [ci-url]: https://github.com/JuliaFolds2/OhMyThreads.jl/actions/workflows/ci.yml
 
-[cov-img]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl/branch/main/graph/badge.svg?token=Ze61CbGoO5
+[cov-img]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl/branch/master/graph/badge.svg
 [cov-url]: https://codecov.io/gh/JuliaFolds2/OhMyThreads.jl
 
 [lifecycle-img]: https://img.shields.io/badge/lifecycle-experimental-red.svg
@@ -32,283 +32,45 @@
 |:-------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------:|
 | [![][docs-stable-img]][docs-stable-url] [![][docs-dev-img]][docs-dev-url] | [![][ci-img]][ci-url] [![][cov-img]][cov-url] | ![][lifecycle-img] |
 
-This is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel
-multithreaded calculations via higher-order functions, with a focus on
-[data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) without needing to expose julia's
-[Task](https://docs.julialang.org/en/v1/base/parallel/) model to users.
+[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a
+focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism), that can be used without having to worry much about manual [Task](https://docs.julialang.org/en/v1/base/parallel/) creation.
 
-Unlike most JuliaFolds2 packages, it is not built off of
-[Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl.
-Rather, OhMyThreads is meant to be a simpler, more maintainable, and more accessible alternative to packages
-like [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl).
+Unlike most [JuliaFolds2](https://github.com/JuliaFolds2) packages, OhMyThreads.jl is not built off of [Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages such as [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl).
 
-OhMyThreads.jl re-exports the function `chunks` from
-[ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl), and provides the following functions:
+## Example
 
-<details><summary> tmapreduce </summary>
-<p>
+```julia
+using OhMyThreads
 
-```
-tmapreduce(f, op, A::AbstractArray...;
-           [init],
-           nchunks::Int = nthreads(),
-           split::Symbol = :batch,
-           schedule::Symbol =:dynamic,
-           outputtype::Type = Any)
-```
-
-A multithreaded function like `Base.mapreduce`. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results.
-
-For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing
-
-```
- tmapreduce(√, +, [1, 2, 3, 4, 5])
-```
-
-is the parallelized version of
-
-```
- (√1 + √2) + (√3 + √4) + √5
-```
-
-This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl).
-
-## Keyword arguments:
-
-  * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation.
-  * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead.
-  * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results!
-  * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of
-
-      * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
-      * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
-      * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument.
-      * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool.
-  * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. This is typically only
-
-needed if you are using a `:static` schedule, since the `:dynamic` schedule is uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument.
-
-
-</details>
-</p>
-
-____________________________
-
-<details><summary> treducemap </summary>
-<p>
-
-```
-treducemap(op, f, A::AbstractArray...;
-           [init],
-           nchunks::Int = nthreads(),
-           split::Symbol = :batch,
-           schedule::Symbol =:dynamic,
-           outputtype::Type = Any)
-```
-
-Like `tmapreduce` except the order of the `f` and `op` arguments are switched. This is sometimes convenient with `do`-block notation. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results.
-
-For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing
-
-```
- treducemap(+, √, [1, 2, 3, 4, 5])
-```
-
-is the parallelized version of
-
-```
- (√1 + √2) + (√3 + √4) + √5
-```
-
-This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl).
-
-## Keyword arguments:
-
-  * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation.
-  * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead.
-  * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results!
-  * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of
-
-      * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
-      * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
-      * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument.
-      * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool.
-  * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. This is typically only
-
-needed if you are using a `:static` schedule, since the `:dynamic` schedule is uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument.
-
-
-</details>
-</p>
-
-____________________________
-
-<details><summary> treduce </summary>
-<p>
-
-```
-treduce(op, A::AbstractArray...;
-        [init],
-        nchunks::Int = nthreads(),
-        split::Symbol = :batch,
-        schedule::Symbol =:dynamic,
-        outputtype::Type = Any)
-```
-
-Like `tmapreduce` except the order of the `f` and `op` arguments are switched. Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined results.
-
-For a very well known example of `reduce`, `sum(A)` is equivalent to `reduce(+, A)`. Doing
-
-```
- treduce(+, [1, 2, 3, 4, 5])
-```
-
-is the parallelized version of
-
-```
- (1 + 2) + (3 + 4) + 5
-```
-
-This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl).
-
-## Keyword arguments:
-
-  * `init` optional keyword argument forwarded to `mapreduce` for the sequential parts of the calculation.
-  * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead.
-  * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results!
-  * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of
-
-      * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
-      * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
-      * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule will read from the contents of `A` in a non-deterministic order, and thus your reducing `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results! This schedule will however work with non-`AbstractArray` iterables. If you use the `:greedy` scheduler, we strongly recommend you provide an `init` keyword argument.
-      * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool.
-  * `outputtype::Type` (default `Any`) will work as the asserted output type of parallel calculations. This is typically only
-
-needed if you are using a `:static` schedule, since the `:dynamic` schedule is uses [StableTasks.jl](https://github.com/MasonProtter/StableTasks.jl), but if you experience problems with type stability, you may be able to recover it with the `outputtype` keyword argument.
-
-
-</details>
-</p>
-
-____________________________
-
-<details><summary> tmap </summary>
-<p>
-
-```
-tmap(f, [OutputElementType], A::AbstractArray...; 
-     nchunks::Int = nthreads(),
-     split::Symbol = :batch,
-     schedule::Symbol =:dynamic)
-```
-
-A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose `i`th element is equal to `f(A[i])`. This container is filled in parallel on multiple tasks. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified.
-
-## Keyword arguments:
-
-  * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead.
-  * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results!
-  * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of
-
-      * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
-      * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
-      * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the `OutputElementType` argument is provided.
-      * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool.
-
-
-</details>
-</p>
-
-____________________________
-
-<details><summary> tmap! </summary>
-<p>
-
-```
-tmap!(f, out, A::AbstractArray...;
-      nchunks::Int = nthreads(),
-      split::Symbol = :batch,
-      schedule::Symbol =:dynamic)
-```
-
-A multithreaded function like `Base.map!`. In parallel on multiple tasks, this function assigns each element of `out[i] = f(A[i])` for each index `i` of `A` and `out`.
-
-## Keyword arguments:
-
-  * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead.
-  * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results!
-  * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of
-
-      * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
-      * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
-      * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
-      * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool.
-
-
-</details>
-</p>
-
-____________________________
-
-<details><summary> tforeach </summary>
-<p>
-
-```
-tforeach(f, A::AbstractArray...;
-         nchunks::Int = nthreads(),
-         split::Symbol = :batch,
-         schedule::Symbol =:dynamic) :: Nothing
-```
-
-A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on multiple parallel tasks, and return `nothing`, i.e. it is the parallel equivalent of
-
-```
-for x in A
-    f(x)
+function mc_parallel(N; kw...)
+    M = tmapreduce(+, 1:N; kw...) do i
+        rand()^2 + rand()^2 < 1.0
+    end
+    pi = 4 * M / N
+    return pi
 end
-```
-
-## Keyword arguments:
 
-  * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead.
-  * `split::Symbol` (default `:batch`) is passed to `ChunkSplitters.chunks` to inform it if the data chunks to be worked on should be contiguous (:batch) or shuffled (:scatter). If `scatter` is chosen, then your reducing operator `op` **must** be [commutative](https://en.wikipedia.org/wiki/Commutative_property) in addition to being associative, or you could get incorrect results!
-  * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of
+N = 100_000_000
+mc_parallel(N) # gives, e.g., 3.14159924
 
-      * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
-      * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
-      * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead.
-      * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool.
+using BenchmarkTools
 
+@show Threads.nthreads()          # 5 in this example
 
-</details>
-</p>
-
-____________________________
+@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread
+@btime mc_parallel($N)            # running with all 5 Julia threads
+```
 
-<details><summary> tcollect </summary>
-<p>
+Timings might be something like this:
 
 ```
-tcollect([OutputElementType], gen::Union{AbstractArray, Generator{<:AbstractArray}};
-         nchunks::Int = nthreads(),
-         schedule::Symbol =:dynamic)
+438.394 ms (7 allocations: 624 bytes)
+88.050 ms (37 allocations: 3.02 KiB)
 ```
 
-A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the generator function and inputs. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified.
-
-## Keyword arguments:
-
-  * `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead.
-  * `schedule::Symbol` (default `:dynamic`), determines how the parallel portions of the calculation are scheduled. Options are one of
-
-      * `:dynamic`: generally preferred since it is more flexible and better at load balancing, and won't interfere with other multithreaded functions which may be running on the system.
-      * `:static`: can sometimes be more performant than `:dynamic` when the time it takes to complete a step of the calculation is highly uniform, and no other parallel functions are running at the same time.
-      * `:greedy`: best option for load-balancing slower, uneven computations, but does carry some additional overhead. This schedule only works if the `OutputElementType` argument is provided.
-      * `:interactive`: like `:dynamic` but runs on the high-priority interactive threadpool. This should be used very carefully since tasks on this threadpool should not be allowed to run for a long time without `yield`ing as it can interfere with [heartbeat](https://en.wikipedia.org/wiki/Heartbeat_(computing)) processes running on the interactive threadpool.
-
+(Check out the full [Parallel Monte Carlo](https://juliafolds2.github.io/OhMyThreads.jl/stable/examples/mc/mc/) example if you like.)
 
-</details>
-</p>
+## Documentation
 
-____________________________
+For more information, please check out the [documentation](https://JuliaFolds2.github.io/OhMyThreads.jl/stable) of the latest release (or the [development version](https://JuliaFolds2.github.io/OhMyThreads.jl/dev) if you're curious).
 
diff --git a/docs/make.jl b/docs/make.jl
index 12c8760e..91f343b9 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -12,6 +12,7 @@ makedocs(;
     doctest = false,
     pages = [
         "OhMyThreads" => "index.md",
+        # "Getting Started" => "examples/getting_started.md",
          "Examples" => [
              "Parallel Monte Carlo" => "examples/mc/mc.md",
              "Julia Set" => "examples/juliaset/juliaset.md",
@@ -19,6 +20,7 @@ makedocs(;
         #  "Explanations" => [
         #      "B" => "explanations/B.md",
         #  ],
+        "Translation Guide" => "translation.md",
         "References" => [
             "Public API" => "refs/api.md",
             "Internal" => "refs/internal.md",
diff --git a/docs/src/index.md b/docs/src/index.md
index a302095d..10ea194b 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -1,10 +1,9 @@
 # OhMyThreads.jl
 
-[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations via higher-order functions, with a
-focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) without needing to expose julia's
-[Task](https://docs.julialang.org/en/v1/base/parallel/) model to users.
+[OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl/) is meant to be a simple, unambitious package that provides user-friendly ways of doing task-parallel multithreaded calculations in Julia. Most importantly, it provides an API of higher-order functions, with a
+focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism), that can be used without having to worry much about manual [Task](https://docs.julialang.org/en/v1/base/parallel/) creation.
 
-## Installation
+## Quick Start
 
 The package is registered. Hence, you can simply use
 ```
@@ -12,10 +11,42 @@ The package is registered. Hence, you can simply use
 ```
 to add the package to your Julia environment.
 
-## Noteworthy Alternatives
+### Basic example
 
-* [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl)
-* [Folds.jl](https://github.com/JuliaFolds/Folds.jl)
+```julia
+using OhMyThreads
+
+function mc_parallel(N; kw...)
+    M = tmapreduce(+, 1:N; kw...) do i
+        rand()^2 + rand()^2 < 1.0
+    end
+    pi = 4 * M / N
+    return pi
+end
+
+N = 100_000_000
+mc_parallel(N) # gives, e.g., 3.14159924
+
+using BenchmarkTools
+
+@show Threads.nthreads()          # 5 in this example
+
+@btime mc_parallel($N; nchunks=1) # effectively running with a single Julia thread
+@btime mc_parallel($N)            # running with all 5 Julia threads
+```
+
+Timings might be something like this:
+
+```
+438.394 ms (7 allocations: 624 bytes)
+88.050 ms (37 allocations: 3.02 KiB)
+```
+
+(Check out the full [Parallel Monte Carlo](@ref) example if you like.)
+
+## No Transducers
+
+Unlike most [JuliaFolds2](https://github.com/JuliaFolds2) packages, OhMyThreads.jl is not built off of [Transducers.jl](https://github.com/JuliaFolds2/Transducers.jl), nor is it a building block for Transducers.jl. Rather, it is meant to be a simpler, more maintainable, and more accessible alternative to high-level packages such as [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) or [Folds.jl](https://github.com/JuliaFolds2/Folds.jl).
 
 ## Acknowledgements
 
diff --git a/docs/src/refs/api.md b/docs/src/refs/api.md
index 1d0fafce..01aeb64a 100644
--- a/docs/src/refs/api.md
+++ b/docs/src/refs/api.md
@@ -1,4 +1,4 @@
-# Public API
+# [Public API](@id API)
 
 ## Index
 
diff --git a/docs/src/translation.md b/docs/src/translation.md
new file mode 100644
index 00000000..09da1c5e
--- /dev/null
+++ b/docs/src/translation.md
@@ -0,0 +1,122 @@
+# Translation Guide
+
+This page gives a general overview of how to translate patterns written with the built-in tools of [Base.Threads](https://docs.julialang.org/en/v1/base/multi-threading/) into the [OhMyThreads.jl API](@ref API). Note that it should be seen as a rough guide and (intentionally) is not meant to replace a systematic introduction to OhMyThreads.jl.
+
+## Basics
+
+### `@threads`
+
+```julia
+# Base.Threads
+@threads for i in 1:10
+    println(i)
+end
+```
+
+```julia
+# OhMyThreads
+tforeach(1:10) do i
+    println(i)
+end
+```
+
+#### `:static` scheduling
+
+```julia
+# Base.Threads
+@threads :static for i in 1:10
+    println(i)
+end
+```
+
+```julia
+# OhMyThreads
+tforeach(1:10; schedule=:static) do i
+    println(i)
+end
+```
+
+### `@spawn`
+
+```julia
+# Base.Threads
+@sync for i in 1:10
+    @spawn println(i)
+end
+```
+
+```julia
+# OhMyThreads
+tforeach(1:10; nchunks=10) do i
+    println(i)
+end
+```
+
+## Reduction
+
+Base.Threads has no built-in feature for parallel reductions, so the reduction has to be implemented manually.
+
+```julia
+# Base.Threads: basic manual implementation
+data = rand(10)
+chunks_itr = Iterators.partition(data, cld(length(data), nthreads())) # cld keeps the chunk size >= 1
+tasks = map(chunks_itr) do chunk
+    @spawn reduce(+, chunk)
+end
+reduce(+, fetch.(tasks))
+```
+
+```julia
+# OhMyThreads
+data = rand(10)
+treduce(+, data)
+```
+
+## Mutation
+
+!!! warning
+    Parallel mutation of non-local state, like writing to a shared array, can be the source of correctness errors (e.g. race conditions) and big performance issues (e.g. [false sharing](https://en.wikipedia.org/wiki/False_sharing)). You should carefully consider whether this is necessary or whether the use of task-local storage is the better option.
+
+```julia
+# Base.Threads
+data = rand(10)
+@threads for i in 1:10
+    data[i] = calc(i)
+end
+```
+
+```julia
+# OhMyThreads: Variant 1
+data = rand(10)
+tforeach(eachindex(data)) do i
+    data[i] = calc(i)
+end
+```
+
+```julia
+# OhMyThreads: Variant 2
+data = rand(10)
+tmap!(data, eachindex(data)) do i
+    calc(i)
+end
+```
+
+## Parallel initialization
+
+```julia
+# Base.Threads
+data = Vector{Float64}(undef, 10)
+@threads for i in 1:10
+    data[i] = calc(i)
+end
+```
+
+```julia
+# OhMyThreads: Variant 1
+data = tmap(i->calc(i), 1:10)
+```
+
+```julia
+# OhMyThreads: Variant 2
+data = tcollect(calc(i) for i in 1:10)
+```
\ No newline at end of file
diff --git a/src/OhMyThreads.jl b/src/OhMyThreads.jl
index cd10771f..c0cb8a10 100644
--- a/src/OhMyThreads.jl
+++ b/src/OhMyThreads.jl
@@ -14,21 +14,29 @@ export chunks, treduce, tmapreduce, treducemap, tmap, tmap!, tforeach, tcollect
                schedule::Symbol =:dynamic,
                outputtype::Type = Any)
 
-A multithreaded function like `Base.mapreduce`. Perform a reduction over `A`, applying a single-argument
-function `f` to each element, and then combining them with the two-argument function `op`. `op` **must** be an
-[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense that
-`op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will get undefined
-results.
+A multithreaded function like `Base.mapreduce`. Perform a reduction over `A`, applying a
+single-argument function `f` to each element, and then combining them with the two-argument
+function `op`.
 
-For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing
+Note that `op` **must** be an
+[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense
+that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you
+will get undefined results.
+
+For parallelization, the data is divided into chunks and a parallel task is created per
+chunk.
+
+To see the keyword argument options, check out `??tmapreduce`.
+
+## Example:
 
      tmapreduce(√, +, [1, 2, 3, 4, 5])
 
-is the parallelized version of
+is the parallelized version of `sum(√, [1, 2, 3, 4, 5])` in the form
 
      (√1 + √2) + (√3 + √4) + √5
 
-This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl).
+# Extended help
 
 ## Keyword arguments:
 
@@ -53,22 +61,30 @@ function tmapreduce end
                schedule::Symbol =:dynamic,
                outputtype::Type = Any)
 
-Like `tmapreduce` except the order of the `f` and `op` arguments are switched. This is sometimes convenient with `do`-block notation.
-Perform a reduction over `A`, applying a single-argument function `f` to each element, and then combining them with the two-argument
-function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function,
-in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will
-get undefined results.
+Like `tmapreduce` except the order of the `f` and `op` arguments is switched. This is
+sometimes convenient with `do`-block notation. Perform a reduction over `A`, applying a
+single-argument function `f` to each element, and then combining them with the two-argument
+function `op`.
 
-For a very well known example of `mapreduce`, `sum(f, A)` is equivalent to `mapreduce(f, +, A)`. Doing
+Note that `op` **must** be an
+[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense
+that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you
+will get undefined results.
 
-     treducemap(+, √, [1, 2, 3, 4, 5])
+For parallelization, the data is divided into chunks and a parallel task is created per
+chunk.
 
-is the parallelized version of
+To see the keyword argument options, check out `??treducemap`.
 
-     (√1 + √2) + (√3 + √4) + √5
+## Example:
+
+     treducemap(+, √, [1, 2, 3, 4, 5])
+
+is the parallelized version of `sum(√, [1, 2, 3, 4, 5])` in the form
 
+     (√1 + √2) + (√3 + √4) + √5
 
-This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl).
+# Extended help
 
 ## Keyword arguments:
 
@@ -94,21 +110,28 @@ function treducemap end
             schedule::Symbol =:dynamic,
             outputtype::Type = Any)
 
-A multithreaded function like `Base.reduce`. Perform a reduction over `A` using the two-argument
-function `op`. `op` **must** be an [associative](https://en.wikipedia.org/wiki/Associative_property) function,
-in the sense that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you will
-get undefined results.
+A multithreaded function like `Base.reduce`. Perform a reduction over `A` using the
+two-argument function `op`.
+
+Note that `op` **must** be an
+[associative](https://en.wikipedia.org/wiki/Associative_property) function, in the sense
+that `op(a, op(b, c)) ≈ op(op(a, b), c)`. If `op` is not (approximately) associative, you
+will get undefined results.
 
-For a very well known example of `reduce`, `sum(A)` is equivalent to `reduce(+, A)`. Doing
+For parallelization, the data is divided into chunks and a parallel task is created per
+chunk.
 
-     treduce(+, [1, 2, 3, 4, 5])
+To see the keyword argument options, check out `??treduce`.
 
-is the parallelized version of
+## Example:
 
-     (1 + 2) + (3 + 4) + 5
+        treduce(+, [1, 2, 3, 4, 5])
 
+is the parallelized version of `sum([1, 2, 3, 4, 5])` in the form
 
-This data is divided into chunks to be worked on in parallel using [ChunkSplitters.jl](https://github.com/JuliaFolds2/ChunkSplitters.jl).
+        (1 + 2) + (3 + 4) + 5
+
+# Extended help
 
 ## Keyword arguments:
 
@@ -131,12 +154,26 @@ function treduce end
              split::Symbol = :batch,
              schedule::Symbol =:dynamic) :: Nothing
 
-A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on multiple parallel tasks, and return `nothing`, i.e. it is the parallel equivalent of
+A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` on
+multiple parallel tasks, and return `nothing`. That is, it is the parallel equivalent of
 
     for x in A
         f(x)
     end
 
+For parallelization, the data is divided into chunks and a parallel task is created per
+chunk.
+
+To see the keyword argument options, check out `??tforeach`.
+
+## Example:
+
+        tforeach(1:10) do i
+            println(i^2)
+        end
+
+# Extended help
+
 ## Keyword arguments:
 
 - `nchunks::Int` (default `nthreads()`) is passed to `ChunkSplitters.chunks` to inform it how many pieces of data should be worked on in parallel. Greater `nchunks` typically helps with [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)), but at the expense of creating more overhead.
@@ -150,16 +187,26 @@ A multithreaded function like `Base.foreach`. Apply `f` to each element of `A` o
 function tforeach end
 
 """
-    tmap(f, [OutputElementType], A::AbstractArray...; 
+    tmap(f, [OutputElementType], A::AbstractArray...;
          nchunks::Int = nthreads(),
          split::Symbol = :batch,
          schedule::Symbol =:dynamic)
 
-A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose `i`th element is
-equal to `f(A[i])`. This container is filled in parallel on multiple tasks. The optional argument
-`OutputElementType` will select a specific element type for the returned container, and will generally incur
-fewer allocations than the version where `OutputElementType` is not specified.
+A multithreaded function like `Base.map`. Create a new container `similar` to `A` whose
+`i`th element is equal to `f(A[i])`. This container is filled in parallel: the data is
+divided into chunks and a parallel task is created per chunk.
+
+The optional argument `OutputElementType` will select a specific element type for the
+returned container, and will generally incur fewer allocations than the version where
+`OutputElementType` is not specified.
+
+To see the keyword argument options, check out `??tmap`.
+
+## Example:
 
+        tmap(sin, 1:10)
+
+# Extended help
 
 ## Keyword arguments:
 
@@ -179,8 +226,15 @@ function tmap end
           split::Symbol = :batch,
           schedule::Symbol =:dynamic)
 
-A multithreaded function like `Base.map!`. In parallel on multiple tasks, this function assigns each element
-of `out[i] = f(A[i])` for each index `i` of `A` and `out`.
+A multithreaded function like `Base.map!`. In parallel on multiple tasks, this function
+assigns `out[i] = f(A[i])` for each index `i` of `A` and `out`.
+
+For parallelization, the data is divided into chunks and a parallel task is created per
+chunk.
+
+To see the keyword argument options, check out `??tmap!`.
+
+# Extended help
 
 ## Keyword arguments:
 
@@ -199,8 +253,20 @@ function tmap! end
              nchunks::Int = nthreads(),
              schedule::Symbol =:dynamic)
 
-A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the generator function and
-inputs. The optional argument `OutputElementType` will select a specific element type for the returned container, and will generally incur fewer allocations than the version where `OutputElementType` is not specified.
+A multithreaded function like `Base.collect`. Essentially just calls `tmap` on the
+generator function and inputs.
+
+The optional argument `OutputElementType` will select a specific element type for the
+returned container, and will generally incur fewer allocations than the version where
+`OutputElementType` is not specified.
+
+To see the keyword argument options, check out `??tcollect`.
+
+## Example:
+
+        tcollect(sin(i) for i in 1:10)
+
+# Extended help
 
 ## Keyword arguments: