Commit
Add WithTaskLocals to make the handling of task local values more efficient. (#63)

* add `WithTaskLocalValues` to efficiently close over TaskLocalValues

* add `WithTaskLocalValues` to efficiently close over TaskLocalValues

* mention that `@init` is now dereferenced once per task

* handle `nothing` case

* fix for mapping functions

* docs

* add some notes

* rename WithTaskLocalValues -> WithTaskLocals

* add tests

* add version log

* fix

* Update macro_impl.jl

* Update macro_impl.jl

* Update macro_impl.jl

* Update macro_impl.jl

* update docs

* Update CHANGELOG.md

* fix docstring
MasonProtter authored Mar 6, 2024
1 parent 0ed78b4 commit 0f9d61b
Showing 9 changed files with 278 additions and 109 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,8 @@ OhMyThreads.jl Changelog
Version 0.5.0
-------------

- ![Feature][badge-feature] Added `OhMyThreads.WithTaskLocals`, which represents a closure over `TaskLocalValues` but can have those values materialized as an optimization (using `OhMyThreads.promise_task_local`).
- ![Enhancement][badge-enhancement] Made `@tasks` use `OhMyThreads.WithTaskLocals` automatically as an optimization.
- ![Feature][badge-feature] In the case `nchunks > nthreads()`, the `StaticScheduler` now distributes chunks in a round-robin fashion (instead of either implicitly decreasing `nchunks` to `nthreads()` or throwing an error).
- ![Feature][badge-feature] The `DynamicScheduler` (default) and the `StaticScheduler` now support a `chunksize` argument to specify the desired size of chunks instead of the number of chunks (`nchunks`). Note that `chunksize` and `nchunks` are mutually exclusive.
- ![Feature][badge-feature] `@set init = ...` may now be used to specify an initial value for a reduction (only has an effect in conjunction with `@set reducer=...` and triggers a warning otherwise).
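
For orientation, here is a sketch of how the `chunksize` and `@set init` features look in use (the values and the `scheduler` keyword spelling are illustrative; see the docs for the exact APIs):

```julia
using OhMyThreads: tmapreduce, DynamicScheduler, @tasks

# chunks of roughly 16 elements instead of a fixed number of chunks
tmapreduce(sin, +, 1:10_000; scheduler = DynamicScheduler(; chunksize = 16))

# an explicit initial value for a reduction; only meaningful together with a reducer
@tasks for i in 1:100
    @set reducer = +
    @set init = 0.0
    sin(i)
end
```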
34 changes: 20 additions & 14 deletions docs/src/literate/tls/tls.jl
@@ -36,8 +36,8 @@ end
#
# For later comparison, we generate some random input data and store the result.

As = [rand(2056, 32) for _ in 1:192]
Bs = [rand(32, 2056) for _ in 1:192]
As = [rand(256, 16) for _ in 1:768]
Bs = [rand(16, 256) for _ in 1:768]

res = matmulsums(As, Bs);

@@ -183,7 +183,9 @@ end
res_tlv_macro = matmulsums_tlv_macro(As, Bs)
res ≈ res_tlv_macro

# Here, `@local` simply expands to the explicit pattern around `TaskLocalValue` above.
# Here, `@local` expands to a pattern similar to the `TaskLocalValue` one above, although it
# carries some optimizations (see [`OhMyThreads.WithTaskLocals`](@ref)) which can make accessing task
# local values more efficient in loops which take on the order of 100ns to complete.
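#
# For the curious, here is a rough, hand-written sketch of that pattern. This is
# *not* the exact code `@local` generates, and `matmulsums_wtl` is a name made up
# for this illustration:

using OhMyThreads: TaskLocalValue, WithTaskLocals, promise_task_local, chunks, @spawn
using LinearAlgebra: mul!
using Base.Threads: nthreads

function matmulsums_wtl(As, Bs)
    N = size(first(As), 1)
    tlv = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
    ## `f` closes over the task-local buffer: the do-block receives the tuple of
    ## materialized values and returns the inner closure.
    f = WithTaskLocals((tlv,)) do (C,)
        (A, B) -> (mul!(C, A, B); sum(C))
    end
    tasks = map(chunks(As; n = 2 * nthreads())) do idcs
        @spawn begin
            ## materialize the task-local values once per task, not once per call
            g = promise_task_local(f)
            map(i -> g(As[i], Bs[i]), idcs)
        end
    end
    mapreduce(fetch, vcat, tasks)
end

res_wtl = matmulsums_wtl(As, Bs)
res ≈ res_wtl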
#
#
# ### Benchmark
@@ -196,9 +198,13 @@ using BenchmarkTools
@show nthreads()

@btime matmulsums($As, $Bs);
sleep(2) #hide
@btime matmulsums_naive($As, $Bs);
sleep(2) #hide
@btime matmulsums_manual($As, $Bs);
sleep(2) #hide
@btime matmulsums_tlv($As, $Bs);
sleep(2) #hide
@btime matmulsums_tlv_macro($As, $Bs);

# As we can see, `matmulsums_tlv` (and `matmulsums_tlv_macro`) isn't only convenient
@@ -249,8 +255,8 @@ function matmulsums_perthread_naive(As, Bs)
end

## non uniform workload
As_nu = [rand(2056, isqrt(i)^2) for i in 1:192];
Bs_nu = [rand(isqrt(i)^2, 2056) for i in 1:192];
As_nu = [rand(256, isqrt(i)^2) for i in 1:768];
Bs_nu = [rand(isqrt(i)^2, 256) for i in 1:768];
res_nu = matmulsums(As_nu, Bs_nu);

res_pt_naive = matmulsums_perthread_naive(As_nu, Bs_nu)
@@ -366,7 +372,7 @@ res_nu ≈ res_pt_channel
using OhMyThreads: tmapreduce
function matmulsums_perthread_channel_flipped(As, Bs; ntasks = nthreads())
    N = size(first(As), 1)
    chnl = Channel() do chnl
    chnl = Channel{Int}(length(As); spawn=true) do chnl
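        ## a buffered (`length(As)` slots), concretely typed channel lets the producer
        ## fill it without blocking and makes `take!` type-stable; `spawn=true` runs
        ## the producer on its own task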
        for i in 1:length(As)
            put!(chnl, i)
end
@@ -401,20 +407,20 @@ sort(res_nu) ≈ sort(res_channel_flipped)
# give [Bumper.jl](https://github.com/MasonProtter/Bumper.jl) a try. Essentially, it
# allows you to *bring your own stacks*, that is, task-local bump allocators which you can
# dynamically allocate memory to, and reset them at the end of a code block, just like
# Julia's stack.
# Julia's stack.
# Be warned though that Bumper.jl is (1) a rather young package with (likely) some bugs
# and (2) can easily lead to segfaults when used incorrectly. It can make sense to use it
# though if you can live with the risk and really can't avoid allocating many (many) times
# on each parallel task. For our example, this isn't the case but let's nonetheless how one
# would use Bumper.jl here.
# and (2) can easily lead to segfaults when used incorrectly. If you can live with the
# risk, Bumper.jl is especially useful for cases where we don't know ahead of time how
# large a matrix to pre-allocate, and even more useful if we want to do many intermediate
# allocations on the task, not just one. For our example, this isn't the case, but let's
# nonetheless see how one would use Bumper.jl here.

using Bumper
using StrideArrays # makes things a little bit faster

function matmulsums_bumper(As, Bs)
    N = size(first(As), 1)
    tmap(As, Bs) do A, B
        @no_escape begin # promising that no memory will escape
            N = size(A, 1)
            C = @alloc(Float64, N, N) # from bump allocator (fake "stack")
            mul!(C, A, B)
            sum(C)
@@ -423,7 +429,7 @@ function matmulsums_bumper(As, Bs)
end

res_bumper = matmulsums_bumper(As, Bs);
res ≈ res_bumper
sort(res_nu) ≈ sort(res_bumper)

@btime matmulsums_bumper($As, $Bs);

68 changes: 36 additions & 32 deletions docs/src/literate/tls/tls.md
@@ -46,8 +46,8 @@ temporary buffer, the output matrix `C`. This is to avoid the temporary allocati
For later comparison, we generate some random input data and store the result.

````julia
As = [rand(2056, 32) for _ in 1:192]
Bs = [rand(32, 2056) for _ in 1:192]
As = [rand(256, 16) for _ in 1:768]
Bs = [rand(16, 256) for _ in 1:768]

res = matmulsums(As, Bs);
````
@@ -238,7 +238,9 @@ res ≈ res_tlv_macro
true
````

Here, `@local` simply expands to the explicit pattern around `TaskLocalValue` above.
Here, `@local` expands to a pattern similar to the `TaskLocalValue` one above, although it
carries some optimizations (see [`OhMyThreads.WithTaskLocals`](@ref)) which can make accessing task
local values more efficient in loops which take on the order of 100ns to complete.


### Benchmark
@@ -260,11 +262,11 @@ using BenchmarkTools

````
nthreads() = 10
1.461 s (3 allocations: 32.25 MiB)
956.497 ms (539 allocations: 6.05 GiB)
749.799 ms (200 allocations: 645.04 MiB)
743.885 ms (236 allocations: 645.04 MiB)
746.067 ms (237 allocations: 645.04 MiB)
49.077 ms (3 allocations: 518.17 KiB)
32.658 ms (1691 allocations: 384.08 MiB)
9.513 ms (200 allocations: 10.08 MiB)
9.588 ms (236 allocations: 10.05 MiB)
9.650 ms (239 allocations: 10.05 MiB)
````

Expand All @@ -289,8 +291,8 @@ using OhMyThreads: DynamicScheduler, StaticScheduler
````

````
878.870 ms (124 allocations: 322.52 MiB)
888.337 ms (122 allocations: 322.52 MiB)
9.561 ms (124 allocations: 5.03 MiB)
9.618 ms (124 allocations: 5.03 MiB)
````

@@ -324,8 +326,8 @@ function matmulsums_perthread_naive(As, Bs)
end

# non uniform workload
As_nu = [rand(2056, isqrt(i)^2) for i in 1:192];
Bs_nu = [rand(isqrt(i)^2, 2056) for i in 1:192];
As_nu = [rand(256, isqrt(i)^2) for i in 1:768];
Bs_nu = [rand(isqrt(i)^2, 256) for i in 1:768];
res_nu = matmulsums(As_nu, Bs_nu);

res_pt_naive = matmulsums_perthread_naive(As_nu, Bs_nu)
@@ -444,13 +446,13 @@ of which gives us dynamic load balancing.
````

````
1.012 s (124 allocations: 322.52 MiB)
990.424 ms (105 allocations: 322.52 MiB)
998.003 ms (112 allocations: 322.52 MiB)
876.152 ms (235 allocations: 645.04 MiB)
913.288 ms (183 allocations: 322.53 MiB)
930.444 ms (1116 allocations: 3.15 GiB)
832.168 ms (744 allocations: 322.58 MiB)
149.095 ms (124 allocations: 5.03 MiB)
175.355 ms (107 allocations: 5.02 MiB)
148.470 ms (112 allocations: 5.02 MiB)
137.638 ms (235 allocations: 10.05 MiB)
135.293 ms (183 allocations: 5.04 MiB)
124.591 ms (1116 allocations: 50.13 MiB)
124.716 ms (744 allocations: 5.10 MiB)
````

Expand All @@ -469,7 +471,7 @@ a limited number of tasks (e.g. `nthreads()`) with task-local buffers.
using OhMyThreads: tmapreduce
function matmulsums_perthread_channel_flipped(As, Bs; ntasks = nthreads())
    N = size(first(As), 1)
    chnl = Channel() do chnl
    chnl = Channel{Int}(length(As); spawn=true) do chnl
        for i in 1:length(As)
            put!(chnl, i)
        end
@@ -508,9 +510,9 @@ Quick benchmark:
````

````
954.269 ms (726 allocations: 322.54 MiB)
927.246 ms (860 allocations: 645.06 MiB)
929.689 ms (1746 allocations: 3.15 GiB)
121.715 ms (163 allocations: 5.07 MiB)
122.457 ms (267 allocations: 10.11 MiB)
122.374 ms (1068 allocations: 50.37 MiB)
````

Expand All @@ -522,19 +524,19 @@ allows you to *bring your own stacks*, that is, task-local bump allocators which
dynamically allocate memory to, and reset them at the end of a code block, just like
Julia's stack.
Be warned though that Bumper.jl is (1) a rather young package with (likely) some bugs
and (2) can easily lead to segfaults when used incorrectly. It can make sense to use it
though if you can live with the risk and really can't avoid allocating many (many) times
on each parallel task. For our example, this isn't the case but let's nonetheless how one
would use Bumper.jl here.
and (2) can easily lead to segfaults when used incorrectly. If you can live with the
risk, Bumper.jl is especially useful for cases where we don't know ahead of time how
large a matrix to pre-allocate, and even more useful if we want to do many intermediate
allocations on the task, not just one. For our example, this isn't the case, but let's
nonetheless see how one would use Bumper.jl here.

````julia
using Bumper
using StrideArrays # makes things a little bit faster

function matmulsums_bumper(As, Bs)
    N = size(first(As), 1)
    tmap(As, Bs) do A, B
        @no_escape begin # promising that no memory will escape
            N = size(A, 1)
            C = @alloc(Float64, N, N) # from bump allocator (fake "stack")
            mul!(C, A, B)
            sum(C)
@@ -543,17 +545,19 @@ function matmulsums_bumper(As, Bs)
end

res_bumper = matmulsums_bumper(As, Bs);
res ≈ res_bumper
sort(res_nu) ≈ sort(res_bumper)

@btime matmulsums_bumper($As, $Bs);
````

````
786.991 ms (275 allocations: 34.50 KiB)
9.865 ms (254 allocations: 50.92 KiB)
````

Note that the benchmark is lying here about the total memory allocation, because it doesn't show the allocation of the task-local bump allocators themselves (the reason is that `SlabBuffer` uses `malloc` directly).
Note that the benchmark is lying here about the total memory allocation,
because it doesn't show the allocation of the task-local bump allocators themselves
(the reason is that `SlabBuffer` uses `malloc` directly).

---

6 changes: 6 additions & 0 deletions docs/src/refs/api.md
@@ -44,3 +44,9 @@ GreedyScheduler
| `OhMyThreads.@fetchfrom` | see [StableTasks.jl](https://github.com/JuliaFolds2/StableTasks.jl) |
| `OhMyThreads.chunks` | see [ChunkSplitters.jl](https://juliafolds2.github.io/ChunkSplitters.jl/dev/references/#ChunkSplitters.chunks) |
| `OhMyThreads.TaskLocalValue` | see [TaskLocalValues.jl](https://github.com/vchuravy/TaskLocalValues.jl) |


```@docs
OhMyThreads.WithTaskLocals
OhMyThreads.promise_task_local
```
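
For orientation, here is a minimal, hand-rolled usage sketch of these two together (the variable names are illustrative; see the `WithTaskLocals` docstring above for the authoritative API):

```julia
using OhMyThreads: TaskLocalValue, WithTaskLocals, promise_task_local

tlv = TaskLocalValue{Vector{Float64}}(() -> zeros(3))

# `f` closes over the task-local buffer: the do-block receives the tuple of
# materialized values and returns the inner closure.
f = WithTaskLocals((tlv,)) do (buf,)
    x -> sum(buf) + x
end

# Inside a task, materialize the task-local values once, then reuse the plain
# closure in a hot loop instead of re-fetching them on every call.
g = promise_task_local(f)
g(1.0)
```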
2 changes: 1 addition & 1 deletion src/OhMyThreads.jl
@@ -10,7 +10,7 @@ const chunks = ChunkSplitters.chunks

using TaskLocalValues: TaskLocalValues
const TaskLocalValue = TaskLocalValues.TaskLocalValue

include("types.jl")
include("functions.jl")
include("macros.jl")
