
Add WithTaskLocals to make the handling of task local values more efficient. #63

Merged
MasonProtter merged 20 commits into master from WithTaskLocalValues on Mar 6, 2024

Conversation

MasonProtter
Member

@MasonProtter MasonProtter commented Mar 1, 2024

What?

This gives a way to avoid accessing the TaskLocalValue on every loop iteration. Instead, you only access the TaskLocalValue once per @spawned task, i.e. it should basically match the way you would manually use task_local_storage.

For workloads where each iteration takes a long time, this is pretty insignificant, since accessing task local storage only takes around 10 ns, but for fast functions it can be really important.
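
For context, here's a minimal sketch of the manual task_local_storage pattern this is meant to match (sum_chunk and the :acc key are just illustrative, not part of the PR):

function sum_chunk(chunk)
    # Look up (or lazily create) this task's own accumulator once,
    # then reuse it for every iteration the task processes.
    acc = get!(() -> Ref{Int}(), task_local_storage(), :acc)::Base.RefValue{Int}
    total = 0
    for n in chunk
        acc[] = 0
        for i in 1:n
            acc[] += i
        end
        total += acc[]
    end
    return total
end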

E.g. suppose we have a task-local accumulator we use in a sum:

julia> using OhMyThreads

julia> function sum_accumulator_macro(ns)
           @tasks for n ∈ ns
               @set reducer = (+)
               @init acc::(Base.RefValue{Int}) = Ref{Int}()
               acc[] = 0
               for i ∈ 1:n
                   acc[] += i
               end
               acc[]
           end
       end;

julia> function sum_accumulator(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           tmapreduce(+, ns) do n
               acc[][] = 0
               for i ∈ 1:n
                   acc[][] += i
               end
               acc[][]
           end
       end;

julia> let ns = rand(1:10, 10000)
           r1 = @btime sum_accumulator_macro($ns)
           r2 = @btime sum_accumulator($ns)
           r1 ≈ r2
       end
  5.907 μs (120 allocations: 11.46 KiB)
  97.294 μs (120 allocations: 11.45 KiB)
true

Now, one can fairly say "Well, you shouldn't have re-accessed acc so many times in the body of sum_accumulator!" That's true, but even then, sum_accumulator_macro is significantly faster:

julia> function sum_accumulator_better(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           tmapreduce(+, ns) do n
           r = acc[]
               r[] = 0
               for i ∈ 1:n
                   r[] += i
               end
               r[]
           end
       end;

julia> let ns = rand(1:10, 10000)
           r1 = @btime sum_accumulator_macro($ns)
           r2 = @btime sum_accumulator_better($ns)
           r1 ≈ r2
       end
  5.951 μs (120 allocations: 11.46 KiB)
  21.080 μs (120 allocations: 11.45 KiB)
true

And here's what these would look like if we do a map instead:

julia> function map_accumulator_macro(ns)
           @tasks for n ∈ ns
               @set collect = true
               @init acc::(Base.RefValue{Int}) = Ref{Int}()
               acc[] = 0
               for i ∈ 1:n
                   acc[] += i
               end
               acc[]
           end
       end;

julia> function map_accumulator(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           tmap(ns) do n
               acc[][] = 0
               for i ∈ 1:n
                   acc[][] += i
               end
               acc[][]
           end
       end;

julia> function map_accumulator_better(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           tmap(ns) do n
           r = acc[]
               r[] = 0
               for i ∈ 1:n
                   r[] += i
               end
               r[]
           end
       end;

julia> let ns = rand(1:10, 10000)
           r1 = @btime map_accumulator_macro($ns)
           r2 = @btime map_accumulator($ns)
           r3 = @btime map_accumulator_better($ns)
           r1 ≈ r2 ≈ r3
       end
  18.064 μs (206 allocations: 245.02 KiB)
  110.309 μs (138 allocations: 242.41 KiB)
  33.824 μs (138 allocations: 242.41 KiB)
true

What about a less cherry-picked example?

This gain becomes milder and milder the longer the function call takes. E.g. if I re-run the benchmarks in https://juliafolds2.github.io/OhMyThreads.jl/v0.4/literate/tls/tls/#Benchmark, I find

julia> let As = [rand(2056, 32) for _ in 1:192],
           Bs = [rand(32, 2056) for _ in 1:192]
           r1 = @btime matmulsums_manual($As, $Bs)    seconds=10
           r2 = @btime matmulsums_tlv($As, $Bs)       seconds=10
           r3 = @btime matmulsums_tlv_macro($As, $Bs) seconds=10
           r1 ≈ r2 ≈ r3
       end
  1.061 s (124 allocations: 387.03 MiB)
  1.082 s (148 allocations: 387.02 MiB)
  1.058 s (207 allocations: 387.03 MiB)
true

So there's only a very small difference here, though it brings things more in line with the manual case, as expected. What I find unexpected is that there are more allocations in this case; I'm not sure exactly what that's about, but I think it has to do with the map rewrapping step I had to do.

Observe that if we turn these into a reduction, we get:

julia> function sum_matmulsums_tlv(As, Bs; kwargs...)
           N = size(first(As), 1)
           tlv = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
           tmapreduce(+, As, Bs; kwargs...) do A, B
               C = tlv[]
               mul!(C, A, B)
               sum(C)
           end
       end;

julia> function sum_matmulsums_tlv_macro(As, Bs; kwargs...)
           N = size(first(As), 1)
           @tasks for i in eachindex(As, Bs)
               @set reducer = (+)
               @init C::Matrix{Float64} = Matrix{Float64}(undef, N, N)
               mul!(C, As[i], Bs[i])
               sum(C)
           end
       end;

julia> let As = [rand(2056, 32) for _ in 1:192],
           Bs = [rand(32, 2056) for _ in 1:192]
           r2 = @btime sum_matmulsums_tlv($As, $Bs)       seconds=10
           r3 = @btime sum_matmulsums_tlv_macro($As, $Bs) seconds=10
           r2 ≈ r3
       end
  1.062 s (293 allocations: 387.03 MiB)
  1.056 s (134 allocations: 387.02 MiB)
true

which has fewer allocations.

How does this work?

The key here is a new closure type called WithTaskLocals (bikeshedding welcome). Essentially, if you write

TLV{T} = TaskLocalValue{T}
f = WithTaskLocals((TLV{Int}(() -> 1), TLV{Int}(() -> 2))) do (x, y)
    z -> (x + y)/z
end

then that creates a closure object capturing the TaskLocalValues, which is equivalent to

g = let x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)
    z -> let x = x[], y = y[]
        (x + y)/z
    end
end

However, the main difference is that you can call promise_task_local on a
WithTaskLocals closure to turn it into something equivalent to

let x=x[], y=y[]
    z -> (x + y)/z
end

which doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course loses the safety advantages of TaskLocalValue, so you should never call f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local from a different task or thread, you'll hit a race condition.
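
To make the hazard concrete, here's a hypothetical sketch of the misuse to avoid (some_library_function is made up; f is the closure from above):

f_local = promise_task_local(f)  # pins x[] and y[] for the *current* task

# BAD: if some_library_function spawns its own tasks and calls f_local from
# them, those tasks all share this task's locals -- a data race.
some_library_function(f_local)

# OK: calling f_local only from the task that made the promise is safe.
f_local(3.0)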

However, we can take advantage of the structure of mapreduce calls: the @tasks macro builds up WithTaskLocals objects and passes them to tmapreduce and tmap, which then essentially do @spawn promise_task_local(f)(args...). At that point we know f is actually being called, so it's safe to make the promise.
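
For illustration, here's a minimal sketch of that safe pattern (not the actual tmapreduce internals; spawn_over_chunks and the chunking are made up, and I'm assuming promise_task_local is reachable from OhMyThreads as in this PR):

using Base.Threads: @spawn
using OhMyThreads: promise_task_local

function spawn_over_chunks(f, chunks)
    tasks = map(chunks) do chunk
        @spawn begin
            # Safe: the promise is made *inside* the task that will call f,
            # so the pinned task-local values belong to this task alone.
            f_local = promise_task_local(f)
            mapreduce(f_local, +, chunk)
        end
    end
    sum(fetch, tasks)
end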

Can we get these performance advantages without using the @tasks macro?

You betcha, but it's kinda ugly:

julia> function sum_accumulator_more_better(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           f = OhMyThreads.WithTaskLocals((acc,)) do (acc,)
               function f(n)
                   acc[] = 0
                   for i ∈ 1:n
                       acc[] += i
                   end
                   acc[]
               end
           end
           tmapreduce(f, +, ns)
       end;

julia> let ns = rand(1:10, 10000)
           r1 = @btime sum_accumulator_macro($ns)
           r2 = @btime sum_accumulator($ns)
           r3 = @btime sum_accumulator_better($ns)
           r4 = @btime sum_accumulator_more_better($ns)
           r1 ≈ r2 ≈ r3 ≈ r4
       end
  4.996 μs (119 allocations: 11.40 KiB)
  98.105 μs (119 allocations: 11.38 KiB)
  26.300 μs (119 allocations: 11.38 KiB)
  4.993 μs (119 allocations: 11.38 KiB)
true

Todo

  • Docs
  • Should WithTaskLocals be exported?
  • Should WithTaskLocals be upstreamed to TaskLocalValues.jl?
  • Any API/naming concerns?

@vchuravy
Member

vchuravy commented Mar 2, 2024

One thing we can (and eventually should) do is apply the optimizations we do for ScopedValue to TaskLocalValue: the compiler folds repeated accesses.

@MasonProtter
Member Author

MasonProtter commented Mar 2, 2024

Would that be possible in a package without compiler changes?

@carstenbauer
Member

This looks good to me. Assuming that we can't easily get this as an optimization in the near future, I'm fine with moving on with this. To be sure, this doesn't negatively affect the performance if no TLVs are used (because promise_task_local is just the identity in this case), right? (What about maybe_rewrap in this case?)

To your questions (in the OP):

  • Should WithTaskLocals be exported?

I'd say no. We don't even export TaskLocalValue itself. Also, it's a (rather cumbersome) optimization to do manually, and I think it's fine that people have to access it explicitly if they want to go through these hoops (most people won't, I guess).

  • Should WithTaskLocals be upstreamed to TaskLocalValues.jl?

Maybe. I don't have a strong opinion here and would leave this decision to Valentin.

  • Any API/naming concerns?

I think the API is fine. I also think the name is fine but, since we are pretty verbose here anyways, we could also make it WithTaskLocalValues to align with TaskLocalValues exactly. It's a minor point for me though.

@MasonProtter
Member Author

This looks good to me. Assuming that we can't easily get this as an optimization in the near future, I'm fine with moving on with this. To be sure, this doesn't negatively affect the performance if no TLVs are used (because promise_task_local is just the identity in this case), right? (What about maybe_rewrap in this case?)

Yeah, that's right: they're both identity ops if no TLVs are used.

function maybe_rewrap(g::G, f::F) where {G, F}
    g(f)
end

"""
    maybe_rewrap(g, f)

Takes a closure `g(f)`, and if `f` is a `WithTaskLocals`, we're going
to unwrap `f` and delegate its `TaskLocalValues` to `g`.
This should always be equivalent to just calling `g(f)`.
"""
function maybe_rewrap(g::G, f::WithTaskLocals{F}) where {G, F}
    (; inner_func, tasklocals) = f
    WithTaskLocals(tasklocals) do vals
        f = inner_func(vals)
        g(f)
    end
end
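
As a hypothetical usage sketch (the wrapping closure here is made up): given the f from earlier, you can wrap it with extra behavior without losing its task-local structure:

# g receives the plain inner function and returns a new one; if f is a
# WithTaskLocals, the result is rewrapped so it still carries the TLVs.
wrapped = maybe_rewrap(f) do inner
    z -> 2 * inner(z)
end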

I think the API is fine. I also think the name is fine but, since we are pretty verbose here anyways, we could also make it WithTaskLocalValues to align with TaskLocalValues exactly. It's a minor point for me though.

I actually originally named it that, but it was so annoying to type that I shortened it.

@MasonProtter MasonProtter changed the title Add WithTaskLocalValues to make the handling of task local values more efficient. Add WithTaskLocals to make the handling of task local values more efficient. Mar 5, 2024
@MasonProtter
Member Author

Okay, this should be ready to go so long as you approve of the changes I made to the task local storage docs, @carstenbauer.

@carstenbauer
Member

I'll take a look later today

Member

@carstenbauer carstenbauer left a comment


Apart from the minor formatting bug, LGTM.

src/types.jl
@MasonProtter MasonProtter merged commit 0f9d61b into master Mar 6, 2024
10 checks passed
@MasonProtter MasonProtter deleted the WithTaskLocalValues branch March 6, 2024 14:27