
Add WithTaskLocals to make the handling of task local values more efficient. #63

Merged
MasonProtter merged 20 commits into master from WithTaskLocalValues on Mar 6, 2024

Conversation

MasonProtter
Member

@MasonProtter MasonProtter commented Mar 1, 2024

What?

This gives a way to avoid accessing the TaskLocalValue on every loop iteration. Instead, you only access the TaskLocalValue once per @spawned task, i.e. it should basically match the way you would manually use task_local_storage.

For workloads where each iteration takes a long time, this is pretty insignificant, since accessing task local storage only takes around 10 ns, but for fast functions it can be really important.
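
For context, here's a minimal sketch of the manual task_local_storage pattern this is meant to match (sum_chunk and the :acc key are just illustrative, not part of the PR):

function sum_chunk(chunk)
    # Look up (or lazily create) this task's own accumulator once,
    # then reuse it for every iteration the task processes.
    acc = get!(() -> Ref{Int}(), task_local_storage(), :acc)::Base.RefValue{Int}
    total = 0
    for n in chunk
        acc[] = 0
        for i in 1:n
            acc[] += i
        end
        total += acc[]
    end
    return total
end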

E.g. suppose we have a task-local accumulator we use in a sum:

julia> using OhMyThreads

julia> function sum_accumulator_macro(ns)
           @tasks for n ∈ ns
               @set reducer = (+)
               @init acc::(Base.RefValue{Int}) = Ref{Int}()
               acc[] = 0
               for i ∈ 1:n
                   acc[] += i
               end
               acc[]
           end
       end;

julia> function sum_accumulator(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           tmapreduce(+, ns) do n
               acc[][] = 0
               for i ∈ 1:n
                   acc[][] += i
               end
               acc[][]
           end
       end;

julia> let ns = rand(1:10, 10000)
           r1 = @btime sum_accumulator_macro($ns)
           r2 = @btime sum_accumulator($ns)
           r1 ≈ r2
       end
  5.907 μs (120 allocations: 11.46 KiB)
  97.294 μs (120 allocations: 11.45 KiB)
true

Now, one can fairly say "Well, you shouldn't have re-accessed acc so many times in the body of sum_accumulator!" That's true, but even then, sum_accumulator_macro is significantly faster:

julia> function sum_accumulator_better(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           tmapreduce(+, ns) do n
           r = acc[]
               r[] = 0
               for i ∈ 1:n
                   r[] += i
               end
               r[]
           end
       end;

julia> let ns = rand(1:10, 10000)
           r1 = @btime sum_accumulator_macro($ns)
           r2 = @btime sum_accumulator_better($ns)
           r1 ≈ r2
       end
  5.951 μs (120 allocations: 11.46 KiB)
  21.080 μs (120 allocations: 11.45 KiB)
true

And here's what these would look like if we do a map instead:

julia> function map_accumulator_macro(ns)
           @tasks for n ∈ ns
               @set collect = true
               @init acc::(Base.RefValue{Int}) = Ref{Int}()
               acc[] = 0
               for i ∈ 1:n
                   acc[] += i
               end
               acc[]
           end
       end;

julia> function map_accumulator(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           tmap(ns) do n
               acc[][] = 0
               for i ∈ 1:n
                   acc[][] += i
               end
               acc[][]
           end
       end;

julia> function map_accumulator_better(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           tmap(ns) do n
           r = acc[]
               r[] = 0
               for i ∈ 1:n
                   r[] += i
               end
               r[]
           end
       end;

julia> let ns = rand(1:10, 10000)
           r1 = @btime map_accumulator_macro($ns)
           r2 = @btime map_accumulator($ns)
           r3 = @btime map_accumulator_better($ns)
           r1 ≈ r2 ≈ r3
       end
  18.064 μs (206 allocations: 245.02 KiB)
  110.309 μs (138 allocations: 242.41 KiB)
  33.824 μs (138 allocations: 242.41 KiB)
true

What about a less cherry-picked example?

This gain becomes milder and milder the longer the function call takes. E.g. if I re-run the benchmarks in https://juliafolds2.github.io/OhMyThreads.jl/v0.4/literate/tls/tls/#Benchmark, I find

julia> let As = [rand(2056, 32) for _ in 1:192],
           Bs = [rand(32, 2056) for _ in 1:192]
           r1 = @btime matmulsums_manual($As, $Bs)    seconds=10
           r2 = @btime matmulsums_tlv($As, $Bs)       seconds=10
           r3 = @btime matmulsums_tlv_macro($As, $Bs) seconds=10
           r1 ≈ r2 ≈ r3
       end
  1.061 s (124 allocations: 387.03 MiB)
  1.082 s (148 allocations: 387.02 MiB)
  1.058 s (207 allocations: 387.03 MiB)
true

So there's only a very small difference here, though it brings things more in line with the manual case, as expected. What I find unexpected is that there are more allocations in this case; I'm not sure exactly what that's about, but I think it has to do with the map rewrapping step I had to do.

Observe that if we turn these into a reduction, we get:

julia> function sum_matmulsums_tlv(As, Bs; kwargs...)
           N = size(first(As), 1)
           tlv = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, N, N))
           tmapreduce(+, As, Bs; kwargs...) do A, B
               C = tlv[]
               mul!(C, A, B)
               sum(C)
           end
       end;

julia> function sum_matmulsums_tlv_macro(As, Bs; kwargs...)
           N = size(first(As), 1)
           @tasks for i in eachindex(As, Bs)
               @set reducer = (+)
               @init C::Matrix{Float64} = Matrix{Float64}(undef, N, N)
               mul!(C, As[i], Bs[i])
               sum(C)
           end
       end;

julia> let As = [rand(2056, 32) for _ in 1:192],
           Bs = [rand(32, 2056) for _ in 1:192]
           r2 = @btime sum_matmulsums_tlv($As, $Bs)       seconds=10
           r3 = @btime sum_matmulsums_tlv_macro($As, $Bs) seconds=10
           r2 ≈ r3
       end
  1.062 s (293 allocations: 387.03 MiB)
  1.056 s (134 allocations: 387.02 MiB)
true

which has fewer allocations.

How does this work?

The key here is a new closure type called WithTaskLocals (bikeshedding welcome). Essentially, if you write

TLV{T} = TaskLocalValue{T}
f = WithTaskLocals((TLV{Int}(() -> 1), TLV{Int}(() -> 2))) do (x, y)
    z -> (x + y)/z
end

then that creates a closure object capturing the TaskLocalValues, which is equivalent to

g = let x = TLV{Int}(() -> 1), y = TLV{Int}(() -> 2)
    z -> let x = x[], y = y[]
        (x + y)/z
    end
end

However, the main difference is that you can call promise_task_local on a
WithTaskLocals closure to turn it into something equivalent to

let x=x[], y=y[]
    z -> (x + y)/z
end

which doesn't have the overhead of accessing the task_local_storage each time the closure is called. This of course loses the safety advantages of TaskLocalValue, so you should never call f_local = promise_task_local(f) and then pass f_local to some unknown function, because if that unknown function calls f_local from a different task or thread, you'll hit a race condition.
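
To make the hazard concrete, here's a hypothetical sketch of the misuse to avoid (some_library_function is made up; f is the closure from above):

f_local = promise_task_local(f)  # pins x[] and y[] for the *current* task

# BAD: if some_library_function spawns its own tasks and calls f_local from
# them, those tasks all share this task's locals -- a data race.
some_library_function(f_local)

# OK: calling f_local only from the task that made the promise is safe.
f_local(3.0)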

However, we can take advantage of the structure of mapreduce calls: the @tasks macro builds up WithTaskLocals objects and passes them to tmapreduce and tmap, which then essentially do @spawn promise_task_local(f)(args...). At that point we know f is actually being called, so it's safe to make the promise.
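
For illustration, here's a minimal sketch of that safe pattern (not the actual tmapreduce internals; spawn_over_chunks and the chunking are made up, and I'm assuming promise_task_local is reachable from OhMyThreads as in this PR):

using Base.Threads: @spawn
using OhMyThreads: promise_task_local

function spawn_over_chunks(f, chunks)
    tasks = map(chunks) do chunk
        @spawn begin
            # Safe: the promise is made *inside* the task that will call f,
            # so the pinned task-local values belong to this task alone.
            f_local = promise_task_local(f)
            mapreduce(f_local, +, chunk)
        end
    end
    sum(fetch, tasks)
end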

Can we get these performance advantages without using the @tasks macro?

You betcha, but it's kinda ugly:

julia> function sum_accumulator_more_better(ns)
           acc = OhMyThreads.TaskLocalValue{Base.RefValue{Int}}(() -> Ref{Int}())
           f = OhMyThreads.WithTaskLocals((acc,)) do (acc,)
               function f(n)
                   acc[] = 0
                   for i ∈ 1:n
                       acc[] += i
                   end
                   acc[]
               end
           end
           tmapreduce(f, +, ns)
       end;

julia> let ns = rand(1:10, 10000)
           r1 = @btime sum_accumulator_macro($ns)
           r2 = @btime sum_accumulator($ns)
           r3 = @btime sum_accumulator_better($ns)
           r4 = @btime sum_accumulator_more_better($ns)
           r1 ≈ r2 ≈ r3 ≈ r4
       end
  4.996 μs (119 allocations: 11.40 KiB)
  98.105 μs (119 allocations: 11.38 KiB)
  26.300 μs (119 allocations: 11.38 KiB)
  4.993 μs (119 allocations: 11.38 KiB)
true

Todo

  • Docs
  • Should WithTaskLocals be exported?
  • Should WithTaskLocals be upstreamed to TaskLocalValues.jl?
  • Any API/naming concerns?

@vchuravy
Member

vchuravy commented Mar 2, 2024

One thing we can (and eventually should) do is apply the optimizations we do for ScopedValue to TaskLocalValue: the compiler folds repeated accesses.

@MasonProtter
Member Author

MasonProtter commented Mar 2, 2024

Would that be possible in a package without compiler changes?

@carstenbauer
Member

This looks good to me. Assuming that we can't easily get this as an optimization in the near future, I'm fine with moving on with this. To be sure, this doesn't negatively affect the performance if no TLVs are used (because promise_task_local is just the identity in this case), right? (What about maybe_rewrap in this case?)

To your questions (in the OP):

  • Should WithTaskLocals be exported?

I'd say no. We don't even export TaskLocalValue itself. Also, it's a (rather cumbersome) optimization to do manually, and I think it's fine that people have to access it explicitly if they want to go through these hoops (most people won't, I guess).

  • Should WithTaskLocals be upstreamed to TaskLocalValues.jl?

Maybe. I don't have a strong opinion here and would leave this decision to Valentin.

  • Any API/naming concerns?

I think the API is fine. I also think the name is fine but, since we are pretty verbose here anyways, we could also make it WithTaskLocalValues to align with TaskLocalValues exactly. It's a minor point for me though.

@MasonProtter
Member Author

This looks good to me. Assuming that we can't easily get this as an optimization in the near future, I'm fine with moving on with this. To be sure, this doesn't negatively affect the performance if no TLVs are used (because promise_task_local is just the identity in this case), right? (What about maybe_rewrap in this case?)

Yeah, that's right: they're both identity ops if no TLVs are used.

function maybe_rewrap(g::G, f::F) where {G, F}
    g(f)
end

"""
    maybe_rewrap(g, f)

Takes a closure `g(f)`, and if `f` is a `WithTaskLocals`, we're going
to unwrap `f` and delegate its `TaskLocalValues` to `g`.
This should always be equivalent to just calling `g(f)`.
"""
function maybe_rewrap(g::G, f::WithTaskLocals{F}) where {G, F}
    (; inner_func, tasklocals) = f
    WithTaskLocals(tasklocals) do vals
        f = inner_func(vals)
        g(f)
    end
end
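
As a hypothetical usage sketch (the wrapping closure here is made up): given the f from earlier, you can wrap it with extra behavior without losing its task-local structure:

# g receives the plain inner function and returns a new one; if f is a
# WithTaskLocals, the result is rewrapped so it still carries the TLVs.
wrapped = maybe_rewrap(f) do inner
    z -> 2 * inner(z)
end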

I think the API is fine. I also think the name is fine but, since we are pretty verbose here anyways, we could also make it WithTaskLocalValues to align with TaskLocalValues exactly. It's a minor point for me though.

I actually originally named it that, but it was so annoying to type that I shortened it.

@MasonProtter MasonProtter changed the title Add WithTaskLocalValues to make the handling of task local values more efficient. Add WithTaskLocals to make the handling of task local values more efficient. Mar 5, 2024
@MasonProtter
Member Author

Okay, this should be ready to go so long as you approve of the changes I made to the task local storage docs, @carstenbauer.

@carstenbauer
Member

I'll take a look later today

Member

@carstenbauer carstenbauer left a comment


Apart from the minor formatting bug, LGTM.

src/types.jl
@MasonProtter MasonProtter merged commit 0f9d61b into master Mar 6, 2024
10 checks passed
@MasonProtter MasonProtter deleted the WithTaskLocalValues branch March 6, 2024 14:27