add bumper to tls doc page

JuliaFolds2 · Feb 29, 2024 · d9fa194
1 parent f860665
commit d9fa194

Showing 3 changed files with 77 additions and 0 deletions.
diff --git a/docs/src/literate/tls/Project.toml b/docs/src/literate/tls/Project.toml
@@ -1,4 +1,6 @@
 [deps]
 BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
+Bumper = "8ce10254-0962-460f-a3d8-1f77fea1446e"
 OhMyThreads = "67456a42-1dca-4109-a031-0a68de7e3ad5"
+StrideArrays = "d1fa6d79-ef01-42a6-86c9-f7c551f8593b"
 ThreadPinning = "811555cd-349b-4f26-b7bc-1f208b848042"
diff --git a/docs/src/literate/tls/tls.jl b/docs/src/literate/tls/tls.jl
@@ -394,3 +394,37 @@ sort(res_nu) ≈ sort(res_channel_flipped)
 @btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu);
 @btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu; ntasks = 2 * nthreads());
 @btime matmulsums_perthread_channel_flipped($As_nu, $Bs_nu; ntasks = 10 * nthreads());
+
+# ## Bumper.jl (only for the brave)
+#
+# If you are bold and want to cut down temporary allocations even more, you can
+# give [Bumper.jl](https://github.com/MasonProtter/Bumper.jl) a try. Essentially, it
+# allows you to *bring your own stacks*, that is, task-local bump allocators from which
+# you can dynamically allocate memory and which you reset at the end of a code block,
+# just like Julia's stack.
+# Be warned, though, that Bumper.jl (1) is a rather young package that (likely) has some
+# bugs and (2) can easily lead to segfaults when used incorrectly. It can make sense to use
+# it, though, if you can live with the risk and really can't avoid allocating many (many)
+# times on each parallel task. For our example, this isn't the case, but let's nonetheless
+# see how one would use Bumper.jl here.
+
+using Bumper
+using StrideArrays # makes things a little bit faster
+
+function matmulsums_bumper(As, Bs)
+    N = size(first(As), 1)
+    tmap(As, Bs) do A, B
+        @no_escape begin # promising that no memory will escape
+            C = @alloc(Float64, N, N) # from bump allocator (fake "stack")
+            mul!(C, A, B)
+            sum(C)
+        end
+    end
+end
+
+res_bumper = matmulsums_bumper(As, Bs);
+res ≈ res_bumper
+
+@btime matmulsums_bumper($As, $Bs);
+
+# Compare this, especially the total allocated memory, to the variants above.
diff --git a/docs/src/literate/tls/tls.md b/docs/src/literate/tls/tls.md
@@ -514,6 +514,47 @@ Quick benchmark:
 
 ````
 
+## Bumper.jl (only for the brave)
+
+If you are bold and want to cut down temporary allocations even more, you can
+give [Bumper.jl](https://github.com/MasonProtter/Bumper.jl) a try. Essentially, it
+allows you to *bring your own stacks*, that is, task-local bump allocators from which
+you can dynamically allocate memory and which you reset at the end of a code block,
+just like Julia's stack.
+Be warned, though, that Bumper.jl (1) is a rather young package that (likely) has some
+bugs and (2) can easily lead to segfaults when used incorrectly. It can make sense to use
+it, though, if you can live with the risk and really can't avoid allocating many (many)
+times on each parallel task. For our example, this isn't the case, but let's nonetheless
+see how one would use Bumper.jl here.
+
+````julia
+using Bumper
+using StrideArrays # makes things a little bit faster
+
+function matmulsums_bumper(As, Bs)
+    N = size(first(As), 1)
+    tmap(As, Bs) do A, B
+        @no_escape begin # promising that no memory will escape
+            C = @alloc(Float64, N, N) # from bump allocator (fake "stack")
+            mul!(C, A, B)
+            sum(C)
+        end
+    end
+end
+
+res_bumper = matmulsums_bumper(As, Bs);
+res ≈ res_bumper
+
+@btime matmulsums_bumper($As, $Bs);
+````
+
+````
+  786.991 ms (275 allocations: 34.50 KiB)
+
+````
+
+Compare this, especially the total allocated memory, to the variants above.
+
 ---
 
 *This page was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*
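
A side note on the `@no_escape` promise used in `matmulsums_bumper`: the sketch below is not part of this commit and only shows the same pattern in isolation. The function name `sum_of_squares` is made up for illustration, and the snippet assumes nothing beyond Bumper.jl being installed. The scratch vector comes from the task-local bump allocator, and only a scalar leaves the block.

````julia
using Bumper

# Hypothetical helper (not from the commit): sums the squares of `x`
# using a temporary buffer taken from Bumper's task-local allocator.
function sum_of_squares(x::Vector{Float64})
    @no_escape begin
        tmp = @alloc(Float64, length(x)) # scratch memory, valid only inside this block
        tmp .= x .^ 2
        sum(tmp) # a plain Float64 may leave the block; `tmp` itself must not
    end
end

sum_of_squares([1.0, 2.0, 3.0])
````

Returning `tmp` instead of `sum(tmp)` would break the promise: the memory behind it is reclaimed when the block ends, which is one way to run into the segfaults mentioned in the warning.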