From 193fa9f25e717ebe0f446d944223e78527ebee5d Mon Sep 17 00:00:00 2001
From: bennibolm <benjamin.bolm@gmx.de>
Date: Tue, 30 Jan 2024 14:40:03 +0100
Subject: [PATCH 1/4] Add section to docs about false sharing

---
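
Note: the padding scheme documented by this patch can be sketched roughly as
below. This is only a minimal illustration under stated assumptions, not the
code from PR #1736: `padding_stride` and `padded_maximum` are hypothetical
names, the 128-byte spacing is taken from the text, and `sizeof` is used to
obtain the element size in bytes. Each task writes only to its own slot, and
consecutive slots are `n` entries apart so that they should not share a
cache line.

```julia
using Base.Threads: @threads, nthreads

# Hypothetical helper: number of entries of type T that span 128 bytes
# (the 128-byte spacing is the assumption stated in the text).
padding_stride(::Type{T}) where {T} = max(1, 128 ÷ sizeof(T))

# Hypothetical example reduction: thread-parallel maximum of a vector.
function padded_maximum(values::Vector{T}) where {T <: Real}
    nt = nthreads()
    n = padding_stride(T)
    # Vector of length n * nthreads(); only every n-th entry is used.
    partial = fill(typemin(T), n * nt)
    @threads for t in 1:nt
        slot = n * (t - 1) + 1  # slots of different tasks are 128 bytes apart
        for i in t:nt:length(values)
            partial[slot] = max(partial[slot], values[i])
        end
    end
    # Unused padding entries stay at typemin(T) and do not affect the result.
    return maximum(partial)
end
```

Indexing the slots by the loop variable `t` rather than by `Threads.threadid()`
is a design choice of this sketch; it keeps the example independent of how
tasks are scheduled onto threads.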
 docs/src/performance.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/docs/src/performance.md b/docs/src/performance.md
index df66f451b79..a43ed40cbc7 100644
--- a/docs/src/performance.md
+++ b/docs/src/performance.md
@@ -267,3 +267,13 @@ requires. It can thus be seen as a proxy for "energy used" and, as an extension,
     timing result, you need to set the analysis interval such that the
     `AnalysisCallback` is invoked at least once during the course of the simulation and
     discard the first PID value.
+
+## Performance issues due to false sharing
+False sharing is a known performance issue for with distrubited caches. It also occured for
+the implementation of a thread parallel bounds checking routine for the subcell IDP limiting
+in [PR #1736](https://github.com/trixi-framework/Trixi.jl/pull/1736).
+After some [experimentation and discussion](https://github.com/trixi-framework/Trixi.jl/pull/1736#discussion_r1423881895)
+it turned out that initializing a vector of length `n * Threads.nthreads()` and only using every
+n-th entry instead of a vector of length `Threads.nthreads()` fixes the problem.
+Since there are no processors with caches over 128B, we use `n = 128B / size(uEltype)`.
+Now, the bounds checking routine of the idp limiting scales as hoped.

From c3d24c9b7d8466bc53fe60d5e789b2a506b5725a Mon Sep 17 00:00:00 2001
From: bennibolm <benjamin.bolm@gmx.de>
Date: Tue, 30 Jan 2024 14:48:42 +0100
Subject: [PATCH 2/4] Fix typos

---
 docs/src/performance.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/src/performance.md b/docs/src/performance.md
index a43ed40cbc7..3261f7aa473 100644
--- a/docs/src/performance.md
+++ b/docs/src/performance.md
@@ -269,11 +269,11 @@ requires. It can thus be seen as a proxy for "energy used" and, as an extension,
     discard the first PID value.
 
 ## Performance issues due to false sharing
-False sharing is a known performance issue for with distrubited caches. It also occured for
+False sharing is a known performance issue for with distributed caches. It also occurred for
 the implementation of a thread parallel bounds checking routine for the subcell IDP limiting
 in [PR #1736](https://github.com/trixi-framework/Trixi.jl/pull/1736).
 After some [experimentation and discussion](https://github.com/trixi-framework/Trixi.jl/pull/1736#discussion_r1423881895)
 it turned out that initializing a vector of length `n * Threads.nthreads()` and only using every
 n-th entry instead of a vector of length `Threads.nthreads()` fixes the problem.
 Since there are no processors with caches over 128B, we use `n = 128B / size(uEltype)`.
-Now, the bounds checking routine of the idp limiting scales as hoped.
+Now, the bounds checking routine of the IDP limiting scales as hoped.

From 6fd8f778f402f1266df34a7e7064aa4d921ab334 Mon Sep 17 00:00:00 2001
From: bennibolm <benjamin.bolm@gmx.de>
Date: Tue, 30 Jan 2024 15:02:04 +0100
Subject: [PATCH 3/4] Fix typo

---
 docs/src/performance.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/src/performance.md b/docs/src/performance.md
index 3261f7aa473..8e0b53682e9 100644
--- a/docs/src/performance.md
+++ b/docs/src/performance.md
@@ -269,10 +269,10 @@ requires. It can thus be seen as a proxy for "energy used" and, as an extension,
     discard the first PID value.
 
 ## Performance issues due to false sharing
-False sharing is a known performance issue for with distributed caches. It also occurred for
-the implementation of a thread parallel bounds checking routine for the subcell IDP limiting
+False sharing is a known performance issue for systems with distributed caches. It also occurred
+for the implementation of a thread parallel bounds checking routine for the subcell IDP limiting
 in [PR #1736](https://github.com/trixi-framework/Trixi.jl/pull/1736).
-After some [experimentation and discussion](https://github.com/trixi-framework/Trixi.jl/pull/1736#discussion_r1423881895)
+After some [testing and discussion](https://github.com/trixi-framework/Trixi.jl/pull/1736#discussion_r1423881895),
 it turned out that initializing a vector of length `n * Threads.nthreads()` and only using every
 n-th entry instead of a vector of length `Threads.nthreads()` fixes the problem.
 Since there are no processors with caches over 128B, we use `n = 128B / size(uEltype)`.

From 54114e4bf8f1d0d7d51f559c7a5303631cc6bc8b Mon Sep 17 00:00:00 2001
From: bennibolm <benjamin.bolm@gmx.de>
Date: Mon, 5 Feb 2024 11:24:50 +0100
Subject: [PATCH 4/4] Implement suggestions

---
 docs/src/performance.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/src/performance.md b/docs/src/performance.md
index 8e0b53682e9..82d7f501f63 100644
--- a/docs/src/performance.md
+++ b/docs/src/performance.md
@@ -268,9 +268,10 @@ requires. It can thus be seen as a proxy for "energy used" and, as an extension,
     `AnalysisCallback` is invoked at least once during the course of the simulation and
     discard the first PID value.
 
-## Performance issues due to false sharing
-False sharing is a known performance issue for systems with distributed caches. It also occurred
-for the implementation of a thread parallel bounds checking routine for the subcell IDP limiting
+## Performance issues with multi-threaded reductions
+[False sharing](https://en.wikipedia.org/wiki/False_sharing) is a known performance issue
+for systems with distributed caches. It also occurred in the implementation of a
+thread-parallel bounds checking routine for the subcell IDP limiting
 in [PR #1736](https://github.com/trixi-framework/Trixi.jl/pull/1736).
 After some [testing and discussion](https://github.com/trixi-framework/Trixi.jl/pull/1736#discussion_r1423881895),
 it turned out that initializing a vector of length `n * Threads.nthreads()` and only using every