Experiment with S3-FIFO eviction policy #29
@aktau Thank you for sharing the information! I think we can take a look at this discussion by Ben, the author of Caffeine/TinyLFU: 1a1a11a/libCacheSim#20
I got a new laptop last week (m3 max 14-core) and ran Caffeine's benchmark using 16 threads. At close to 1B reads/s (40% of an unbounded baseline), I have to disagree with those authors that this represents a scalability bottleneck. There is also an independent hit rate analysis by a library author who is trying to adopt their algorithm with their guidance. That shows a large fluctuation where the hit rate can decrease at larger sizes. It does not appear yet to be a reliably general purpose algorithm.
Let me give my opinion as well, since I've already gathered a lot of attempts at adopting it.
In summary, I completely disagree about the scalability advantage of S3-FIFO over the other policies. In terms of hit ratio, S3-FIFO can indeed often outperform W-TinyLFU, but it performs poorly on LFU-friendly traces. I think S3-FIFO has a right to live, especially if someone can make something more adaptive out of it, but the thought of dropping everything and rewriting it all on W-TinyLFU is getting more and more tempting 🙂.
Though to be honest, I still don't understand what was causing the hit ratio to drop on DS1.
Thanks, @maypok86! It's always very interesting to hear an experience report. I can't speak for any implementation other than my own, so it would be helpful if you could also compare hit rates against Caffeine in its simulator. Could you please give that a try? I expect that static W-TinyLFU will lose on recency-biased traces, as this was a known deficiency (shared with the S3-FIFO author in 2016 here). The adaptive scheme has worked very well in all of the workloads that I have tested, where the policy is competitive with the leading algorithm. I don't expect it to always win, but for a general-purpose cache, I wanted it to robustly be near the top in any user workload. There are likely areas deserving improvement, but without data, I have to wait until there is something to analyze and discuss.

If comparing just on hit rates, the best alternative policy that I have seen is LIRS2 (code, paper, video). However, LIRS (v1) is very complicated to implement correctly and debug, which is why I invested in improving TinyLFU, as it seemed to be a more promising approach that I could reasonably maintain. In their v2, it is competitive except for my adaptive stress test. Of course, in practice, one has to consider all of the other characteristics, like metadata overhead and concurrency needs. There are a lot of design tradeoffs to consider when deciding on a policy, as the best choice depends on the end goals.

I would have expected S3FIFO to support lock-free reads and blocking writes based on the description. Since the worst-case eviction time is O(n), that time under the lock could become a concern. I'd probably just evict the next item after a threshold scan limit because I would be concerned about an unknown user workload hitting the worst-case behavior (akin to GC thrashing causing long pauses). The BP-Wrapper approach amortizes those costs as it samples the read requests, and the read throughput exceeds practical needs, so the time it steals helps keep the critical section bounded and more predictable. It is more complex than using a Clock-based policy, but as an engineer, this predictability in tail latencies is worth the effort when providing a library.

My very limited (and likely outdated) understanding is that Go does not have a very good suite of concurrent data structures and primitives, at least compared to Java's, so it can be much more difficult to implement efficient multi-threaded code. I don't recall the author implementing a thread-safe version, so there might be oversights, and I suppose those claims should be taken skeptically.
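To make the scan-limit idea above concrete, here is a minimal Go sketch. The cache, FIFO, and node types are hypothetical illustrations rather than any library's real layout: after a bounded number of promotions the next entry is evicted regardless of its frequency, so a single eviction can never walk the whole queue under the lock.

```go
package cache

import "sync/atomic"

type node struct {
	key  string
	freq atomic.Int32 // bumped on access, reset by the policy
}

type fifo struct{ items []*node } // toy FIFO, not concurrent

func (q *fifo) popHead() *node {
	if len(q.items) == 0 {
		return nil
	}
	n := q.items[0]
	q.items = q.items[1:]
	return n
}

func (q *fifo) pushTail(n *node) { q.items = append(q.items, n) }

type cache struct {
	table       map[string]*node
	small, main fifo
}

// evictOne pops from the probationary FIFO, promoting recently used entries,
// but caps the scan so a single eviction is O(scanLimit) instead of O(n).
func (c *cache) evictOne(scanLimit int) {
	for scanned := 0; ; scanned++ {
		n := c.small.popHead()
		if n == nil {
			return // nothing left to evict
		}
		if scanned < scanLimit && n.freq.Load() > 0 {
			n.freq.Store(0)
			c.main.pushTail(n) // promote instead of evicting
			continue
		}
		delete(c.table, n.key) // evict: cold, or the scan budget ran out
		return
	}
}
```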
@maypok86 One thing I'm interested in: did you compare the performance of the xsync map and a sharded map (throughput and GC)? According to the xsync author there is some overhead (puzpuzpuz/xsync#102). If the xsync map always outperforms, I think I can also switch to it. The current Theine implementation is based on a sharded map and I do some optimizations on it. Also, as @ben-manes suggested, you can test hit rates using Caffeine's simulator; Theine's window queue and adaptive algorithm are a little different from Caffeine's.
@Yiling-J I wanted to open a separate issue about this (and most likely about the theine benchmarks in general), but that would be a very long discussion, and it's much more important for me to finish work on otter right now. So I'll come back with that, but much later 🙂. I did, for example, manage to get my ideas across to the author of memorycache, but it took a huge amount of time. @ben-manes LIRS2 looks very smart; I'll try to look in its direction, but I'm not sure such a thing would be easy to maintain. I also tried running the Caffeine simulator and the verdict is about the same: S3-FIFO was within ±2% on all traces used in otter except S3 and DS1, and otter showed about the same hit ratio. On DS1 (6,000,000 capacity), for example, the loss was quite serious.
By the way, I was quite surprised by the difference in hit ratio between theine and caffeine on P3 (25,000 capacity): 8% vs 14%. Yes, perhaps the main problems with S3-FIFO are the inability to adapt and the possibility of an eviction taking O(n), and any attempt to fix this can lead to a worse hit ratio. In the short term, I don't know what to do about it. Go really still doesn't have efficient concurrent data structures; there are only implementations scattered across different libraries, based on research papers or ported from other languages, and unfortunately most of them are barely maintained. This is largely what motivated me to build an efficient cache library under such limited circumstances.
It looks like the ARC traces are no longer public due to an overzealous Dropbox security restriction, and unfortunately I did not use P3 regularly or recently enough to have it locally. I emailed the author to ask if they can be made available again, but if you have all of the downloads, I might need you to publish them.
@ben-manes I just took the traces from the ristretto benchmarks: https://github.com/dgraph-io/benchmarks/tree/master/cachebench/ristretto/trace (better to download them individually as the repository is very large).
Hi all, I missed the fun party! This is the author of the S3-FIFO algorithm. On block workloads, especially the old ones (like the ARC ones, which are more than twenty years old), S3-FIFO may not outperform W-TinyLFU. Similar results can also be found in our paper: https://dl.acm.org/doi/10.1145/3600006.3613147. With that being said, I believe if
Hi @ben-manes, can you give more details about the scalability benchmark you showed in the figure? I am interested to learn how you improved LRU's scalability. We did have a prototype in which we observed much better scalability than the TinyLFU in the cachelib library.
@1a1a11a I'm certainly not Ben, but it seems that BP-Wrapper and a set of wait-free lossy buffers can give even greater scalability than lock-free queues. Articles on the caffeine architecture: first and second.
Cachelib mimics memcached's scheme of an exclusive lock (optionally a try-lock) and a per-item time interval between policy updates (60s). That was good enough for their internal needs.

Caffeine uses dynamically striped, lock-free, lossy MPSC ring buffers to sample the request events. This differs from the authors' BP-Wrapper implementation, where they used an isolated system with a dedicated cache (Postgres), so they leveraged thread-local buffers as the memory usage and thread counts were known up front. That didn't make sense for a general-purpose library embedded into an application which might have hundreds of instances, each with different scalability needs. It is their design but with a different implementation approach. The dynamic striping (the number of read buffers) increases as contention is detected, mirroring how a scalable counter is implemented. The ring buffers capture the events cheaply, and if one is full then the event is dropped. Since a cache is probabilistic and trying to detect a usage pattern for hot-cold entries, this lossy behavior does not meaningfully impact the hit rate. Whether one uses a lossy buffer, a time threshold, a FIFO clock, etc., the insight is the same: a perfect history of the access order is not required to make good predictions.

The writes are also buffered, using a bounded MPSC queue which grows up to a threshold. This allows the cache to absorb most independent writes without threads contending on each other, and to batch the work to be replayed under the lock. These writes are a cache miss, which indicates that the application had to perform an expensive load (10ms+). While in applications the write rate won't exceed the eviction rate, it can happen in a synthetic stress test. In those cases back-pressure is applied to the buffering, which assists in batching the work to avoid context switches or lock contention. Generally writes are limited by the hash table's write throughput, as shown in those benchmarks. The hash table offers lock-free reads, uses dynamically striped locks for writes, offers computations under the fine-grained lock, and protects against hash flooding by switching the buckets from linked lists to red-black trees when needed.

The buffering of events allows us to schedule the draining as appropriate, either after a write or when a read buffer is full. That uses a simple state machine and a try-lock to coordinate, so the eviction policy does not suffer from lock contention. As BP-Wrapper describes itself, it "(almost) eliminates lock contention for any replacement algorithm without requiring any changes to the algorithm". This allows for more exploration of algorithms and data structures because the policies do not need to be thread-safe.

The benchmark that I use has a pre-populated cache with a zipf distribution for the access requests. That creates hot spots, as some entries are used much more often than others, just like a real cache would experience. A common mistake is a uniform distribution, which evenly spreads the contention and would imply that random replacement is the ideal policy. A skewed distribution means that locks suffer higher contention, so it exacerbates that problem, while also benefiting contention-free implementations that can better utilize the CPU cache. In the real world, 50-70M reads/s is a good bar to reach to satisfy most users, and the focus then turns towards other areas like hit rate and features.
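For readers unfamiliar with the technique, here is a rough Go sketch of one lossy ring buffer of the kind described above: writers race to claim a slot and simply drop the event when the buffer is full or contended, and a single drainer empties it under the policy's try-lock. The types and sizes are illustrative assumptions, not Caffeine's or otter's actual implementation. In the real thing there is a small array of such buffers selected by a per-thread hash, and the number of stripes grows when CAS failures indicate contention.

```go
package cache

import "sync/atomic"

const readBufferSize = 16 // power of two, so a mask can replace modulo

// readBuffer is a bounded, lossy MPSC ring: losing a race or finding the
// ring full just drops the event, since the policy only needs a sample.
type readBuffer[K comparable] struct {
	head  atomic.Int64 // next slot to drain
	tail  atomic.Int64 // next slot to claim
	slots [readBufferSize]atomic.Pointer[K]
}

func (b *readBuffer[K]) Add(key *K) {
	head, tail := b.head.Load(), b.tail.Load()
	if tail-head >= readBufferSize {
		return // full: drop the read event
	}
	if !b.tail.CompareAndSwap(tail, tail+1) {
		return // another reader claimed the slot: drop as well
	}
	b.slots[tail&(readBufferSize-1)].Store(key)
}

// Drain is called by whichever thread wins the policy's try-lock.
// A slot that was claimed but not yet published is simply skipped (lossy).
func (b *readBuffer[K]) Drain(record func(K)) {
	for head := b.head.Load(); head < b.tail.Load(); head++ {
		slot := &b.slots[head&(readBufferSize-1)]
		if key := slot.Load(); key != nil {
			record(*key)
			slot.Store(nil)
		}
		b.head.Store(head + 1)
	}
}
```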
Yes
Yes. There are more pieces, each simple, which offer greater predictability. It is a tradeoff of a little more upfront work for fewer runtime surprises.
No. If we assume zero overhead (an unbounded map) then the measured per operation overhead is,
This seems debatable and would require being more explicit about the calculations. Assume a traditional ghost history based on a doubly-linked list, closed hashing, and a 64-bit machine. If optimized for space, we can say the lower limit must be at least the 32-bit key hash (4 bytes); as in, we increase the time complexity for find / add / delete by going through an array of hashes. This seems to be an unlikely implementation choice, and it will be higher in practice. Caffeine uses a CountMinSketch sized relative to the maximum capacity. The extra metadata overhead of maintaining the history seems to be, at best, equivalent. In practice it appears more likely that the frequency sketch will be a lower cost than a ghost list. The size of the ghost list is the unknown factor, e.g. @maypok86 uses 0.9x. If it was instead significantly smaller than the total capacity, e.g. 0.2x, then it could be favorable.

The resident per-entry metadata overhead is less clear. W-TinyLFU describes only an algorithmic structure, and the paper discusses the implementation choices made by Caffeine for the purpose of an evaluation. Ristretto decided to use sampled LFU with TinyLFU (doorkeeper + count-min) to reduce the metadata overhead, based on their analysis that their use-case was frequency-biased. It appears to me that S3-FIFO and W-TinyLFU would be equivalent for resident entries in their common implementations.
Ugh, I urgently need a popcorn reaction on GitHub. I'm probably not competent enough to argue about S3-FIFO vs W-TinyLFU. Let me just say that if you only need find and add operations in a ghost queue, you can manage with about 18 additional bytes per entry if the hash takes 8 bytes (if it takes 4 bytes, you need half as much), but if you need to remove arbitrary items from it when changing the algorithm, things get bad.
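As a concrete illustration of that point (my own sketch, not code from any of the libraries discussed): a find/add-only ghost queue can be just a fixed ring of key hashes plus a set for membership, which is roughly where the "about 18 extra bytes per entry with 8-byte hashes" estimate comes from once the hash-set slot is counted. Supporting removal of arbitrary entries is what forces a linked structure and blows up the cost.

```go
package cache

// ghostQueue supports only Contains and Add; hash 0 is reserved as "empty".
type ghostQueue struct {
	ring []uint64            // fixed-size FIFO of key hashes (~8 bytes each)
	set  map[uint64]struct{} // membership check (roughly another word per entry)
	next int
}

func newGhostQueue(size int) *ghostQueue {
	return &ghostQueue{
		ring: make([]uint64, size),
		set:  make(map[uint64]struct{}, size),
	}
}

func (g *ghostQueue) Contains(hash uint64) bool {
	_, ok := g.set[hash]
	return ok
}

func (g *ghostQueue) Add(hash uint64) {
	if old := g.ring[g.next]; old != 0 {
		delete(g.set, old) // overwrite the oldest ghost entry
	}
	g.ring[g.next] = hash
	g.set[hash] = struct{}{}
	g.next = (g.next + 1) % len(g.ring)
}
```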
It's not so much an argument for or against; it's that there are a lot of design choices where those details matter. I find it frustrating when sweeping generalizations are made that conveniently leave out important caveats or alternatives that a reader should be aware of, because they may weaken the author's thesis. As an engineer, I appreciate having a richer picture because I need to balance tradeoffs, support the usage long-term, and cope with future requirements. This is in conflict with an academic audience, where goals are more short-term, e.g. peer review, and the work is typically discarded after publication. When the discussion is oriented towards an engineering audience, going into more breadth and depth over the facts matters so that the tradeoffs are understood. Usually that should end up with a mixture of ideas, like how you applied insights from both BP-Wrapper and S3-FIFO, to discover the right design fit for the target environment.
Hi @ben-manes, Thanks for the super detailed explanations! A few comments and questions.

Comments

I like the TinyLFU design, and it shows the second best results in our evaluation (with a window size of 10%). I have no doubt calling it the state-of-the-art. The figure above shows the evaluation of 6594 traces from 14 datasets (CDN, kv, and block).

Metadata size

Regarding the metadata size, as a FIFO-based algorithm, if objects in the workload have a uniform size, we can implement the FIFO queues using ring buffers, and AFAIK several production cache implementations (e.g., TigerBeetle, VMware) do use ring buffers. For the ghost entries, we implemented them as part of a bucket-based hash table to take advantage of the over-provisioned space. They are cleared out during hash collisions. This means we used no more than 8 bytes (4-byte timestamp and 4-byte fingerprint) per ghost entry. Since there are a similar number of ghost entries as entries in the cache, we can estimate that the ghost entries add no more than 8 bytes per cached object. Our sensitivity analysis shows that we can reduce the number of ghost entries to 40% of the cached entries with almost no impact.

Scalability

I appreciate the efforts put into making Caffeine very scalable. However, the FIFO queues in S3-FIFO do not need any fancy technique to be scalable. All we need is atomics. I implemented it in C/C++, and everything is straightforward. However, I recently realized that it is non-trivial to have scalable and efficient implementations in other languages, e.g., Java and Rust. But the scalability of S3-FIFO is fundamental to the design.

Questions
I still doubt that Caffeine can achieve 1000 MQPS on 16 threads. The numbers indicate that a hash look-up plus recording the request on a ring buffer takes 16 ns. But an L1 access is ~1 ns, and a DRAM access takes ~100 ns. I simply don't understand how this is possible; it would indicate that most data accesses (including the hash table) happen in the L1 cache. A friend of mine implemented a (not optimized) FIFO in C and benchmarked its scalability on a server with 72 cores (using a pre-populated Zipf request stream). However, she was only able to get ~100 MQPS. Happy to learn more from you! :)
Great comments, @1a1a11a. I am very happy to see more caches moving beyond LRU, so a variety of simple, easy, or more comprehensive solutions is wonderful. There is a lot of low-hanging fruit beyond the classics. One of the differences to keep in mind is that the TinyLFU papers are closer to design patterns than algorithms. They provide a structure and techniques but only suggest implementation approaches. That means that implementations can differ; for example, cachelib uses 32-bit counters in their CountMinSketch, whereas Caffeine's are 4-bit. The variety of choices can impact metadata overhead, hit rate, throughput, and concurrency support.

Metadata

I don't believe that the ghost entry's cost in the hash table and FIFO queue can be dismissed to say it only takes 8 bytes (timestamp + fingerprint). A hash table typically has a load factor of around 75%, and it may be higher for a cache since the maximum capacity is known. That leaves less space for the over-provisioned table size, but it also adds at least a pointer for the chained entry and maybe more (e.g. a hash field if cached, a doubly-linked list for faster removals, etc). The cost of the FIFO should also be accounted for as part of the ghost's overhead. A reasonable minimum estimate may be that the FIFO holds a 4-byte fingerprint, the hash table adds 8 bytes for the chain pointer, the table over-provisioning is dismissed for simplicity, the entry costs 8 bytes for the key and timestamp, and the ghost region ranges from 0.4x to 0.9x of total capacity. That is roughly 20 bytes per ghost entry, so an incremental cost of 8 to 18 bytes of metadata overhead for each additional cache entry. That's great and not much higher than the 4-16 bytes per entry for TinyLFU.

Scalability
Yes, that is the ideal, because this micro-benchmark tries to isolate the costs to only measuring the cache in a concurrent workload. It would show bottlenecks such as those due to hardware cache coherence, data dependencies, poor branch prediction, saturation of the logic units (there are multiple ALUs for every FP unit), (lack of) SIMD, waits on the store buffer, blocking and context switches, system calls, inefficient algorithms, etc.
A modern CPU core is superscalar (6-wide) and fuses multiple operations into a bundle (8 micro-ops). If we avoid data dependencies and use simple integer instructions, then we'll see higher CPU utilization, as that is much more work than a single instruction per cycle. At best a CPU core could then perform 48 instructions per cycle, which at 4GHz means 192 instructions per nanosecond, or 2688 instructions per nanosecond across all 14 cores. The benchmark's overhead is thread-local work for an index increment, an array lookup, the loop, and an operation counter. A read performs a hash table lookup, a few liveliness validation checks, hashing, and maybe an addition to a ring buffer. The hashes are very inexpensive bitwise and multiplication operations. Since the ring buffers for recording a read are lossy and drained by a single thread, they fill up very quickly and additions are usually skipped. In those cases there is no blocking on a lock, no CAS, no volatile writes, etc., so at maximum load it stays close to the hash table's performance. If we introduce a cost like data dependencies (such as by using a random number generator to select the next key), we'd see this fall sharply, as we would no longer allow the CPU/compiler to use the hardware's full potential.

I refer to BP-Wrapper as request sampling because it sheds load (policy maintenance) by not recording the read in the ring buffer. The popular items will reenter the buffer at a higher frequency, so when the policy drains the read buffer it still gets to observe the heavy hitters and can make a reasonable prediction without slowing down the application.
There are many easy mistakes that could severely diminish throughput.
I tried to update the otter benchmarks today and to replicate the caffeine benchmarks, and there seems to be a reason for these numbers in the caffeine benchmark results. Caffeine uses a capacity of 2 << 14 = 32768 (cache pre-filled) and the scrambled zipfian distribution from Yahoo. On such a benchmark I got fantastic speed from otter.
But if I increase the cache capacity to 2 << 20 = 2097152 (used in the ristretto benchmarks), the results are already noticeably worse, and I suspect it will be the same with caffeine.
I also don't understand why the benchmarks are configured so that elements are not evicted at all. It doesn't seem very representative of real-world loads.
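For reference, here is roughly the shape of the read benchmark being discussed (my reconstruction, not the actual otter or Caffeine harness): the cache is pre-filled, the Zipf-distributed keys are generated up front into a slice, and each goroutine only does an index increment and an array lookup on the hot path. New, Set, and Get are stand-ins for whatever cache API is being measured.

```go
package cache_test

import (
	"math/rand"
	"testing"
)

func BenchmarkCacheReads(b *testing.B) {
	const capacity = 2 << 14 // 32768, as in the Caffeine benchmark
	const mask = capacity - 1

	// Pre-generate a skewed key set so the generator is off the hot path.
	r := rand.New(rand.NewSource(1))
	zipf := rand.NewZipf(r, 1.01, 1, capacity*16)
	keys := make([]uint64, capacity)
	for i := range keys {
		keys[i] = zipf.Uint64()
	}

	c := New[uint64, uint64](capacity) // hypothetical constructor
	for _, k := range keys {
		c.Set(k, k) // pre-fill so reads mostly hit and nothing is evicted
	}

	b.RunParallel(func(pb *testing.PB) {
		i := rand.Intn(capacity) // per-goroutine starting offset
		for pb.Next() {
			c.Get(keys[i&mask])
			i++
		}
	})
}
```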
This is because it is easy to write an incorrect benchmark or to misunderstand its results. Therefore that benchmark is very narrow, to answer a very specific question. In this case that is whether the cache is a fundamental scalability bottleneck that may harm the application's performance. A thorough analysis should ask multiple, narrow, verifiable questions that each have their own micro-benchmark for understanding the system. In no way is that single benchmark enough, but it does answer a common question and counters the belief against LRU. That question (is the cache hit penalty a bottleneck?) and its answer do not mean that being infinitely faster is important; there is a point of diminishing returns, and the answer is "yes or no", not "by how much". Caffeine does include an eviction benchmark to measure that worst-case throughput. Otherwise the questions become much murkier and easy to misinterpret, so they were written as needed and not included in the repository. The concern is that including them might imply that they are valid and trustworthy comparisons rather than exploratory throwaways during development. You should write more benchmarks for analyzing your cache, but you should also be conservative publicly to avoid others misinterpreting the results.

An example of trying to follow your logic and making an easy mistake was in RedHat's benchmark. It is much faster to perform a cache miss than a cache hit because there is no additional work for maintaining the eviction policy. That penalized caches with higher hit rates, as an increased throughput was observed when there were more misses. They didn't realize this and misunderstood their measurements, so during a rewrite their newer implementation suffered very high contention on a cache hit, and the lower hit rate made it appear to be an improvement. Your question is good and worth exploring, but be very careful due to how trivially easy it is to make an invalid comparison or an honest misunderstanding of the results.

A few related talks that I found useful:
Interesting idea about the benchmark separation. Let me share yesterday's adventures (and conclusions); I think it might be useful to someone. I ran the benchmarks in the same configuration as caffeine on two versions of otter:
```go
type Node[K comparable, V any] struct {
	key   K
	value atomic.Pointer[V]
	...
}
```

And the results were much better (you've already seen them):
So it looks like no one in Go can come close to the throughput of caffeine without additional memory overhead and pressure on the GC. Otter, of course, already uses a set of tricks to reduce GC pressure and extra memory use, but such changes effectively cancel them all out.
We are running on different machines, so you would have to see what Caffeine's numbers are on yours. My latest numbers came from a brand new M3 Max MacBook, where I bumped the benchmark from 8 to 16 threads. The three workload types would also be nice to see, and a comparison to an unbounded hash map as a control group. Java's ConcurrentHashMap is very fast, so that might be what gives caffeine an advantage if it wins on your hardware.
I'm unsure why you'd spin-lock on read, but if it is important then you might find a stamped lock useful. Basically, use a version counter so reads verify their stamp to ensure a consistent read, instead of blocking each other by acquiring exclusive access.
Oh, and your spinlock is TAS whereas using TTAS is usually preferred.
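For context, here is the difference being pointed out, as a sketch using the same uint32-based SpinLock shape that appears later in the thread: TAS retries the CAS in a tight loop and keeps the cache line in exclusive mode, while TTAS spins on a plain load and only attempts the CAS once the lock looks free.

```go
package cache

import "sync/atomic"

type SpinLock uint32

// LockTAS: every iteration is a CAS, which bounces the cache line
// between cores in exclusive mode.
func (sl *SpinLock) LockTAS() {
	for !atomic.CompareAndSwapUint32((*uint32)(sl), 0, 1) {
	}
}

// LockTTAS: spin on a read (shared cache line) and CAS only when it looks free.
func (sl *SpinLock) LockTTAS() {
	for {
		if atomic.LoadUint32((*uint32)(sl)) == 0 &&
			atomic.CompareAndSwapUint32((*uint32)(sl), 0, 1) {
			return
		}
	}
}

func (sl *SpinLock) Unlock() {
	atomic.StoreUint32((*uint32)(sl), 0)
}
```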
The problem isn't even so much the numbers compared to caffeine, but the fact that the spinlock eats up a lot of time in such benchmarks (CLHT is the hash table that otter uses).
With the spinlock, otter is about three times slower.
I think if you use a stamped spinlock then the optimistic read will usually succeed and the spinlock overhead will mostly disappear. But I'd also try TTAS first.
@ben-manes Why am I using a spinlock? To atomically update the Entry received from the hash table when calling Set, since there is no synchronized or anything like that in Go. And I was even able to save 4 bytes with this. Also, since Get on the cache instance needs to read the value atomically, I use the same spinlock. There is a problem with StampedLock: it takes much more memory, judging by the Java implementation.
And a TTAS spinlock doesn't seem to save the situation much either:

```go
// SpinLock is a simple uint32-based lock, as in the cast below.
type SpinLock uint32

func (sl *SpinLock) Lock() {
	acquired := false
	for !acquired {
		// Spin on a plain load first (TTAS) and only CAS once the lock looks free.
		if atomic.LoadUint32((*uint32)(sl)) == 0 {
			acquired = atomic.CompareAndSwapUint32((*uint32)(sl), 0, 1)
		}
	}
}
```
Supposedly a black-hole assignment is a trick to hint to the compiler to insert a compiler barrier. It is to ensure that the read of the value is not reordered around the loads of the sequence number:

```go
func compilerBarrier() {
	var dummy int
	_ = dummy // black-hole assignment, hoping the compiler treats it as a barrier
}

func (n *Node[K, V]) Value() V {
	for {
		seq := n.lock.Load()
		compilerBarrier()
		if seq&1 != 0 {
			// A writer holds the seqlock; yield and retry.
			runtime.Gosched()
			continue
		}
		value := n.value // safe read after memory barrier (intended)
		newSeq := n.lock.Load()
		if seq == newSeq {
			return value
		}
	}
}
```
I've tried something similar, but it doesn't work. The compiler is too smart in this case. I'll have to take the issue to the developers, because I don't really understand why it decides to place the copy of value after the newSeq load.
What if you make a method-scoped mutex variable and lock around the value read? It would be uncontended and essentially dead code, but it might not be eliminated, or at least it would force the barriers. If Go follows the roach-motel model it would also be safe, I think?
Ahahahaha, that kind of thing really works.

```go
func (n *Node[K, V]) Value() V {
	var mutex sync.Mutex
	for {
		seq := n.lock.Load()
		if seq&1 != 0 {
			runtime.Gosched()
			continue
		}
		mutex.Lock()
		value := n.value
		mutex.Unlock()
		newSeq := n.lock.Load()
		if seq == newSeq {
			return value
		}
	}
}
```
wonderful! you may be able to use an
@ben-manes Yes, it's very similar to that and everything will indeed happen on the stack. I played around a bit and found out that even this ridiculous variant works.

```go
func (n *Node[K, V]) Value() V {
	var lol atomic.Uint32
	for {
		seq := n.lock.Load()
		if seq&1 != 0 {
			runtime.Gosched()
			continue
		}
		value := n.value
		lol.Store(1)
		newSeq := n.lock.Load()
		if seq == newSeq {
			return value
		}
	}
}
```
Yep, that's what I was suggesting and it makes sense. They are enforcing compiler barriers for release semantics but oddly not for acquire semantics. That's really convoluted, but 🤷♂️
Very interesting: on laptops with Linux and Windows and Intel and AMD (x64) processors (all combinations), everything works fine without additional atomics, but it doesn't work on Macs with ARM. I wish I could get an Intel Mac somewhere...
I believe ARM uses a weak memory model: https://preshing.com/20120930/weak-vs-strong-memory-models/
In general, it doesn't look like it, since gossa shows a rearrangement of the instructions. And when adding the Store, judging by the instructions, the barrier is emitted. Perhaps on some processors/architectures the race condition simply doesn't manifest in time. But that's not certain.
True, if the instructions have already been rearranged before running, then it's probably not relevant. I am not an expert on this, so I will just watch and learn.
@ben-manes That is true, thread-safe data structures are rife with locks in Go. But in my experience, this is less of an issue than in Java because of goroutines (lightweight threads) and channels, which encourage message passing over memory sharing.
Oh man, in the Value method, lol and mutex are allocated on the heap.
@jbduncan It's not quite true: for a large number of applications in Go, Java, and other languages the usual map with a mutex will suffice, but if you need something more, you suddenly start to have problems. Because a map with a mutex will not save you from blocking hundreds of goroutines, and sync.Map is practically useless. The channels cannot be called efficient in any way either. In Java you can just take ConcurrentHashMap and live peacefully, but in Go people have to work around it (most often with map sharding). The canonical Go way unfortunately doesn't always work either; the simplest examples are fasthttp and VictoriaMetrics, which break the rules but achieve excellent efficiency and are very popular. For example, my company switched to VictoriaMetrics simply because it was able to handle loads that competitors couldn't, and nobody cares that it has a huge amount of non-canonical code and specific optimizations. Also, when you have a well-performing, more efficient solution, everyone I know immediately asks, "Why would I use anything else?"
@maypok86 Oh, good point. Though I'm not sure what you mean by "channels cannot be called efficient in any way", so at the risk of derailing things, I was wondering if you had an example?
@jbduncan In fact it is very hard to notice the problems with channels, unlike a map with a mutex, and they really do their job in 99% of cases. But they use a mutex internally, which is why, on large machines under heavy load, passing data between goroutines can become a bottleneck for the program. Here is an example comparison with a more elaborate queue architecture (concurrent producers and consumers (1:1), queue/channel size 1,000, some work done by both producers and consumers):
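The numbers from that comparison aren't reproduced above, but here is a rough sketch of the setup being described (one producer and one consumer, a buffered channel of 1,000, a little synthetic work on each side; the competing queue implementation is omitted):

```go
package pipeline_test

import (
	"sync"
	"testing"
)

func BenchmarkChannelPipeline(b *testing.B) {
	ch := make(chan int, 1000)
	var wg sync.WaitGroup
	wg.Add(1)

	go func() { // consumer: does a little work per item
		defer wg.Done()
		sum := 0
		for v := range ch {
			sum += v * v
		}
		_ = sum
	}()

	for i := 0; i < b.N; i++ { // producer: does a little work per item
		ch <- i ^ (i >> 3)
	}
	close(ch)
	wg.Wait()
}
```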
In sync.Pool it uses
It is huge because the empty sync.Pool uses the getSlow method (it finds nothing and then tries to find something in the other per-P pools). Interestingly, the approach of using counters to determine the OS thread id is faster, but it ends up roughly equal to spinlocks. Benchmark results (reads slower than updates):
I guess then: what is the impact of having a global atomic field that is blindly written to, e.g. always setting a boolean to false? Since it's not loaded or in lockstep, will it cause a lot of coherence traffic as CPUs trade MESI ownership of the cache line? Or will it just create a store barrier and some light traffic? I suspect a little heap allocation will be the best of the bad choices. The GC will keep getting better, and eventually you may get a proper compiler fence to replace the allocation.
Unfortunately, the global atomic makes things much worse.
Allocating on each Get request bothers me, because with high probability the GC will spend much more time (and in fact a lot of memory) cleaning up these allocations than we gain. And there are several possible solutions here:
By the way, here is the performance with a seqlock and an atomic barrier on every Get request.
And with the spinlock it looks like this.
Yep, as expected with the global. Given the low write rates, @1a1a11a's RCU approach is probably your best option design-wise. The spinlock or atomic.Value is the most pragmatic: low effort, likely good enough, and easiest to refactor if you get the missing functionality in a later Go release.
In fact, I doubt they will add anything to Go that can fix seqlocks. So I tried rewriting the code to RCU, and by preliminary measurements I'm quite happy with the performance. (On 100% writes it looks like it's all bottlenecked by the queue.)

reads=100%,writes=0%
reads=75%,writes=25%
reads=50%,writes=50%
reads=25%,writes=75%
reads=0%,writes=100%

It seems that even for write-heavy workloads, a quickly written RCU holds up quite well.
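A minimal sketch of what the RCU-style approach can look like in Go (my illustration; the entry and slot types are assumptions, not otter's real layout): the entry is treated as immutable, readers do a single atomic pointer load with no retry loop, and a writer installs a fresh copy while the old one is left for the GC to reclaim.

```go
package cache

import "sync/atomic"

// entry is immutable once published; updates replace the whole entry.
type entry[K comparable, V any] struct {
	key   K
	value V
}

type slot[K comparable, V any] struct {
	ptr atomic.Pointer[entry[K, V]]
}

// Get is a single atomic load: no seqlock retry loop, no spinlock.
func (s *slot[K, V]) Get() (V, bool) {
	e := s.ptr.Load()
	if e == nil {
		var zero V
		return zero, false
	}
	return e.value, true
}

// Set is copy-on-write; the previous entry is reclaimed by the GC,
// which is what makes this style cheap to express in Go.
func (s *slot[K, V]) Set(key K, value V) {
	s.ptr.Store(&entry[K, V]{key: key, value: value})
}
```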
Anyway, if anyone is interested, seqlocks in Go can be fixed with dark magic, namely by importing this function from the Go runtime:

```go
//go:linkname runtimeatomicwb runtime.atomicwb
//go:noescape
func runtimeatomicwb(ptr *unsafe.Pointer, new unsafe.Pointer)
```

But this is too forbidden a technique 🙂.
@maypok86 The original Ristretto version uses a sync pool with a buffer, making each buffer local to a P so no lock is needed. However, sync pools are easily GCed, causing loss of buffers. What if I disable sync pool GC, preallocate all buffers, and make the sync pool's New function only return nil? An example of overriding the sync pool: https://github.com/bytedance/gopkg/blob/main/lang/syncx/pool.go

I actually wrote some simple demo code when reading the Otter code:

```go
type SyncPoolBuffer[K comparable, V any] struct {
	pool Pool
	pm   []*PolicyBuffers[K, V]
}

func NewSP[K comparable, V any](nodeManager *node.Manager[K, V], size int) *SyncPoolBuffer[K, V] {
	b := &SyncPoolBuffer[K, V]{
		pool: Pool{
			New: func() any {
				return nil
			},
		},
	}
	for i := 0; i < size/4; i++ {
		buffer := &PolicyBuffers[K, V]{make([]node.Node[K, V], 0, capacity)}
		b.pm = append(b.pm, buffer)
		go b.pool.Put(buffer)
	}
	return b
}

func (b *SyncPoolBuffer[K, V]) Add(n node.Node[K, V]) *PolicyBuffers[K, V] {
	raw := b.pool.Get()
	if raw == nil {
		return nil
	}
	pb := raw.(*PolicyBuffers[K, V])
	if lb := len(pb.Returned); lb >= capacity {
		return pb
	} else {
		pb.Returned = append(pb.Returned, n)
		b.pool.Put(pb)
		return nil
	}
}

// Free returns the processed buffer back and also clears it.
func (b *SyncPoolBuffer[K, V]) Free(pb *PolicyBuffers[K, V]) {
	pb.Returned = pb.Returned[:0]
	b.pool.Put(pb)
}
```
} But the problem with this approach is also obvious: there is no contention anymore. Here I add a simple counter in afterGet: if SyncPoolBufferMode {
pb = c.syncPoolBuffer.Add(got)
} else {
pb = c.stripedBuffer[idx].Add(got)
}
if pb != nil {
c.evictionMutex.Lock()
c.policy.Read(pb.Returned)
logged += 1
c.evictionMutex.Unlock()
if SyncPoolBufferMode {
c.syncPoolBuffer.Free(pb)
} else {
c.stripedBuffer[idx].Free()
}
} and here is the throughput benchmark result: Striped RingBuffer: cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkCache/zipf_otter_reads=100%,writes=0%-8 100000000 12.25 ns/op 81640149 ops/s
==== logged 20817 SyncPool Buffer: cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkCache/zipf_otter_reads=100%,writes=0%-8 100000000 36.38 ns/op 27486156 ops/s
==== logged 5882350 What do you think about this idea? |
Yeah, this is probably theine's main problem when working under a high load.

```go
for i := 0; i < size/4; i++ {
	buffer := &PolicyBuffers[K, V]{make([]node.Node[K, V], 0, capacity)}
	b.pm = append(b.pm, buffer)
	go b.pool.Put(buffer)
}
```

It looks interesting and funny. It should even work, although it relies very heavily on the behavior of the scheduler. I would be careful here, since the scheduler may ignore one of the Ps and then the cache will always work poorly. It seems that you implemented the read buffers from ristretto but removed the GC option. I wouldn't say that GC is a problem of the ristretto read buffers. Moreover, I suspect that this has almost no effect on the throughput benchmarks. It may affect the hit ratio, but that needs to be checked. I haven't been able to replicate most of the ristretto charts, although the charts from my hit ratio simulator are almost exactly the same as yours. But at least I found the reason for the very small hit ratio in ristretto. In fact, I thought about the options for implementing read buffers for an incredibly long time and decided to do it in this form based on the following thoughts:

Maybe I'm wrong somewhere, but I reasoned something like this :) It's funny that I recently tried to make the read buffers dynamic and even got something working, but I won't risk pushing and merging it :).
@maypok86 I revisited your Ristretto issue, and finally part of the mystery has been solved. I totally agree that losses are intended on the read buffer. I think I'll just borrow your striped buffer because it naturally fits the current Theine, which handles read policy updates synchronously. The modified sync pool is extremely 'unsafe' and seems too magical, and I would also need to figure out how to make it lossy.
@maypok86 Here is the improved result after switching to Otter striped buffer and several other optimizations:
And this is the PR (#42); maybe you can also help review it, because I borrowed your code directly and only made a small modification.
Just saw this pass by on Hacker News: https://blog.jasony.me/system/cache/2023/08/01/s3fifo. Seems interesting. I wonder if it outperforms (not just in hit ratio, but also CPU time to calculate whether to evict) TinyLFU.