-
-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
3 changed files
with
39 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
README.adoc: | ||
The canonical asciidoc formatted README. Bite me. I don't like Markdown anymore. | ||
|
||
README.md: | ||
Created from README.adoc for crates.io. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Possum | ||
|
||
## What is it? | ||
|
||
Possum is a key-value cache stored directly on disk. It supports concurrent access from multiple processes without requiring interprocess communication or sidecar processes. It efficiently evicts data by using hole punching and sparse files. It supports read snapshots by using file cloning, a feature available on [filesystems](https://www.ctrl.blog/entry/file-cloning.html) such as Btrfs, XFS, ZFS, APFS and ReFSv2. | ||
|
||
Value reads and writes occur directly on files, allowing memory-mapping, zero-copy/splicing, vector I/O, polling and more. | ||
|
||
## Why? | ||
|
||
I couldn’t find cache implementations that supported storing directly on disk, and concurrent access from multiple processes without IPC, but also maintaining disk usage limits. They always seem to be in-memory with snapshots to disk, or single-process with the disk management (including all keys) maintained in memory too. There are plenty of single-process, and fewer concurrent key-value disk stores, but they have no regard for space limits and don’t do user-level caching well. So while in-memory key-value caches are well-supported, there is an unfilled niche for disk-caches, particularly one that can be shared by multiple applications. There are plenty of systems out there that have huge amounts of disk space that could be shared by package managers, HTTP proxies and reverse proxies, thumbnail generation, API caches and more where main memory latency isn’t needed. | ||
|
||
## Technical Details | ||
|
||
Possum maintains a pool of sparse files and a manifest file to synchronize writes and reads. | ||
|
||
Writing is done by starting a batch. Possum records keys and exposes a location to write values concurrently. Value writes can be done using file handles to allow splicing and efficient system calls to offload to disk, and by copying or renaming files directly into the cache. When writes are completed, the batch keys are committed, and the appended and new files are released back into the cache for future reads and writes. Batches can be prepared concurrently. Writing does not block reading. | ||
|
||
Reading is done by locking the cache, and recording the keys to be read. When all the keys to be read have been submitted, the regions of files in the cache containing the value data are cloned to temporary files (a very efficient, constant-time system operation) and made available for read for the lifetime of the snapshot. | ||
|
||
The committing of keys when a write batch is finalized, and the creation of a read snapshot are the only serial operations. Write batch preparation, and read snapshots can coexist in any arrangement and quantity, including concurrently with the aforementioned serial operations. | ||
|
||
Efficiently removing data, and creating read snapshots requires hole punching, and sparse files respectively. Both features are available on Windows, macOS, Solaris, BSDs and Linux, depending on the filesystem in use. | ||
|
||
## Supported systems | ||
|
||
macOS and Linux are supported. BSD should work. Solaris requires a small amount of work to complete the implementation. Windows is not supported: It has the necessary features, but [only ReFS implements block cloning](https://learn.microsoft.com/en-us/windows/win32/fileio/block-cloning). | ||
|
||
## anacrolix/squirrel | ||
|
||
I previously wrote [anacrolix/squirrel](https://github.com/anacrolix/squirrel), which can fit the role of both in-memory or disk caching due to SQLite’s VFS. However, it’s written in Go and so is less accessible from other languages. Using SQLite for value storage means large streaming writes are exclusive, even within a single transaction. Value data is serialized to the blob format, potentially across multiple linked pages, and copied multiple times as items are evicted and moved around. Direct file I/O isn’t available with SQLite, and the size of values must be known before they can be written, which can mean copying streams to temporary files before they can be written into the SQLite file. |