We describe WiSER, a clean-slate search engine designed to exploit high-performance SSDs with the philosophy "read as needed". WiSER utilizes many techniques to deliver high throughput and low latency with a relatively small amount of main memory; the techniques include an optimized data layout, a novel two-way cost-aware Bloom filter, adaptive prefetching, and space-time trade-offs. In a system with memory that is significantly smaller than the working set, these techniques increase storage space usage (up to 50%), but reduce read amplification by up to 3x, increase query throughput by up to 2.7x, and reduce latency by 16x when compared to the state-of-the-art Elasticsearch. We believe that the philosophy of "read as needed" can be applied to more applications as the read performance of storage devices keeps improving.
WiSER was formerly called Vacuum, so you will see the name 'vacuum' a lot in this repo.
The paper about WiSER was published at FAST '20. The title is "Read as Needed: Building WiSER, a Flash-Optimized Search Engine". http://pages.cs.wisc.edu/~jhe/fast20-wiser.pdf
(Feb 19, 2020: I (Jun) am not a grad student anymore. It would be great if someone could help us to improve this repository. Otherwise, I'll try to find some spare time...)
(Jan 12, 2020 update: We will improve this repo to make it easy to run. Vacuum is well tuned and runs pretty fast.)
The main C++ code of Vacuum is in `src/qq_mem/`. We also have lots of experimental code in the repository, at least for now.
- `data/` Data for benchmarking and some scripts to manipulate the data.
- `scripts/` A bunch of Python and Shell scripts for our experiments and setup.
- `src/`
  - `lucene/` A copy of the Lucene code. We played with it.
  - `pysrc/` Some Python code.
    - `benchmarks/` Scripts for benchmarking redisearch, elasticsearch, ...
    - `in_mem/` We developed a minimal Python in-memory engine here.
  - `qq_mem/` This is the main directory for Vacuum. We have the name "qq_mem" because things evolve and we are too lazy to change directory names.
    - `src/` Vacuum source code (C++).
    - `tools/` A bunch of helper scripts.
    - `README.md` Instructions on how to run Vacuum.
  - `tutorials/` This has some Lucene examples that we played with.
We evaluate search engines with synthetic and real queries. The synthetic queries can be generated by `src/qq_mem/tools/gen_synthetic_log.py`. Real queries can be found at http://www.wikibench.eu/.
Basically, we sampled terms in Wikipedia according to their frequencies.
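As a rough illustration of frequency-based sampling, here is a minimal Python sketch. It is not the code in `gen_synthetic_log.py`; the function name `sample_query_terms` and the toy frequency table are made up for this example, and it only assumes that each term comes with an occurrence count.

```python
import random
from collections import Counter


def sample_query_terms(term_freqs, n_queries, seed=None):
    """Draw n_queries terms, each with probability proportional to its frequency.

    term_freqs is a hypothetical {term: count} mapping; the real scripts
    may store and sample frequencies differently.
    """
    rng = random.Random(seed)  # private RNG so runs are reproducible
    terms = list(term_freqs)
    weights = [term_freqs[t] for t in terms]
    # random.choices samples with replacement according to the weights
    return rng.choices(terms, weights=weights, k=n_queries)


# Toy frequencies (invented for illustration, not taken from Wikipedia)
toy_freqs = {"the": 1000, "search": 50, "flash": 5}
queries = sample_query_terms(toy_freqs, n_queries=10000, seed=42)
counts = Counter(queries)
```

With weights like these, frequent terms dominate the generated log, which is roughly what a frequency-based synthetic workload looks like.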
Please see src/qq_mem/.
Please contact Jun He ([email protected]) and Kan Wu ([email protected]) if you have any questions.