-
Notifications
You must be signed in to change notification settings - Fork 10
new index
Hunt defines a ContextIndex
as top level index. A ContextMap
maps Context
s to their actual index implementations (subsequently called term index, to not confuse with the many sorts of indices introduced here). In addition to the termin indices a DocTable
is stored. It maps unique DocId
s to the actual Document
s. Currently the ContextIndex
is parameterized over the DocTable
type to allow for different storage backends. The default implementation is an in-memory Map
.
type ContextMap = Map Context (ContextSchema, IndexImpl)
data ContextIndex dt = CIx { ciIndex :: ContextMap
, ciDocs :: dt
}
Any newly indexed document mutates the ContextIndex
and its contents is merged into the existing indices. This way we can compactly store the indexed terms and redundancy is eliminated quickly. Same goes for deletion: its just a mutation of the ContextIndex
.
While being quite easy to maintain this representation makes it impossible to efficiently write indexed documents to disk. As the index structures are mutated we can't just write the changed parts to disk.
The index for a segment is structured hierarchically.
- The term index (*.tix) where each term is mapped to its ocurrences
- Occurence index (*.oix) where each occurrence is mapped to their positions
- Position index (*.pix) lists each position a word appears in a document
The term index stores the words and the occurrence pointer sequentially, order ascendingly. Common prefixes are shared from previous words.
<prefix length><suffix length><term><num occs><occ pointer>
prefix ::= varint, denotes the shared part from the previous term
suffix ::= varint, denotes the length of the actual term (modulo suffix) stored
term ::= bytearray, utf-8 encoded byte sequence denoting the actual term
num occs ::= the number of occurrences for the term
occ pointer :: varint, an offset into the occurrence index
Stores the actual occurrences
<docid><num pos><pos pointer>
docid ::= document id where the term occurred
num pos ::= varint, number of positions belonging to this occurrence
pos pointer ::= varint, pointer into the position index
Stores the actual positions
<pos> ::= varint, position in the document