new index

Index format

Hunt defines a ContextIndex as top level index. A ContextMap maps Contexts to their actual index implementations (subsequently called term index, to not confuse with the many sorts of indices introduced here). In addition to the termin indices a DocTable is stored. It maps unique DocIds to the actual Documents. Currently the ContextIndex is parameterized over the DocTable type to allow for different storage backends. The default implementation is an in-memory Map.

type ContextMap = Map Context (ContextSchema, IndexImpl)

data ContextIndex dt = CIx { ciIndex :: ContextMap
                           , ciDocs  :: dt
                           }

Any newly indexed document mutates the ContextIndex and its contents is merged into the existing indices. This way we can compactly store the indexed terms and redundancy is eliminated quickly. Same goes for deletion: its just a mutation of the ContextIndex.

While being quite easy to maintain this representation makes it impossible to efficiently write indexed documents to disk. As the index structures are mutated we can't just write the changed parts to disk.

The index for a segment is structured hierarchically.

The term index (*.tix) where each term is mapped to its ocurrences
Occurence index (*.oix) where each occurrence is mapped to their positions
Position index (*.pix) lists each position a word appears in a document

Considerations

Term index

The term index stores the words and the occurrence pointer sequentially, order ascendingly. Common prefixes are shared from previous words.

<prefix length><suffix length><term><num occs><occ pointer>

prefix ::= varint, denotes the shared part from the previous term
suffix ::= varint, denotes the length of the actual term (modulo suffix) stored
term   ::= bytearray, utf-8 encoded byte sequence denoting the actual term
num occs ::= the number of occurrences for the term 
occ pointer :: varint, an offset into the occurrence index

Occurrence index

Stores the actual occurrences

<docid><num pos><pos pointer>
docid ::= document id where the term occurred
num pos ::= varint, number of positions belonging to this occurrence
pos pointer ::= varint, pointer into the position index

Position index

Stores the actual positions

<pos> ::= varint, position in the document

Provide feedback

Saved searches