Document Store Design

fccoelho edited this page Jul 15, 2012 · 7 revisions

The document store will be based on the following schema:

Collections (or tables if on a relational db)

Documents:

=========

{"_id":"ObjectId",
 "text":"raw utf8 text of the document",
 "filename":"filename as stored in GridFS",
}

Analyses:

Each document in this collection is a pre-computed analysis of a single document. The collection is expected to hold |documents| x |analyses| entries, i.e. one per document per analysis type.

{"_id":"ObjectId",
 "doc_id": "Id of the document this analysis is based on",
 "type": "Pos_tag"|"tf"|"bigrams",
 "data": "Result of the analysis",
}
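
As a rough illustration of how such records could be produced, the sketch below builds one analysis record per analysis type for a single document. The naive tokenizer, the helper names (`tokenize`, `build_analyses`), and the use of plain strings as ids are assumptions made for the example, not part of the schema. The "Pos_tag" type is left out because it would need an external tagger.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def build_analyses(doc):
    """Return one analysis record per analysis type for a single document."""
    tokens = tokenize(doc["text"])
    tf = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [
        {"doc_id": doc["_id"], "type": "tf", "data": dict(tf)},
        # Bigram keys are joined with a space so that every key is a string.
        {"doc_id": doc["_id"], "type": "bigrams",
         "data": {" ".join(pair): n for pair, n in bigrams.items()}},
    ]


# Hypothetical sample document following the Documents schema.
doc = {"_id": "doc1", "text": "the cat sat on the mat", "filename": "cat.txt"}
analyses = build_analyses(doc)
```

Running this for every document gives the |documents| x |analyses| growth mentioned above, since each call yields one record per analysis type.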

Corpora

Each document in this collection is a corpus.

{"_id":"ObjectId",
 "slug": "slug name of the corpus",
 "name": "Full name of the corpus",
 "documents": [doc_id1, doc_id25, etc],
 "tf-idf": {term:tf-idf(term) for term in corpus},
 "champions": {term:[list of top-scoring doc_ids]}, # defined for the official weighting
 "df": [df(term) for term in corpus],
 "entropy": 2.3, # see http://en.wikipedia.org/wiki/Latent_semantic_indexing
}
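
The corpus-level fields could be derived from the per-document "tf" analyses roughly as sketched below. Several points here are assumptions rather than decisions recorded on this page: the function name `build_corpus_stats`, the reading of "tf-idf" as total corpus term frequency times log(N/df), ranking champions by raw term frequency, and ordering the "df" list by sorted term. The "entropy" field is not computed.

```python
import math
from collections import Counter


def build_corpus_stats(doc_tfs, top_k=2):
    """Compute df, a corpus-level tf-idf, and champion lists.

    doc_tfs maps doc_id -> {term: raw term frequency in that document}.
    """
    n_docs = len(doc_tfs)
    df = Counter()        # number of documents containing each term
    total_tf = Counter()  # occurrences of each term across the corpus
    for tf in doc_tfs.values():
        df.update(tf.keys())
        total_tf.update(tf)

    terms = sorted(df)
    idf = {t: math.log(n_docs / df[t]) for t in terms}
    tf_idf = {t: total_tf[t] * idf[t] for t in terms}

    # Champion list: the top_k documents with the highest tf for each term.
    # Within one term idf is constant, so ranking by tf equals ranking by tf-idf.
    champions = {}
    for t in terms:
        holders = [d for d in doc_tfs if t in doc_tfs[d]]
        holders.sort(key=lambda d: (-doc_tfs[d][t], d))
        champions[t] = holders[:top_k]

    return {
        "tf-idf": tf_idf,
        "champions": champions,
        "df": [df[t] for t in terms],  # ordered by sorted term (assumption)
    }


# Hypothetical term frequencies for three documents.
doc_tfs = {
    "doc1": {"cat": 2, "mat": 1},
    "doc2": {"cat": 1, "dog": 3},
    "doc3": {"dog": 1},
}
stats = build_corpus_stats(doc_tfs)
```

Storing "df" as a bare list only works if readers know the term order. Keeping it as a {term: df} mapping, like "tf-idf", may be less error-prone.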

Stopwords

{"words":[], # list of words
 }
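
A minimal sketch of how a stopwords record might be applied before computing an analysis. The function name `remove_stopwords` and the case-insensitive matching are assumptions for illustration.

```python
def remove_stopwords(tokens, stopwords_doc):
    """Drop every token listed in the stopwords record (case-insensitive)."""
    stop = {w.lower() for w in stopwords_doc["words"]}
    return [t for t in tokens if t.lower() not in stop]


# Hypothetical stopwords record following the schema above.
stopwords_doc = {"words": ["the", "on", "a"]}
filtered = remove_stopwords(["The", "cat", "sat", "on", "the", "mat"], stopwords_doc)
```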

Named Entities

Each document in this collection is a thematic collection of names, e.g. proper names, food names, company names, etc.

{"_id":"ObjectId",
 "type": "company names",
 "words":[(canonical_name,[list of variations])], 
 }
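
One possible use of the (canonical_name, variations) pairs is a lookup table that maps any variation back to its canonical name, sketched below. The helper name `build_variation_index`, the lowercase matching, and the sample entries are assumptions for illustration. In a document store the tuples would typically be stored as two-element arrays.

```python
def build_variation_index(entity_doc):
    """Map every lowercased variation (and canonical name) to its canonical name."""
    index = {}
    for canonical, variations in entity_doc["words"]:
        index[canonical.lower()] = canonical
        for variation in variations:
            index[variation.lower()] = canonical
    return index


# Illustrative sample record following the Named Entities schema.
entity_doc = {
    "type": "company names",
    "words": [
        ("International Business Machines", ["IBM", "I.B.M."]),
        ("Hewlett-Packard", ["HP", "Hewlett Packard"]),
    ],
}
index = build_variation_index(entity_doc)
canonical = index.get("ibm")
```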

Please add other collection types as needed.