
Document Store Design

fccoelho edited this page Jul 12, 2012 · 7 revisions

The document store will be based on the following schema:

Collections (or tables, if on a relational database)

Documents:

=========

{"_id":"ObjectId",
 "text":"raw utf8 text of the document",
 "filename":"filename as stored in GridFS",
}
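
As a rough illustration, a record of this shape can be assembled in plain Python before it is handed to the store. The helper name `make_document` is hypothetical, and a `uuid4` hex string stands in for the ObjectId that the database would assign; neither is part of the design above.

```python
import uuid

def make_document(text, filename):
    """Build a dict shaped like an entry in the Documents collection."""
    if not isinstance(text, str):
        raise TypeError("text must be a decoded (unicode) string")
    return {
        "_id": uuid.uuid4().hex,  # placeholder for a real ObjectId (assumption)
        "text": text,             # raw text of the document
        "filename": filename,     # name under which the file is stored in GridFS
    }

doc = make_document("The quick brown fox.", "fox.txt")
```

Keeping `text` next to `filename` means an analysis can read the text directly and only go back to GridFS when the original file is needed.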

Analyses:

========

{"_id":"ObjectId",
 "doc_id": "Id of the document this analysis is based on",
 "type": "Pos_tag"|"tf"|"bigrams",
 "data": "Result of the analysis",
}
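
A minimal sketch of how "tf" and "bigrams" records might be produced for one document. The whitespace tokenizer and the helper names are assumptions made for illustration, not the project's actual analysis code, and `_id` is left out on the assumption that the store assigns it on insert.

```python
from collections import Counter

def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption, for illustration)."""
    return text.lower().split()

def tf_analysis(doc_id, text):
    """Return an Analyses-style record holding raw term counts."""
    return {"doc_id": doc_id, "type": "tf", "data": dict(Counter(tokenize(text)))}

def bigram_analysis(doc_id, text):
    """Return an Analyses-style record holding adjacent token pairs."""
    tokens = tokenize(text)
    pairs = zip(tokens, tokens[1:])
    return {"doc_id": doc_id, "type": "bigrams", "data": [list(p) for p in pairs]}

tf = tf_analysis("doc1", "the cat saw the dog")
bigrams = bigram_analysis("doc1", "the cat saw the dog")
```

Storing one record per (document, analysis type) pair lets new analysis types be added without touching the Documents collection.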

Corpora:

=======

{"_id":"ObjectId",
 "slug": "slug name of the corpus",
 "name": "Full name of the corpus",
 "documents": [doc_id1, doc_id25, etc],
 "tf-idf": [tf-idf(term) for term in corpus],
 "df": [df(term) for term in corpus],
 "entropy": 2.3, # see http://en.wikipedia.org/wiki/Latent_semantic_indexing
}
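
The schema does not say how the corpus-level "df" and "tf-idf" values are computed, so the following is only one possible reading: df counts the documents containing a term, and tf-idf multiplies the term's total count in the corpus by log(N / df). The function name `corpus_stats` is hypothetical, and the results are keyed by term rather than stored as bare lists. The "entropy" field is not covered here because its definition is not given.

```python
import math
from collections import Counter

def corpus_stats(docs):
    """docs maps doc_id -> list of tokens. Returns (df, tfidf) dicts keyed by term."""
    n_docs = len(docs)
    df = Counter()  # number of documents containing each term
    tf = Counter()  # total occurrences of each term across the corpus
    for tokens in docs.values():
        tf.update(tokens)
        df.update(set(tokens))  # count each term once per document
    # Assumed weighting: corpus-wide count times log inverse document frequency.
    tfidf = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    return dict(df), tfidf

docs = {"d1": ["cat", "dog", "cat"], "d2": ["dog", "bird"]}
df, tfidf = corpus_stats(docs)
```

Under this weighting a term that appears in every document, such as "dog" in the sample, gets a tf-idf of zero.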

Stopwords:

=========

{"words":[], # list of words
 }
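
A small sketch of how such a record could be used to filter tokens before an analysis. The sample words and the helper name `remove_stopwords` are made up for illustration; the case-insensitive comparison is also an assumption.

```python
stopwords = {"words": ["the", "a", "of"]}  # sample entry, not the real list

def remove_stopwords(tokens, stopword_doc):
    """Drop tokens that appear in the stopword record (case-insensitive)."""
    blocked = set(stopword_doc["words"])  # a set gives fast membership tests
    return [t for t in tokens if t.lower() not in blocked]

kept = remove_stopwords(["The", "history", "of", "a", "corpus"], stopwords)
```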

Named Entities:

==============

{"_id":"ObjectId",
 "type": "company names",
 "words":[(canonical_name,[list of variations])], 
 }
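
One way to use the (canonical_name, variations) pairs is to flatten them into a lookup table that maps every spelling to its canonical form. The entries below are sample data and `build_lookup` is a hypothetical helper, not part of the design.

```python
entity_doc = {
    "type": "company names",
    "words": [  # sample data for illustration only
        ("IBM", ["I.B.M.", "International Business Machines"]),
        ("General Electric", ["GE", "G.E."]),
    ],
}

def build_lookup(entity_doc):
    """Map each lowercase spelling (canonical or variation) to the canonical name."""
    lookup = {}
    for canonical, variations in entity_doc["words"]:
        lookup[canonical.lower()] = canonical
        for variation in variations:
            lookup[variation.lower()] = canonical
    return lookup

lookup = build_lookup(entity_doc)
canonical = lookup.get("g.e.")
```

Building the table once per entity type keeps normalization to a single dictionary lookup per token or phrase.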

Etc. Please add other collection types as needed.