Document Store Design
fccoelho edited this page Jul 16, 2012
The document store will be based on the following schema:
{"_id":"ObjectId",
"text":"raw utf8 text of the document",
"filename":"filename as stored in the GRIDFS",
}
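As a minimal sketch of the schema above, a record for this collection could be assembled as a plain Python dictionary. The helper name `make_document` and the sample values are hypothetical; the sketch assumes `_id` is omitted so that an ObjectId is generated when the record is inserted.

```python
def make_document(text, filename):
    """Build a record for the documents collection (hypothetical helper)."""
    if not isinstance(text, str):
        raise TypeError("text must be a decoded (unicode) string")
    # "_id" is left out on purpose: an ObjectId is generated on insert.
    return {
        "text": text,          # raw text of the document
        "filename": filename,  # filename as stored in the GridFS
    }


doc = make_document("The quick brown fox.", "fox.txt")
```

The resulting dictionary can be handed to whatever driver call performs the insert.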
Each document in this collection is a pre-computed analysis of a single document. The number of documents in this collection is expected to be |documents| x |analyses|.
{"_id":"ObjectId",
"doc_id": "Id of the document this analysis is based on",
"type": "Pos_tag"|"tf"|"bigrams",
"data": "Result of the analysis",
}
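The following sketch shows how one analysis record of type "tf" might be produced and how the expected collection size follows from it. The function name `tf_analysis`, the whitespace tokenization, and the string document ids are assumptions made for illustration only.

```python
from collections import Counter

# Analysis types listed in the schema above.
ANALYSIS_TYPES = ("Pos_tag", "tf", "bigrams")


def tf_analysis(doc_id, text):
    """Return a term-frequency analysis record (hypothetical helper)."""
    # Assumption: naive lowercase whitespace tokenization.
    tokens = text.lower().split()
    return {
        "doc_id": doc_id,
        "type": "tf",
        "data": dict(Counter(tokens)),
    }


analysis = tf_analysis("doc1", "to be or not to be")

# With one record per (document, analysis type) pair, the collection
# holds |documents| x |analyses| records.
doc_ids = ["doc1", "doc2", "doc3"]
expected_records = len(doc_ids) * len(ANALYSIS_TYPES)
```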
Each document in this collection is a corpus.
{"_id":"ObjectId",
"slug": "slug name of the corpus",
"name": "Full name of the corpus",
"documents": [doc_id1, doc_id25, etc],
"tf-idf": {term:tf-idf(term) for term in corpus},
"champions": {term:[list of top-scoring doc_ids]}, # defined for the official weighting
"df": {term:df(term) for term in corpus},
"cf": {term: cf(term) for term in corpus},
"entropy": {doc_id:entropy(doc_id) for doc_id in corpus}, # 1+sum(pij*log(pij)/log(n)), where pij = tf/cf. See http://en.wikipedia.org/wiki/Latent_semantic_indexing
}
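The corpus-level fields "df" and "cf" and the entropy formula in the comment above can be sketched from per-document term frequencies. This is an illustration under stated assumptions, not the project's implementation: the function name `corpus_statistics` is hypothetical, n is taken to be the number of documents in the corpus, and the entropy is keyed by term (as in the log-entropy weighting on the linked Wikipedia page), whereas the schema keys its "entropy" field by doc_id.

```python
import math
from collections import Counter


def corpus_statistics(tf_by_doc):
    """tf_by_doc maps doc_id -> {term: tf}. Returns (df, cf, entropy)."""
    df, cf = Counter(), Counter()
    for tf in tf_by_doc.values():
        for term, count in tf.items():
            df[term] += 1       # number of documents containing the term
            cf[term] += count   # total occurrences across the corpus
    n = len(tf_by_doc)          # assumption: n = number of documents
    entropy = {}
    for term in cf:
        total = 0.0
        for tf in tf_by_doc.values():
            count = tf.get(term, 0)
            if count:
                p = count / cf[term]  # pij = tf / cf
                total += p * math.log(p)
        # 1 + sum(pij * log(pij)) / log(n); undefined for n == 1,
        # so fall back to 1.0 there (an assumption of this sketch).
        entropy[term] = 1.0 + total / math.log(n) if n > 1 else 1.0
    return dict(df), dict(cf), entropy


tf_by_doc = {"d1": {"a": 2, "b": 1}, "d2": {"a": 1, "c": 1}}
df, cf, entropy = corpus_statistics(tf_by_doc)
```

Under this weighting, a term concentrated in a single document scores 1, and a term spread evenly over all documents scores 0.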
{"words":[], # list of words
}
Each document in this collection is a thematic collection of names, e.g. proper names, food names, company names, etc.
{"_id":"ObjectId",
"type": "company names",
"words":[(canonical_name,[list of variations])],
}
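One way such a record might be used is to resolve any spelling variation back to its canonical name. The sketch below assumes case-insensitive matching; the function name `build_variation_index` and the sample company entries are hypothetical.

```python
def build_variation_index(names_doc):
    """Map each lowercased variation (and canonical name) to its canonical name."""
    index = {}
    for canonical, variations in names_doc["words"]:
        index[canonical.lower()] = canonical
        for variation in variations:
            index[variation.lower()] = canonical
    return index


# Sample record following the schema above (values are illustrative).
names_doc = {
    "type": "company names",
    "words": [
        ("International Business Machines", ["IBM", "I.B.M."]),
        ("General Electric", ["GE"]),
    ],
}

index = build_variation_index(names_doc)
canonical = index.get("ibm")  # look up a variation, ignoring case
```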
Please add other collection types as needed.