Document Store Design
fccoelho edited this page Jul 12, 2012
The document store will be based on the following schema:
=========
{"_id":"ObjectId",
"text":"raw utf8 text of the document",
"filename":"filename as stored in GridFS",
}
========
{"_id":"ObjectId",
"doc_id": "Id of the document this analysis is based on",
"type": "Pos_tag"|"tf"|"bigrams",
"data": "Result of the analysis",
}
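As a rough illustration of how the two schemas above relate, here is a minimal sketch that builds a document record and a "tf" analysis record pointing back at it. The helpers `new_id`, `make_document`, `make_analysis` and `term_frequencies` are hypothetical names, not part of any existing code, and a `uuid4` hex string stands in for a MongoDB ObjectId so the example runs without a database.

```python
import uuid
from collections import Counter

# The analysis types listed in the schema above.
ANALYSIS_TYPES = {"Pos_tag", "tf", "bigrams"}

def new_id():
    """Stand-in for a MongoDB ObjectId (assumption: any unique string works here)."""
    return uuid.uuid4().hex

def make_document(text, filename):
    """Build a record shaped like the document schema."""
    return {"_id": new_id(), "text": text, "filename": filename}

def make_analysis(doc, analysis_type, data):
    """Build an analysis record that references its source document via doc_id."""
    if analysis_type not in ANALYSIS_TYPES:
        raise ValueError("unknown analysis type: %r" % analysis_type)
    return {"_id": new_id(), "doc_id": doc["_id"],
            "type": analysis_type, "data": data}

def term_frequencies(text):
    """Naive whitespace tokenization; a real pipeline would use a proper tokenizer."""
    return dict(Counter(text.lower().split()))

doc = make_document("the cat sat on the mat", "cat.txt")
tf = make_analysis(doc, "tf", term_frequencies(doc["text"]))
```

Keeping each analysis in its own record, linked by `doc_id`, means new analysis types can be added later without rewriting the stored documents.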
{"_id":"ObjectId",
"slug": "slug name of the corpus",
"name": "Full name of the corpus",
"documents": [doc_id1, doc_id25, etc],
"tf-idf": [tf-idf(term) for term in corpus],
"df": [df(term) for term in corpus],
"entropy": 2.3, # see http://en.wikipedia.org/wiki/Latent_semantic_indexing
}
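The corpus schema above does not say how the "df" and "tf-idf" lists are ordered or which tf-idf variant is meant, so the following is only a sketch under stated assumptions: both lists are aligned with the alphabetically sorted vocabulary, tf is the total count of a term across the corpus, and idf is log(N / df). The function name `corpus_stats` and the sample tokens are hypothetical. The "entropy" field is left out because the schema does not define how it is computed.

```python
import math
from collections import Counter

def corpus_stats(docs):
    """docs: list of token lists. Returns (vocab, df, tf-idf) as aligned lists."""
    n_docs = len(docs)
    tf = Counter()  # total occurrences of each term across the corpus
    df = Counter()  # number of documents containing each term
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    vocab = sorted(tf)  # assumption: lists follow sorted vocabulary order
    df_list = [df[t] for t in vocab]
    tfidf_list = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
    return vocab, df_list, tfidf_list

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
vocab, df_list, tfidf_list = corpus_stats(docs)

corpus = {
    "slug": "animals",                               # illustrative values only
    "name": "Animals example corpus",
    "documents": ["doc_id1", "doc_id2", "doc_id3"],  # placeholder ids
    "tf-idf": tfidf_list,
    "df": df_list,
}
```

A term that occurs in every document (here "the") gets a tf-idf of zero under this variant, which is the usual motivation for the idf factor.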
{"words":[], # list of words
}
{"_id":"ObjectId",
"type": "company names",
"words":[(canonical_name,[list of variations])],
}
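One way a "words" list of (canonical_name, [variations]) pairs could be used is to normalize names found in text. The sketch below assumes case-insensitive matching; the function `build_lookup` and the sample entries are hypothetical and only illustrate the shape of the data.

```python
def build_lookup(words):
    """Map each variation (and the canonical name itself), lowercased,
    to its canonical name."""
    lookup = {}
    for canonical, variations in words:
        lookup[canonical.lower()] = canonical
        for variation in variations:
            lookup[variation.lower()] = canonical
    return lookup

# Illustrative record following the schema above (sample data, not real content).
record = {
    "type": "company names",
    "words": [
        ("International Business Machines", ["IBM", "I.B.M."]),
        ("General Electric", ["GE", "G.E."]),
    ],
}

lookup = build_lookup(record["words"])
canonical = lookup.get("ibm")  # lookups use the lowercased surface form
```

Note that the pairs would most likely come back from the database as two-element lists rather than tuples; the unpacking in `build_lookup` works for either.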
Please add other collection types as needed.