


mehta-a/Document-Parser-and-Indexer

Document-Parser-and-Indexer

This is a sample document parser and indexer, built in Python, for searching a corpus of documents.

  1. EditDistance.py: A simple edit-distance implementation. The indexer does not currently use it, but it could serve as an alternative to the Jaccard coefficient for spelling correction, depending on your requirements.
  2. Downloading_books.py: A Python script that downloads sample documents from archive.org into your corpus. To extend the corpus, add more links and update the code accordingly.
  3. indexer+queryEngine.py: Builds a term-document inverted index used for searching the corpus. The indexer also builds a bigram index, which is used to correct spelling errors using the Jaccard coefficient.
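
The sketch below shows how the pieces described above could fit together. It is not the code from this repository: every function name, the `$` word-boundary padding, the tokenizer, and the 0.4 threshold are assumptions chosen for illustration. It builds a term-document inverted index, builds a character-bigram index over the vocabulary, ranks spelling suggestions by Jaccard coefficient, and includes a plain Levenshtein edit distance as the alternative measure.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def bigrams(term):
    """Character bigrams of a term; '$' marks the word boundaries (an assumption)."""
    padded = f"${term}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_bigram_index(vocabulary):
    """Map each character bigram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return index


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest(word, bigram_index, threshold=0.4):
    """Return vocabulary terms whose bigram sets overlap the word's, best first."""
    query_grams = bigrams(word)
    candidates = set()
    for gram in query_grams:
        candidates |= bigram_index.get(gram, set())
    scored = [(jaccard(query_grams, bigrams(c)), c) for c in candidates]
    kept = [(score, c) for score, c in scored if score >= threshold]
    return [c for _, c in sorted(kept, key=lambda pair: (-pair[0], pair[1]))]


def edit_distance(a, b):
    """Levenshtein distance, computed row by row to keep memory at O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Toy corpus standing in for the downloaded books.
docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "A quick search engine",
}
index = build_inverted_index(docs)
bigram_index = build_bigram_index(index)  # iterating a dict yields its terms

print(index["quick"])                 # [1, 3]
print(suggest("quik", bigram_index))  # ['quick']
print(edit_distance("quik", "quick")) # 1
```

A query engine built this way would look up each query term in `index`, and fall back to `suggest` when a term is missing from the vocabulary. Edit distance could then be used to re-rank the few candidates the bigram index returns, since computing it against the whole vocabulary is more expensive.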
