A topic modelling and similarity retrieval interface that helps you manage your documents. DocTopic uses Gensim, a popular Python library designed for implementing key NLP algorithms at scale.
- Some excellent tutorials can be found on the Gensim website. The Gensim maintainers also offer support and professional services.
- An interactive introduction to similarity search can be found here.
- Create a searchable corpus from your multilingual documents with two clicks.
- Use unsupervised training algorithms such as Latent Semantic Analysis and Latent Dirichlet Allocation for topic modelling purposes.
- Query your corpus to retrieve documents that are structurally similar or belong to a similar domain.
- Update your search indices with new files so that they can be retrieved later.
- Use the Jupyter notebook implementation to run the app on a remote server.
Identify relevant resources from historical project data such as:
- previous translations to be used as templates
- translation vendors who are experts in their field
- project parameters such as turn-around times, pre-processing steps, etc.
Quickly assess the similarity of files within a project to help with:
- staggered/cascading deliveries
- assigning files to multiple vendors
Classify documents automatically and create topic clusters to better understand:
- the translation needs of your customer segments
- your level of specialization and how you can use it to build your brand
DocTopic was created with Python 3.7. It requires Gensim in addition to NumPy, SciPy and PyQt5/qtpy. You will probably want to use a virtual environment, for example one managed with conda. The Anaconda distribution comes with NumPy, SciPy and PyQt5/qtpy already installed. Then install or upgrade Gensim:
pip install -U gensim
If you found anything in this repo helpful or confusing, or think something is missing, I would like to hear from you.