Student Projects
A list of Google Summer of Code and student thesis projects for Gensim, a scientific Python package for efficient, large-scale topic modelling.
We offer a financial reward as well as technical and academic assistance for completing these projects. Expectations are high, though; read this general summary before applying.
If you'd like to work on any of the topics below, or have your own ideas, get in touch at [email protected].
Background:
Non-negative matrix factorization, NNMF [1], is a popular machine learning algorithm, widely used in collaborative filtering and natural language processing. It can be phrased as an online learning algorithm. [2]
While implementations of NNMF in Python exist [3, 4], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications. You will contribute a scalable implementation of NNMF to the Python data science world. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals:
- Demonstrate understanding of matrix factorization theory and practice by describing, implementing and evaluating a scalable version of the NNMF algorithm.
- Implement streamed NNMF [5] that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally, also implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
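The mini-batch, constant-memory requirement above can be illustrated with a minimal sketch. It is not the algorithm of [5]; it assumes a simple scheme in which each batch is encoded with multiplicative updates and only two small matrices of accumulated statistics (`A` and `B`) survive between batches. The function name `online_nmf` and all of its parameters are hypothetical.

```python
import numpy as np


def online_nmf(batches, n_features, n_components, inner_iters=50, eps=1e-9, seed=0):
    """Fit the basis W of V ~ W @ H from a stream of non-negative mini-batches.

    Each batch has shape (n_features, batch_size). Only W and the statistics
    A, B are kept between batches, so memory does not grow with the stream.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((n_features, n_components)) + eps
    A = np.zeros((n_components, n_components))  # running sum of H @ H.T
    B = np.zeros((n_features, n_components))    # running sum of V @ H.T

    for V in batches:
        # Encode the batch: multiplicative updates for H with W held fixed.
        H = rng.random((n_components, V.shape[1])) + eps
        WtV, WtW = W.T @ V, W.T @ W
        for _ in range(inner_iters):
            H *= WtV / (WtW @ H + eps)

        # Fold the batch into the statistics, then refine W against them.
        A += H @ H.T
        B += V @ H.T
        for _ in range(inner_iters):
            W *= B / (W @ A + eps)
    return W


# Toy stream (assumed data): 20 batches of 50 samples from a rank-3 model.
rng = np.random.default_rng(42)
W_true = rng.random((30, 3))
stream = (W_true @ rng.random((3, 50)) for _ in range(20))
W = online_nmf(stream, n_features=30, n_components=3)
```

Because the batches come from a generator, no more than one batch is held in memory at a time; a real implementation would also need a stopping criterion and sparse-matrix support.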
Deliverables:
- Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples.
- Report: timings and accuracy of your NNMF implementation on English Wikipedia and the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your NNMF implementation. You can also evaluate the NNMF factorization quality against other factorization methods, such as SVD and LDA [9], in collaborative filtering settings (optional).
Resources:
[2] Online algorithm
[3] Christian Thurau et al. "Python Matrix Factorisation"
[4] Sklearn NMF code
[7] Gensim on github
[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society
[9] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010
[11] Topics extraction with Non-Negative Matrix Factorization in sklearn
[12] Gensim github issue #132.
Background: Explicit Semantic Analysis [1, 2] is a method of unsupervised document analysis using Wikipedia as a resource. It has many applications, for example event classification on Twitter [3].
While implementations of ESA exist in Python [4] and other languages [5], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications.
You will contribute a scalable implementation of ESA to the Python data science world. A quality implementation will be widely used in the industry.
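As a rough illustration of the idea in [1], ESA represents a text as a weighted vector over Wikipedia concepts (articles) and compares texts in that concept space. The sketch below assumes three made-up one-line "articles" as stand-ins for Wikipedia and plain TF-IDF weighting; all function names are hypothetical.

```python
import math
from collections import Counter


def build_concept_index(concept_texts):
    """Map each term to {concept: TF-IDF weight}, built from concept articles."""
    n_concepts = len(concept_texts)
    term_counts = {name: Counter(text.lower().split())
                   for name, text in concept_texts.items()}
    # Number of concepts whose article contains each term.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    index = {}
    for name, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(n_concepts / doc_freq[term])
            index.setdefault(term, {})[name] = tf * idf
    return index


def interpret(text, index):
    """Represent a text as a weighted vector over concepts."""
    vector = Counter()
    for term in text.lower().split():
        for concept, weight in index.get(term, {}).items():
            vector[concept] += weight
    return vector


def cosine(u, v):
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy stand-ins for Wikipedia articles (assumed data, not real articles).
concepts = {
    "Cat": "cat feline pet meow",
    "Dog": "dog canine pet bark",
    "Car": "car engine wheel road",
}
index = build_concept_index(concepts)
sim_pets = cosine(interpret("feline pet", index), interpret("canine pet", index))
sim_car = cosine(interpret("feline pet", index), interpret("engine road", index))
```

The two pet-related texts share no word except "pet", yet they overlap in concept space, while the car-related text does not. A scalable version would have to stream the Wikipedia dump rather than hold it in a dictionary.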
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals:
- Demonstrate understanding of semantic interpretation theory and practice by describing, implementing and evaluating a scalable version of the ESA algorithm.
- Implement streamed ESA that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
Deliverables
- Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples.
- Report: timings and accuracy of your ESA implementation on the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your ESA implementation. You can also evaluate ESA against other methods of semantic analysis, such as Latent Semantic Analysis [9, 10], in an event classification task (optional).
Resources:
[1] Evgeniy Gabrilovich and Shaul Markovitch "Wikipedia-based Semantic Interpretation for Natural Language Processing." Journal of Artificial Intelligence Research, 34:443–498, 2009
[2] Explicit Semantic Analysis.
[3] Musaev, A.; De Wang; Shridhar, S.; Chien-An Lai; Pu, C., "Toward a Real-Time Service for Landslide Detection: Augmented Explicit Semantic Analysis and Clustering Composition Approaches," in Web Services (ICWS), 2015 IEEE International Conference on , vol., no., pp.511-518, June 27 2015-July 2 2015
[4] Python implementation of ESA
[7] Gensim on github
[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society
[9] "Latent Semantic Analysis" article on Wikipedia
[10] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188
Note: Consider integration with existing Python sLDA
Background: Supervised Latent Dirichlet Allocation (sLDA) [1] is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA) [2]. It is used in predicting the number of "Likes" for a post or the number of stars in a movie review.
In vanilla LDA, we treat the topic proportions for a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In Supervised Latent Dirichlet Allocation (sLDA), we add our target variable to the LDA model, for example the number of stars assigned in a movie review or the number of "Likes" of a post.
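The generative story just described can be sketched in a few lines of NumPy. The Gaussian response drawn from the document's empirical topic frequencies is an assumption taken from the usual sLDA formulation, not something this page specifies; the function name and the parameter values (`eta`, `sigma`, the toy topics) are hypothetical.

```python
import numpy as np


def generate_slda_document(alpha, topics, eta, sigma, n_words, rng):
    """Sample one document and its response from a toy sLDA model.

    alpha  -- Dirichlet prior over topic proportions, shape (n_topics,)
    topics -- per-topic word distributions, shape (n_topics, vocab_size)
    eta    -- regression weight per topic (assumed Gaussian response)
    """
    n_topics, vocab_size = topics.shape
    theta = rng.dirichlet(alpha)                      # topic proportions
    z = rng.choice(n_topics, size=n_words, p=theta)   # topic assignment per word
    words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
    z_bar = np.bincount(z, minlength=n_topics) / n_words  # empirical topic frequencies
    y = rng.normal(loc=eta @ z_bar, scale=sigma)      # target, e.g. a star rating
    return words, z, y


rng = np.random.default_rng(0)
topics = rng.dirichlet(np.ones(20), size=3)  # 3 toy topics over a 20-word vocabulary
eta = np.array([1.0, 3.0, 5.0])              # hypothetical weights
words, z, y = generate_slda_document(
    alpha=np.full(3, 0.5), topics=topics, eta=eta, sigma=0.5, n_words=40, rng=rng
)
```

Inference runs this process in reverse: given the words and the response, it estimates the topics and the regression weights.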
While academic implementations of sLDA exist in C++ and R [3, 4], there is no Python implementation available. You will contribute a scalable implementation of sLDA to the Python data science world. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of topic modeling theory and practice by describing, implementing and evaluating sLDA.
- Implement a streamed sLDA that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
Deliverables
- Code: a pull request against gensim [5, 6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples.
- Report: timings, memory use and accuracy of your sLDA implementation on the Cornell Movie Review Corpus [8], following the same methodology as in [1]. A summary of insights into parameter selection and tuning of sLDA.
Resources:
[3] sLDA implementation in C++
[4] Implementation of sLDA in R
[7] Gensim on github
[8] Movie Review Dataset from Cornell NLP group
Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting properties, for example king − man + woman ≈ queen.
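The analogy arithmetic can be shown on toy data. The two-dimensional vectors below are hand-made for illustration (roughly "royalty" and "gender" axes), not the output of a trained model, and the helper names are hypothetical.

```python
import numpy as np

# Hand-made toy vectors (assumed): axis 0 ~ royalty, axis 1 ~ gender.
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = (w for w in vectors if w not in positive and w not in negative)
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(result)  # queen
```

With real embeddings the match is approximate, which is why the nearest neighbour by cosine similarity is returned rather than an exact equality being tested.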
The original Word2Vec algorithm cannot add words to the vocabulary after the initial training. This is quite limiting, for example, for a news recommender engine that encounters new words every day. Many other real-world uses will benefit from being able to add new words to the vocabulary during training. This modification is called online training [3] of a Word2Vec model.
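One ingredient of such online training is growing the vocabulary and the weight matrix without disturbing what has already been learned. The sketch below covers only that step, under the assumption that new words start from small random vectors; `extend_vocabulary` is a hypothetical helper, not an existing gensim function.

```python
import numpy as np


def extend_vocabulary(word_to_index, vectors, new_words, rng):
    """Return an enlarged (word_to_index, vectors) pair that covers new_words.

    Rows of already-known words are copied unchanged; unseen words receive
    small random vectors that further training would then adjust.
    """
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unseen = [w for w in dict.fromkeys(new_words) if w not in word_to_index]
    if not unseen:
        return word_to_index, vectors
    dim = vectors.shape[1]
    new_rows = (rng.random((len(unseen), dim)) - 0.5) / dim
    updated_index = dict(word_to_index)
    for offset, word in enumerate(unseen):
        updated_index[word] = len(word_to_index) + offset
    return updated_index, np.vstack([vectors, new_rows])


rng = np.random.default_rng(1)
index = {"election": 0, "vote": 1}
vectors = rng.random((2, 4))  # stand-in for trained vectors
index2, vectors2 = extend_vocabulary(index, vectors, ["vote", "brexit", "brexit"], rng)
```

A complete solution also has to update word frequencies, the sampling tables derived from them, and the output-layer weights, which this sketch leaves out.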
There is no robust implementation of Online Word2vec available in any programming language. You will contribute a scalable implementation of Online Word2Vec to the data science world in Python. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of the theory and practice of distributed representations of words by describing, implementing and evaluating Online word2vec.
- Implement a streamed Online word2vec that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
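The mini-batch requirement in the goals above boils down to never materializing the whole corpus. A minimal sketch of that pattern, using only the standard library (the name `iter_minibatches` is hypothetical):

```python
from itertools import islice


def iter_minibatches(stream, batch_size):
    """Yield lists of at most batch_size items from any iterable, lazily."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_minibatches(range(7), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Since only one batch exists at a time, peak memory depends on `batch_size`, not on the length of the stream; the same generator works for a file of sentences read line by line.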
Deliverables
- Code: a pull request against gensim [4] on github [5]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples.
- Report: timings, memory use and accuracy of your Online word2vec using the Lee corpus [6] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of Online word2vec.
Resources:
[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013)
[2] Gensim word2vec tutorial at Kaggle
[3] Online algorithm
[5] Gensim on github
[6] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society
Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting properties, for example king − man + woman ≈ queen.
Many methods have been proposed for measuring the distance between sentences in this new vector space. "Word Mover's Distance" (WMD) [3] is a novel measure of distance between text documents. It outperforms simple combinations of word vectors such as their sum or mean. Intuitively, the distance between two documents is the minimum cumulative distance that all words in document A need to travel to exactly match document B.
For example, these two sentences are close with respect to WMD even though they only have one word in common: "The restaurant is loud, we couldn't speak across the table" and "The restaurant has a lot to offer but easy conversation is not there". [4]
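The "minimum cumulative travel distance" is a transportation problem, which for small inputs can be solved directly as a linear program. The sketch below assumes SciPy's `linprog` as the solver and hand-placed 2-D word vectors; it is meant to make the definition concrete, not to be the scalable implementation this project asks for. The function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def word_movers_distance(vecs_a, weights_a, vecs_b, weights_b):
    """Exact WMD between two weighted bags of word vectors via linear programming."""
    vecs_a = np.asarray(vecs_a, dtype=float)
    vecs_b = np.asarray(vecs_b, dtype=float)
    n, m = len(vecs_a), len(vecs_b)
    # cost[i, j] = Euclidean distance between word i of A and word j of B.
    cost = np.linalg.norm(vecs_a[:, None, :] - vecs_b[None, :, :], axis=2)

    # Flow matrix T is flattened row-major: T[i, j] -> variable i * m + j.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # word i of A sends out all its weight
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # word j of B receives all its weight
    b_eq = np.concatenate([np.asarray(weights_a, float), np.asarray(weights_b, float)])

    result = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return result.fun


# Hand-placed toy vectors (assumed, not from a trained model).
vectors = {
    "obama": [0.0, 0.0], "president": [0.0, 1.0],
    "speaks": [5.0, 0.0], "greets": [5.0, 1.0],
}
doc_a, doc_b = ["obama", "speaks"], ["president", "greets"]
uniform = [0.5, 0.5]  # normalized bag-of-words weights
distance = word_movers_distance(
    [vectors[w] for w in doc_a], uniform, [vectors[w] for w in doc_b], uniform
)
```

Here the two documents share no words, yet each word has a close counterpart, so the distance comes out at approximately 1.0. The linear program has one variable per word pair, so faster solvers or lower bounds become necessary for long documents and large collections.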
While there is an academic implementation in C [5], there is no implementation of WMD available in Python. You will contribute a scalable implementation of WMD to the data science world in Python. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of the theory and practice of document distances by describing, implementing and evaluating WMD.
- Implement WMD. Processing must be done in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
Deliverables
- Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples.
- Report: timings, memory use and accuracy of your WMD using the freely available datasets in [3], for example the "20 newsgroups" corpus [8]. A summary of insights into parameter selection and tuning of document distances.
Resources:
[2] Gensim word2vec tutorial at Kaggle
[3] "From Word Embeddings to Document Distances" Kusner et al 2015
[4] Sudeep Das, "Navigating themes in restaurant reviews with Word Mover's Distance", 2015, http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
[5] Matthew J Kusner's WMD in C on github
[7] Gensim on github
Background: The author-topic model [2] is a Natural Language Processing method that tells us about a person's writing. It can tell how diverse the range of topics covered by one author is. It can also compare two authors and say how similar they are.
The best implementation is the Collapsed VB (CVB) one listed below.
The author-topic model adds information about the author to the very popular Latent Dirichlet Allocation (LDA) [6] model.
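Once a model has assigned each author a distribution over topics, both uses mentioned above reduce to simple computations on those distributions. The sketch below assumes entropy as the diversity measure and Hellinger distance as the dissimilarity measure; neither choice is prescribed by this page, and the author distributions are made up for illustration.

```python
import numpy as np


def topic_entropy(p):
    """Diversity of an author's topics: 0 for a single topic, log(K) for uniform."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))


def hellinger(p, q):
    """Distance between two topic distributions: 0 if identical, 1 if disjoint."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


# Hypothetical author-topic distributions over 3 topics.
specialist = [1.0, 0.0, 0.0]
generalist = [1 / 3, 1 / 3, 1 / 3]
other_specialist = [0.0, 1.0, 0.0]

print(topic_entropy(specialist) < topic_entropy(generalist))  # True
print(hellinger(specialist, other_specialist))                # 1.0
```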
While there are academic implementations in Python and other languages [3, 4], they are very slow for large datasets. You will contribute a scalable implementation of Author-topic modelling to the data science world in Python. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of the theory and practice of topic modelling by describing, implementing and evaluating author-topic modelling.
- Implement a streamed author-topic model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
- A very interesting point here is adapting a Gibbs sampling paper to use Gensim's variational inference.
Deliverables
- Code: a pull request against gensim [1] on github [2]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples.
- Report: timings, memory use and accuracy of your author-topic model using the NIPS papers dataset [5], following the methodology of [2]. A summary of insights into parameter selection and tuning of the model.
Resources:
[1] Rosen-Zvi, Michal, et al. "The author-topic model for authors and documents." Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004. PDF.
[2] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.
[3] Author-topic model in Python
[6] Gensim on github
[7] NIPS text corpus in MATLAB format
[8] Collapsed VB implementation
Background: Latent Dirichlet Allocation (LDA) [1] is a very popular algorithm for modelling topics of text documents.
Modern data mining relies on high-level distributed [2] frameworks like Hadoop, Spark [3], Celery [4], Disco [5], Samza [6] and Ibis [7].
While there are implementations of distributed LDA in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LDA implementation in Python. You will contribute a scalable implementation of distributed LDA to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of the theory and practice of distributed computing and topic modelling by describing, implementing and evaluating distributed LDA.
- Implement a streamed distributed LDA model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
Deliverables
- Code: a pull request against gensim [8] on github [9]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples. Gensim contains a very manual low-level distributed implementation of LDA [8] that you can build on.
- Report: timings, memory use and accuracy of your distributed LDA implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model. In particular, report how performance changes as cores and machines are added to the cluster.
Resources:
[1] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010
[2] MapReduce: Simplified Data Processing on Large Clusters
[3] Spark distributed computing framework
[4] Celery
[5] Disco
[6] Storm, Samza.
[7] Ibis
[9] Gensim on github
[10] Low-level distributed LDA in gensim
Background: Latent Semantic Indexing (LSI) [1] is a very popular algorithm for modelling topics of text documents.
Modern data mining relies on high-level distributed [2] frameworks like Hadoop, Spark [3], Celery [4], Disco [5], Samza [6] and Ibis [7].
While there are implementations of distributed LSI in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LSI implementation in Python. You will contribute a scalable implementation of distributed LSI to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of the theory and practice of distributed computing and topic modelling by describing, implementing and evaluating distributed LSI.
- Implement a streamed distributed LSI model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
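What makes LSI a good fit for the distributed, incremental setting above is that partial decompositions computed on separate document chunks can be merged without revisiting the documents. The sketch below assumes the simplest possible merge (an SVD of the two scaled bases placed side by side) and simulates two "workers" sequentially; the function names are hypothetical and this is not a description of gensim's existing code.

```python
import numpy as np


def partial_decomposition(chunk, rank):
    """What one worker computes: left singular vectors and values of its chunk."""
    U, s, _ = np.linalg.svd(chunk, full_matrices=False)
    return U[:, :rank], s[:rank]


def merge_decompositions(first, second, rank):
    """Combine two partial results into a decomposition of the joint corpus."""
    (U1, s1), (U2, s2) = first, second
    # [U1*s1 | U2*s2] has the same left singular vectors and singular values
    # as the concatenated chunks, as long as nothing was truncated.
    stacked = np.hstack([U1 * s1, U2 * s2])
    U, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :rank], s[:rank]


rng = np.random.default_rng(0)
corpus = rng.random((6, 10))                 # toy data: 6 terms x 10 documents
left, right = corpus[:, :5], corpus[:, 5:]   # two chunks, one per "worker"

U, s = merge_decompositions(
    partial_decomposition(left, rank=5),
    partial_decomposition(right, rank=5),
    rank=6,
)
s_full = np.linalg.svd(corpus, compute_uv=False)  # reference: decompose everything at once
```

Only the small factors travel between machines, never the documents themselves. When each worker truncates to a rank below that of its chunk, the merged result becomes an approximation, and quantifying that trade-off belongs in the report.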
Deliverables
- Code: a pull request against gensim [8] on github [9]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples. Gensim contains a very manual low-level distributed implementation of LSI [10] that you can build on.
- Report: timings, memory use and accuracy of your distributed LSI implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model.
Resources:
[1] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188
[2] MapReduce: Simplified Data Processing on Large Clusters
[3] Spark distributed computing framework
[4] Celery
[5] Disco
[6] Storm, Samza.
[7] Ibis
[9] Gensim on github
[10] Low-level distributed LSI in gensim
[11] LSI on Spark
Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting properties, for example king − man + woman ≈ queen.
Modern data mining relies on high-level distributed [3] frameworks like Hadoop, Spark [4], Celery [5], Disco [6], Samza [7] and Ibis [8].
While there are implementations of distributed word2vec in Scala over Spark [9] and in other languages [10], there is no established distributed computing framework that contains a word2vec implementation in Python. You will contribute a scalable implementation of distributed word2vec to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in the industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of the theory and practice of distributed computing and word representations by describing, implementing and evaluating distributed word2vec.
- Implement a streamed distributed word2vec model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
Deliverables
- Code: a pull request against gensim [11] on github [12]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights as documentation tips and examples. Gensim contains a very manual low-level implementation of distributed word2vec that you can build on.
- Report: timings, memory use and accuracy of your distributed word2vec implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model.
Resources:
[2] Gensim word2vec tutorial at Kaggle
[3] MapReduce: Simplified Data Processing on Large Clusters
[4] Spark distributed computing framework
[5] Celery
[6] Disco
[7] Storm, Samza.
[8] Ibis
[10] word2vec in DeepLearning4J
[12] Gensim on github
WordRank is a new word embedding algorithm.
Investigate how it compares to word2vec by expanding on the approach in this blog.
See https://github.com/RaRe-Technologies/gensim/issues/665
LargeVis
A technique for Visualizing Large-scale and High-dimensional Data. Faster than t-SNE!
Code in https://github.com/lferry007/LargeVis
https://arxiv.org/abs/1602.00370
Very useful in non-English languages.
The paper mentions some Recurrent Neural Network code using the blocks package.
http://arxiv.org/pdf/1608.01056.pdf
Much better performance than the current variational inference approach to fitting LDA.
Either implement in Python or find a way to load the model trained on Spark.
Shows how good your word2vec model is on specific syntactic and semantic tasks. Wrapper around this code https://github.com/ytsvetko/qvec
A sense embedding is able to learn multiple representations per word capturing different word meanings.
Integrate one of the existing word sense embeddings into gensim. Adagram is the best one currently.
Low priority, as it rarely appears in production.
Consider:
https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html
Translate from R into Python using existing Gensim code. Medium difficulty.
From gensim issue suggestion: "Hi, it seems that wordspace model is very useful (http://infomap-nlp.sourceforge.net/doc/algorithm.html and https://cran.r-project.org/web/packages/wordspace/index.html). It is similar to the lsa model except that wordspace decomposes a co-occurrence matrix instead of term-document matrix."
Change HashDictionary to use cuckoo hashing.
Hat-tip to A. Mueller
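For readers unfamiliar with cuckoo hashing, mentioned above for HashDictionary, here is a minimal generic sketch: every key has one candidate slot in each of two tables, lookups probe at most two slots, and an insert may evict an occupant and push it to its alternative slot. The class below is hypothetical and standalone; how it would be wired into HashDictionary is not specified on this page.

```python
class CuckooHashTable:
    """Toy key-value table using cuckoo hashing with two tables."""

    def __init__(self, capacity=8, max_kicks=32):
        self._capacity = capacity
        self._max_kicks = max_kicks
        self._tables = [[None] * capacity, [None] * capacity]
        self._seeds = (0x9E3779B1, 0x85EBCA77)  # arbitrary, to get two hash functions
        self._size = 0

    def __len__(self):
        return self._size

    def _slot(self, which, key):
        return hash((self._seeds[which], key)) % self._capacity

    def get(self, key, default=None):
        # A key can only live in one of its two candidate slots.
        for which in (0, 1):
            entry = self._tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            idx = self._slot(which, key)
            entry = self._tables[which][idx]
            if entry is not None and entry[0] == key:
                self._tables[which][idx] = (key, value)
                return
        # Otherwise insert, evicting occupants to their alternative table.
        entry, which = (key, value), 0
        for _ in range(self._max_kicks):
            idx = self._slot(which, entry[0])
            evicted = self._tables[which][idx]
            self._tables[which][idx] = entry
            if evicted is None:
                self._size += 1
                return
            entry, which = evicted, 1 - which
        # Too many evictions (probably a cycle): grow and reinsert everything.
        self._grow(entry)

    def _grow(self, pending):
        entries = [e for table in self._tables for e in table if e is not None]
        entries.append(pending)  # the entry left without a slot
        self._capacity *= 2
        self._tables = [[None] * self._capacity, [None] * self._capacity]
        self._size = 0
        for key, value in entries:
            self.put(key, value)


table = CuckooHashTable()
for i in range(200):
    table.put(f"word{i}", i)
```

The attraction is the worst-case constant lookup cost of two probes; the price is occasional rehashing on insert.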
Bidirectional LSTM Recurrent Neural Network paper
See paper