There are various ways to obtain a dataset for this task. One is to use the Twitter API (https://developer.twitter.com/en) to stream live data and store it. The other is to use a competition dataset. This project uses the dataset from https://www.kaggle.com/kazanova/sentiment140. Since Twitter did not approve my developer account, the first approach remains a TODO.
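For reference, here is a minimal sketch of the streaming approach using the tweepy library (an assumed dependency, not part of this repo); it requires an approved developer account and a bearer token:

```python
# Sketch only: stream live tweets with tweepy (assumed dependency) and
# append them to a local file. Requires a valid bearer token.
import tweepy

class TweetCollector(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # Store each tweet on its own line for later preprocessing.
        with open("datasets/streamed_tweets.txt", "a", encoding="utf-8") as f:
            f.write(tweet.text.replace("\n", " ") + "\n")

collector = TweetCollector("YOUR_BEARER_TOKEN")           # hypothetical credential
collector.add_rules(tweepy.StreamRule("python lang:en"))  # example filter rule
collector.filter()
```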
- Download the file from https://www.kaggle.com/kazanova/sentiment140, create a datasets folder, and put it there
- Run preprocess.py on both the train and test data. This will generate a preprocessed version of the dataset (see the sketch after this list).
- Run stats.py with the path of the CSV generated by preprocess.py as its argument. This gives general statistical information about the dataset and will generate two pickle files containing the frequency distributions of unigrams and bigrams in the training dataset.
- Run model.py
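As a rough illustration of what the preprocess and stats steps amount to, here is a combined sketch; the cleaning rules, file names, and column layout are assumptions, not the actual code of preprocess.py or stats.py:

```python
# Sketch of the preprocess + stats steps: clean tweets, then count
# unigram/bigram frequencies and pickle them. The cleaning rules and
# file names here are assumptions, not the scripts' actual logic.
import csv
import pickle
import re
from collections import Counter

def clean(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "URL", tweet)   # replace links
    tweet = re.sub(r"@\w+", "USER_MENTION", tweet)  # replace @handles
    tweet = re.sub(r"[^a-z\s]", " ", tweet)         # drop punctuation/digits
    return tweet.split()

unigrams, bigrams = Counter(), Counter()
with open("datasets/dataset_manual_raw.csv", encoding="latin-1") as f:
    for row in csv.reader(f):
        tokens = clean(row[-1])  # sentiment140: tweet text is the last column
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

with open("freqdist", "wb") as f:
    pickle.dump(unigrams, f)
with open("freqdist-bi", "wb") as f:
    pickle.dump(bigrams, f)
```

Pickling plain Counter objects keeps the two frequency files cheap to reload in the later steps.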
- dataset_manual_raw.csv - the raw dataset from https://www.kaggle.com/kazanova/sentiment140
- freqdist - frequency distribution of unigrams
- freqdist-bi - frequency distribution of bigrams
- glove-seeds - GloVe seed vectors from https://github.com/stanfordnlp/GloVe
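GloVe vectors are distributed as plain text, one token per line followed by its vector components; a minimal loader sketch (the file name here is an assumption):

```python
# Load GloVe seed vectors into a dict mapping word -> vector (list of floats).
def load_glove(path="glove-seeds.txt"):  # assumed file name
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```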