This is my learning project from taking the Machine Learning A-Z course on Udemy. The goal is to do sentiment analysis/classification on restaurant review texts using a bag-of-words model.
The task is to determine whether a given restaurant review is negative or positive.
The data is provided by the course material.
Python version: 3.7
Packages: pandas, numpy, matplotlib.pyplot, re, nltk
Check what the data looks like:
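A minimal sketch of this inspection step with pandas. The file format (tab-separated), the column names `Review` and `Liked`, and the sample rows are all assumptions made for illustration; they are placeholders, not the course data.

```python
import csv
import io

import pandas as pd

# Placeholder rows standing in for the course's review file (not the real data).
# The tab-separated layout and the column names "Review" and "Liked" are assumptions.
sample_tsv = (
    "Review\tLiked\n"
    "The food was great.\t1\n"
    "Service was slow and not friendly.\t0\n"
    "Would come back again.\t1\n"
)

# QUOTE_NONE keeps quotation marks inside reviews as plain text.
dataset = pd.read_csv(io.StringIO(sample_tsv), sep="\t", quoting=csv.QUOTE_NONE)

print(dataset.head())   # first few rows
print(dataset.shape)    # (number of reviews, number of columns)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to that file.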
Clean the data with the following steps:
- replace anything that is not a letter with a space
- convert everything to lower case
- customize the stop word list by excluding the word "not" from it
- stem every word that is not in the stop word list
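The cleaning steps above could be sketched roughly as follows. The helper name `clean_review` and the short stop word set are hypothetical; the set stands in for a full list such as `nltk.corpus.stopwords.words("english")`, which requires a separate download. The sketch also assumes that stop words are dropped from the review and that the Porter stemmer is used, neither of which is stated explicitly above.

```python
import re

from nltk.stem.porter import PorterStemmer

# Small stand-in stop word list (assumption). A full list could come from
# nltk.corpus.stopwords.words("english") after nltk.download("stopwords").
STOP_WORDS = {"a", "an", "and", "is", "not", "the", "this", "was"}

# Customize the list: take "not" out so that it stays in the cleaned text.
STOP_WORDS.discard("not")

stemmer = PorterStemmer()


def clean_review(text):
    """Hypothetical helper: apply the cleaning steps to one review."""
    # 1. Replace every character that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 2. Lower-case everything and split into words.
    words = letters_only.lower().split()
    # 3. Drop stop words and stem the remaining words.
    kept = [stemmer.stem(word) for word in words if word not in STOP_WORDS]
    return " ".join(kept)


print(clean_review("Wow... Loved this place."))  # wow love place
print(clean_review("The crust is not good."))    # crust not good
```

Because "not" is removed from the stop word list, the second review keeps its negation after cleaning.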
Split the data into training and test sets. The model used for this project: Naive Bayes
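A rough sketch of the bag-of-words, split, and training steps. It assumes scikit-learn, which is not in the package list above, and the `MultinomialNB` variant, since the Naive Bayes variant is not specified. The corpus, labels, and split ratio are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up, already-cleaned reviews and labels (1 = positive, 0 = negative).
corpus = [
    "wow love place", "crust not good", "food great", "servic slow",
    "not tasti", "love food", "great servic", "not come back",
]
labels = [1, 0, 1, 0, 0, 1, 1, 0]

# Bag of words: one column per distinct word, values are word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Hold out 25% of the reviews for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

# Fit Naive Bayes on the training part and predict the held-out part.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(y_pred)
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate the model on the same test reviews.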
confusion matrix:
accuracy: 0.67
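For reference, a confusion matrix and an accuracy score can be computed along these lines. This sketch assumes scikit-learn's `sklearn.metrics` module, and the label vectors are made up for illustration; they are not the results of this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Accuracy = correct predictions (the diagonal) divided by all predictions.
accuracy = accuracy_score(y_true, y_hat)
print(accuracy)  # 0.75
```

In this toy case the diagonal holds 3 + 3 correct predictions out of 8, which gives the accuracy of 0.75.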