In this project, supervised machine learning methods are used to predict whether an previously unseen Amazon Kindle Book Review is positive or negative based on its text content. The text analysis is done in Python and is performed on 1,000,000 reviews available on https://www.kaggle.com/datasets/bharadwaj6/kindle-reviews.
The text preprocessing requires the nltk module, which may need to be installed manually. Note that the code below was developed for Google Colab. Should you want to use the same lines, make sure that the data is stored in a folder called "data".
The most important preprocessing steps are performed by the prep function. Note that the order of the steps is important, especially that in step three the data frame changes substantially. Before step three we are dealing with a list of strings, whereas after step three each string has been decomposed into a list of words.
The function prep is work in progress and updated continuously, if you are interested in the current state of this function, just contact me. I'm happy to share my code and always welcome comments.