This project compares nine techniques for text classification. The goal was to answer the question:
Does this text need to be simplified to make it easier to understand?
The techniques include both supervised and unsupervised classification models. The training and test texts are included in the data
folder and are also available from the Kaggle competition page, UMich SIADS 695 Fall21: Predicting text difficulty.
The models were examined in four stages:
- Stage One used features derived from the texts, such as vocabulary level or difficulty as rated by the Flesch-Kincaid readability score.
- Stage Two used the distribution of the vocabulary in the two classes.
- Stage Three explored supervised and unsupervised techniques in a reduced feature space derived from the features in Stage One.
- Stage Four explored a vector space model using cosine similarity against an ideal vector representing each class.
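
The Stage Four idea can be sketched in a few lines. This is a minimal illustration and not the code used in the project: it assumes each "ideal" class vector is simply the summed term counts of that class's training texts, and the tokenizer, the labels (`needs_simplification`, `simple`), and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased, whitespace-tokenized term counts (assumed tokenizer)."""
    return Counter(text.lower().split())


def class_vector(texts):
    """Sum the term counts of every text in one class.

    Assumption: this sum stands in for the 'ideal' class vector.
    """
    total = Counter()
    for text in texts:
        total.update(bag_of_words(text))
    return total


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


def classify(text, class_vectors):
    """Return the label whose class vector is most similar to the text."""
    vec = bag_of_words(text)
    return max(class_vectors,
               key=lambda label: cosine_similarity(vec, class_vectors[label]))


# Hypothetical toy training data, not taken from the Kaggle set.
toy_training = {
    "needs_simplification": [
        "the aforementioned jurisprudence necessitates elucidation",
        "heterogeneous catalysis facilitates polymerization",
    ],
    "simple": [
        "the cat sat on the mat",
        "the dog ran home",
    ],
}
class_vectors = {label: class_vector(texts)
                 for label, texts in toy_training.items()}

print(classify("the dog sat on the mat", class_vectors))
```

Because cosine similarity ignores vector length, a short sentence and a long one can be compared against the same class vector without normalizing the counts first.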
A detailed discussion is available in the report.