This project explains and reports on the supervised machine learning models used to predict gender from a first name. The data comprises 25,595 names labelled male, female, or uni-sex. Feature engineering converted names containing special characters to their ASCII equivalents and reduced the dataset to only the binary targets male and female. One-hot encoding was then used to embed each name as a flattened one-dimensional vector. Finally, after splitting the dataset into training, validation, and testing sets, we used grid search and gradient boosting to evaluate models and find the best one.
The website Behind the Name allows users to find the history and etymological meaning of names in various languages. We used its given names + gender dataset, which can be found here, to train our model.
We first removed all rows labelled uni-sex, as we were only concerned with the binary targets male and female, which we labelled 0 and 1 respectively. Then, we mapped each name containing ISO Latin 1 characters to its ASCII equivalent using the unidecode package (further details can be found here). This left us with 24,595 names for training.
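For illustration, a minimal sketch of this transliteration step (the example names here are hypothetical, not drawn from the dataset):

```python
# Requires the unidecode package (pip install unidecode).
from unidecode import unidecode

# ISO Latin 1 characters are mapped to their closest ASCII equivalents.
print(unidecode("José"))      # -> "Jose"
print(unidecode("Zoë"))       # -> "Zoe"
print(unidecode("François"))  # -> "Francois"
```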
Consider a name as a sequence of letters drawn from the 26-letter alphabet (after transliteration). Each letter of a name can then be one-hot encoded as a 26-dimensional vector, giving one row per letter. To make this encoding usable in scikit-learn, every name must map to a vector of the same length, so we pad each encoding with rows of zeros until the total number of rows equals the length of the longest name in the dataset, then flatten the matrix into a single vector.
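A minimal sketch of this encoding, assuming a lowercase 26-letter alphabet after transliteration (the helper name `encode_name` and the example names are our own, for illustration):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_TO_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def encode_name(name: str, max_len: int) -> np.ndarray:
    """One-hot encode each letter of a name into a (max_len, 26) matrix,
    pad the remaining rows with zeros, and flatten to a 1-D vector."""
    matrix = np.zeros((max_len, len(ALPHABET)))
    for row, char in enumerate(name.lower()[:max_len]):
        if char in CHAR_TO_INDEX:
            matrix[row, CHAR_TO_INDEX[char]] = 1.0
    return matrix.flatten()

# Hypothetical usage: pad every name to the length of the longest name.
names = ["anna", "jonathan"]
max_len = max(len(n) for n in names)
X = np.stack([encode_name(n, max_len) for n in names])
print(X.shape)  # (2, 8 * 26) -> (2, 208)
```

Flattening the padded matrix gives every name a fixed-length vector, which is what scikit-learn estimators expect as input.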
An interactive 2-D PCA projection of all our one-hot encoded names can be found here.
An interactive 3-D PCA projection of all our one-hot encoded names can be found here.
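The projections can be produced along these lines (a minimal sketch with stand-in random data; the plotting choices below are ours, not from the report):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# X: one-hot encoded name vectors, y: binary gender labels (0/1).
# Stand-in random data here; in the report these come from the
# encoding step described above.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 208)).astype(float)
y = rng.integers(0, 2, size=500)

# Use n_components=3 for the 3-D version.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=y, s=4, cmap="coolwarm")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("2-D PCA projection of one-hot encoded names")
plt.show()
```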
We used GridSearchCV with multicore n_jobs = 5 to find the best hyper-parameters for binary classification with k-NN and logistic regression. We found that metric = cosine and n_neighbors = 3 were the best hyper-parameters for k-NN, and C = 360 was the best hyper-parameter for logistic regression. Comparing validation accuracies for both of those models, along with a decision tree classifier with criterion = entropy, we found that the decision tree classifier performed best.
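A sketch of the search, assuming `X_train` and `y_train` are the training split described earlier. The candidate grids below are illustrative assumptions; the report only states the winning values, not the full grids searched:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Multicore grid search over k-NN hyper-parameters, as in the report.
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"metric": ["cosine", "euclidean"], "n_neighbors": [3, 5, 7]},
    n_jobs=5,
)
knn_search.fit(X_train, y_train)

# Grid search over the regularization strength for logistic regression.
logreg_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [1, 10, 100, 360, 1000]},
    n_jobs=5,
)
logreg_search.fit(X_train, y_train)

print(knn_search.best_params_)     # reported: {'metric': 'cosine', 'n_neighbors': 3}
print(logreg_search.best_params_)  # reported: {'C': 360}
```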
As the validation accuracy for all of our models ranged from 0.71 to 0.74, we decided to fit an ensemble classifier to push accuracy higher. Since decision trees had the highest validation accuracy, we chose the gradient boosting classifier, an ensemble model that fits a sequence of boosted decision trees, each one trained on the gradient of the loss of the ensemble so far. After a brute-force search over hyper-parameters, we found that n_estimators = 10000 and learning_rate = 1.0 gave the best validation accuracy at about 0.794, considerably better than the previous three models.
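A sketch of the final model with the reported hyper-parameters, assuming `X_train`/`y_train` and `X_val`/`y_val` are the splits described earlier:

```python
from sklearn.ensemble import GradientBoostingClassifier

# 10,000 boosting stages with learning rate 1.0, as reported.
# Note that this many stages makes fitting computationally expensive.
gbc = GradientBoostingClassifier(n_estimators=10000, learning_rate=1.0)
gbc.fit(X_train, y_train)

print("validation accuracy:", gbc.score(X_val, y_val))  # ~0.794 reported
```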
After plotting accuracy, precision, recall, and area under the ROC curve, we can see that only the decision tree and gradient boosting models have all scores above the 0.70 threshold. Additionally, gradient boosting outperforms every other model by a decent margin, making it the final model we chose for predicting gender from names.
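These scores can be computed along the following lines, assuming `gbc` is the fitted model from the sketch above and `X_test`/`y_test` are the held-out split:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_pred = gbc.predict(X_test)             # hard class labels for accuracy/precision/recall
y_prob = gbc.predict_proba(X_test)[:, 1]  # class-1 probabilities for ROC AUC

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```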
To predict gender from names, we used the Behind the Name dataset and conducted feature engineering, one-hot encoding, and splitting to make the dataset usable in scikit-learn. After running a grid search and plotting k-NN, logistic regression, and decision trees with their optimal hyper-parameters, we found that decision trees had the best validation performance. Following this, we fit a gradient boosting ensemble classifier with 10,000 boosting stages and a learning rate of 1.0, producing a final model that outperforms the previous three by a considerable margin.