GitHub - appeler/colornumber: Predict the distribution of race/ethnicity for a last name

Color Number

Models that predict race/ethnicity from a name often formalize the task as a classification problem (see Sood and Laohaprapanon 2018, etc.). However, if the name is the only thing we know about a person, there is generally no unique mapping to a race/ethnicity. Instead, there is a distribution, e.g., XX% identify as White, YY% as Asian, etc. Posing this regression problem as classification forces one of two choices: classifying to the mode, which discards data, or keeping the training data in a form where there is no unique mapping between a name (string) and race (Sood and Laohaprapanon 2018 choose this option). (To get calibrated probability estimates, the training data needs to be a random sample of the population, though calibration for less popular names is likely to be poor given sampling variability.) A simpler (and better) approach is to formalize the problem as a multi-output regression: predict the race/ethnicity distribution of each name. Using the Florida Voting Registration Data for 2022, we estimate a set of models that predict the distribution of race/ethnicity per name.

Our y variable is multi-output:

p_asian, p_white, p_black, p_hispanic, p_other

Our input variable is the name string.

After estimation, we normalize the outputs so that they sum to 1.
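The normalization step can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `normalize`, the choice to clip negative outputs to zero before rescaling, the uniform fallback, and the example numbers are all assumptions.

```python
# Hypothetical sketch: rescale raw model outputs so they form a distribution.
CATEGORIES = ["p_asian", "p_white", "p_black", "p_hispanic", "p_other"]


def normalize(raw, eps=1e-12):
    """Clip negatives to zero (an assumption), then rescale to sum to 1."""
    clipped = [max(v, 0.0) for v in raw]
    total = sum(clipped)
    if total < eps:
        # Degenerate case: fall back to a uniform distribution (assumption).
        return [1.0 / len(raw)] * len(raw)
    return [v / total for v in clipped]


# Made-up raw outputs for one name; they sum to 1.05 before normalization.
raw_output = [0.02, 0.70, 0.15, 0.10, 0.08]
probs = dict(zip(CATEGORIES, normalize(raw_output)))
```

A softmax over the raw outputs would be another way to obtain values that sum to 1; the README does not say which approach is used.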

We build the dataset by grouping the voter records by last_name and computing, for each name, the share of registrants in each race/ethnicity category. We then split the data into train/test at 80/20. Our loss function is the mean absolute squared loss. (We also try L1Loss and cross-entropy loss.) We fit an MLP, an LSTM, and a transformer model to predict the distribution of probabilities.
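A rough sketch of the data preparation, using only the standard library. The record layout (pairs of last name and race label), the lowercasing of names, the fixed seed, and the toy records are assumptions made for illustration; the actual notebooks may differ.

```python
import random
from collections import Counter, defaultdict

CATEGORIES = ["asian", "white", "black", "hispanic", "other"]


def name_distributions(records):
    """Group (last_name, race) pairs by name and return per-name shares."""
    counts = defaultdict(Counter)
    for last_name, race in records:
        # Lowercasing and stripping are assumptions, not from the README.
        counts[last_name.strip().lower()][race] += 1
    rows = {}
    for name, c in counts.items():
        total = sum(c.values())
        rows[name] = [c[cat] / total for cat in CATEGORIES]
    return rows


def train_test_split(names, test_frac=0.2, seed=0):
    """Shuffle names reproducibly and hold out test_frac of them."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    n_test = int(round(len(names) * test_frac))
    return names[n_test:], names[:n_test]


# Toy records, made up for this example.
records = [
    ("Lee", "asian"), ("Lee", "white"), ("Lee", "asian"), ("Lee", "black"),
    ("Garcia", "hispanic"), ("Garcia", "hispanic"),
    ("Smith", "white"), ("Smith", "black"),
    ("Nguyen", "asian"), ("Jones", "white"),
]
dists = name_distributions(records)
train_names, test_names = train_test_split(dists.keys())
```

Splitting on unique names rather than on individual records keeps every name entirely in either the training set or the test set.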

Todo

Compare how treating the same problem as a classification problem changes performance on the metric of interest: the mean absolute difference from the underlying probability distribution. (We also try cross-entropy.)
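The metric can be illustrated with a small sketch. The helper names and all numbers below are made up for illustration and are not results from this project; the "classification" prediction here is the best-case hard label, i.e., a one-hot vector on the true mode.

```python
def mean_abs_diff(pred, true):
    """Mean absolute difference between two distributions over categories."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def one_hot_mode(dist):
    """Hard classification to the mode: all mass on the largest category."""
    k = max(range(len(dist)), key=dist.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(dist))]


# Hypothetical true distribution for one name, in the order
# p_asian, p_white, p_black, p_hispanic, p_other.
true = [0.05, 0.60, 0.20, 0.10, 0.05]

# Hypothetical regression output versus a hard label on the mode.
regression_pred = [0.06, 0.55, 0.22, 0.12, 0.05]
classification_pred = one_hot_mode(true)

err_regression = mean_abs_diff(regression_pred, true)
err_classification = mean_abs_diff(classification_pred, true)
```

Even when the hard label picks the correct mode, it places all probability mass on one category, so its distance to the underlying distribution stays large whenever the name is not concentrated in a single group. A classifier that outputs class probabilities rather than a hard label would need to be evaluated separately.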

Scripts

Authors

Gaurav Sood

