Negotiating Your Compensation as a Software Engineer: A Prediction Problem Using Regression Techniques
This repository contains the codebase backing the paper of the same title.
Open the Jupyter notebooks in the top-level directory; they already contain the relevant output. The recommended reading order is Data Cleaning.ipynb, then Data Preparation.ipynb, and finally Data Prediction/Analysis - Untransformed.ipynb. Feel free to also look at Data Prediction/Analysis - Log Transformed.ipynb, but it does not contain the best models and is very similar to the untransformed version. Data Random Exploration.ipynb does not provide much value and was purely for personal exploration.
If you would like to re-run the notebooks, make sure you have Anaconda installed with the standard data analysis packages, then install the remaining dependencies with pip install -r requirements.txt.
.
├── Paper.pdf
├── requirements.txt
├── Data Cleaning.ipynb
├── Data Preparation.ipynb
├── Data Prediction/Analysis - Untransformed.ipynb
├── Data Prediction/Analysis - Log Transformed.ipynb
├── Data Random Exploration.ipynb
└── data/
    ├── raw_comp_data.json
    ├── raw_comp_data.csv
    ├── clean_comp_data.csv
    ├── prepped_comp_data.csv
    ├── prepped_comp_data_interactions.csv
    └── site-code/
Paper.pdf is a writeup detailing the process, techniques, and findings of this project.
requirements.txt lists the necessary dependencies beyond what Anaconda provides. Currently, it contains the packages needed for fuzzy string matching.
Data Cleaning.ipynb represents the first step in the process. It takes data/raw_comp_data.json as input, quickly generates data/raw_comp_data.csv for record-keeping, and finally outputs data/clean_comp_data.csv. This notebook primarily fixes inconsistent compensation inputs (salary, stock, and bonus), scales them to the same numerical range, and performs some light cleaning/pruning on various other features. The notebook contains 122 lines of code.
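As a rough illustration of the kind of scaling fix this step performs, here is a minimal pandas sketch; it is not the notebook's actual code, and the column names and the thousands-detection cutoff are assumptions made for the example.

```python
import pandas as pd

def normalize_compensation(df: pd.DataFrame, cols=("salary", "stock", "bonus")) -> pd.DataFrame:
    """Scale compensation columns reported in thousands up to full dollar amounts."""
    df = df.copy()
    for col in cols:
        # Heuristic (an assumption for this sketch): values under 1,000 were entered in thousands.
        in_thousands = df[col].between(0, 1_000, inclusive="neither")
        df.loc[in_thousands, col] = df.loc[in_thousands, col] * 1_000
    return df

raw = pd.read_csv("data/raw_comp_data.csv")   # assumes these column names exist
clean = normalize_compensation(raw)
clean.to_csv("data/clean_comp_data.csv", index=False)
```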
Data Preparation.ipynb represents the second step in the process. It takes data/clean_comp_data.csv as input and outputs prepped_comp_data.csv and prepped_comp_data_interactions.csv. This notebook primarily combines seemingly similar categorical entries (e.g., 'Amazon', 'amazon', and 'Amazon Web Services'), keeps only the most common categorical values above certain frequency thresholds, creates dummy variables for the filtered categoricals, normalizes numerical features with standard normal scaling, and log transforms the target values. Interaction terms are also introduced that cross (1) company and level, and (2) company and location; these are only present in prepped_comp_data_interactions.csv. At the end, some light analysis is performed on the resulting dataset. The notebook contains 186 lines of code.
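The sketch below illustrates the general shape of these preparation steps, not the notebook's actual code; the column names, the frequency threshold, and the use of the standard-library difflib in place of the fuzzy-matching package from requirements.txt are all assumptions for illustration.

```python
import difflib

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/clean_comp_data.csv")

# 1. Merge near-duplicate company spellings. The notebook relies on a fuzzy
#    string matching package; difflib stands in for it in this sketch.
def canonicalize(values, cutoff=0.85):
    canonical, mapping = [], {}
    for v in values:
        match = difflib.get_close_matches(v.lower(), canonical, n=1, cutoff=cutoff)
        mapping[v] = match[0] if match else v.lower()
        if not match:
            canonical.append(v.lower())
    return mapping

df["company"] = df["company"].map(canonicalize(df["company"].dropna().unique()))

# 2. Keep only the most common companies (the threshold here is illustrative).
counts = df["company"].value_counts()
df = df[df["company"].isin(counts[counts >= 50].index)].copy()

# 3. Interaction terms crossing company with level and with location.
df["company_level"] = df["company"] + "|" + df["level"]
df["company_location"] = df["company"] + "|" + df["location"]

# 4. Dummy variables for the filtered categoricals and the interaction terms.
df = pd.get_dummies(df, columns=["company", "level", "location", "tag",
                                 "company_level", "company_location"])

# 5. Standard-normal scaling for numeric features, log transform for targets.
num_cols = ["years_experience", "years_at_company"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
for target in ["salary", "stock", "bonus"]:
    df[f"log_{target}"] = np.log1p(df[target])

df.to_csv("data/prepped_comp_data_interactions.csv", index=False)
```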
Data Prediction/Analysis - Untransformed.ipynb represents the final step in the process, using non-log-transformed target values. It takes prepped_comp_data_interactions.csv as input and performs model selection on the training set using ridge, lasso, and partial least squares regressions for each of the salary, stock, and bonus targets. Finally, it takes the winning model for each target, makes predictions on the test set, and evaluates its performance. It also analyzes the coefficients of the winning regression models. The notebook contains 198 lines of code.
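The following is a minimal scikit-learn sketch of this selection-then-evaluation flow, not the notebook's actual code; the target column names, hyperparameter grids, and train/test split parameters are assumptions for illustration.

```python
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("data/prepped_comp_data_interactions.csv")
target_cols = ["salary", "stock", "bonus"]        # assumed column names
X = df.drop(columns=target_cols)                  # drop targets so they don't leak into features

# Candidate model families and illustrative hyperparameter grids.
candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]}),
    "lasso": (Lasso(max_iter=10_000), {"alpha": [0.001, 0.01, 0.1, 1.0]}),
    "pls": (PLSRegression(), {"n_components": [2, 5, 10, 20]}),
}

for target in target_cols:
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Model selection on the training set via cross-validated grid search.
    best_name, best_model, best_score = None, None, -float("inf")
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, cv=5).fit(X_train, y_train)
        if search.best_score_ > best_score:
            best_name, best_model, best_score = name, search.best_estimator_, search.best_score_

    # Evaluate the winning model on the held-out test set.
    print(target, best_name, r2_score(y_test, best_model.predict(X_test)))
```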
Data Prediction/Analysis - Log Transformed.ipynb does the same thing as Data Prediction/Analysis - Untransformed.ipynb, except using log-transformed target values. The notebook contains 182 lines of code.
Data Random Exploration.ipynb contains some light, incomplete experimentation with other model types, such as multilayer perceptrons (MLPs). It does not contain any results relevant to the outcomes in the final paper. The notebook contains 95 lines of code.
raw_comp_data.json is data taken directly from Levels.fyi as is, captured from web page network calls. Unfortunately, it contains inconsistent input, such as different scaling (some entries give the full dollar amount and some give the amount in thousands).
raw_comp_data.csv is a direct transformation of raw_comp_data.json that occurs in Data Cleaning.ipynb.
clean_comp_data.csv contains cleaned versions of the compensation inconsistencies in raw_comp_data.csv and is generated by Data Cleaning.ipynb.
prepped_comp_data.csv is the final version of the data, ready for analysis, created by Data Preparation.ipynb. It contains log transformed target values, standard normalized years values, and dummy variables for the categorical features. Specifically, it contains (1) log transformed total compensation, (2) log transformed salary, (3) log transformed stock, (4) log transformed bonus, (5) standard normalized years of experience, (6) standard normalized years at the company, (7) the company dummies, (8) the location dummies, and (9) the tag/specialization dummies.
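As a rough illustration only, assuming hypothetical column prefixes such as log_ and years_ (the actual headers in the file may differ), the prepped columns could be grouped before modeling like this:

```python
import pandas as pd

df = pd.read_csv("data/prepped_comp_data.csv")
target_cols = [c for c in df.columns if c.startswith("log_")]    # log-transformed compensation targets
years_cols = [c for c in df.columns if c.startswith("years_")]   # standard-normalized experience/tenure
dummy_cols = [c for c in df.columns if c not in target_cols + years_cols]  # company/location/tag dummies
print(f"{len(target_cols)} targets, {len(years_cols)} numeric features, {len(dummy_cols)} dummies")
```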
prepped_comp_data_interactions.csv is also a final, analysis-ready version of the data created by Data Preparation.ipynb. It has everything that data/prepped_comp_data.csv has, and more: specifically, it also contains (1) the company/level interaction term dummies and (2) the company/location interaction term dummies.
The code in site-code/ is not used directly in the project. It comes from Levels.fyi as is, captured from web page network calls, and serves as a reference for how best to clean the compensation values in the raw data. All of the code there is in JavaScript; certain relevant snippets have been ported to Python.