This project uses the Adult Data Set from the UCI Machine Learning Repository, also known as the "Census Income" dataset. It is used to predict whether a person's income exceeds $50K/yr based on census data.
Dataset Features | |
---|---|
Number of Records: | 48842 |
Number of Variables: | 14 |
Attribute Characteristics: | Categorical, Integer |
Missing Values: | Yes |
Associated Tasks: | Classification |
Date Collected: | 1996-05-01 |
For this project I will use feature selection techniques and develop a model that can correctly classify the salary-class (greater than or less than $50K) of a person. The features provided for each person in the dataset are age, workclass, final-weight, education, education-number, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week and native-country.
Create a script that reads the data from the UCI Machine Learning Repository and stores it locally in the data folder, so that it can be processed before it is used.
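The download step above can be sketched in Python as follows. This is only an illustration, not the project's `data_read.R`: the function name `download` and the two URL constants are assumptions, and the URLs should be checked against the UCI repository before use.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed locations of the raw files; verify them against the UCI repository.
TRAIN_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
TEST_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"


def download(url: str, folder: str = "data") -> Path:
    """Download `url` into `folder` (created if missing) and return the local path."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last component of the URL.
    target = target_dir / url.rstrip("/").rsplit("/", 1)[-1]
    urlretrieve(url, str(target))
    return target
```

Calling `download(TRAIN_URL)` and `download(TEST_URL)` would then place both raw files in the data folder.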
Read the data into a second script that wrangles it and uses data visualization to better understand the dataset and how the different variables are distributed. Also identify the variables that best explain the problem.
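As a rough sketch of the wrangling step above, the raw file can be loaded with pandas and summarised before any plotting. The raw file has no header row and marks missing values with `?`. The column names below follow the feature list in this README, `salary-class` is the name assumed here for the label column, and the two sample rows are made up for illustration.

```python
import io

import pandas as pd

# Column names taken from this README; "salary-class" is an assumed label name.
COLUMNS = [
    "age", "workclass", "final-weight", "education", "education-number",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "salary-class",
]


def load_adult(source) -> pd.DataFrame:
    """Read a raw Adult file (path or file-like object) into a DataFrame."""
    return pd.read_csv(
        source,
        header=None,            # the raw file has no header row
        names=COLUMNS,
        na_values="?",          # "?" marks a missing value
        skipinitialspace=True,  # fields are separated by ", "
    )


# Two made-up rows in the raw file's layout, standing in for data/adult.data.
SAMPLE = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 60, United-States, >50K\n"
)

df = load_adult(SAMPLE)
print(df.isna().sum())                       # missing values per column
print(df["salary-class"].value_counts())     # class balance
print(df.groupby("salary-class")["age"].mean())  # one variable against the label
```

The same summaries, computed on the full dataset, indicate which variables are worth plotting against the salary-class.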
Split the categorical variables into multiple indicator variables so that they can be used in a classification algorithm. Then select a classification machine learning algorithm, train it on the training data, and predict the salary-class of all the people in the test dataset to evaluate the model's performance. Next, try tuning the hyper-parameters to improve the model's performance. If the results are still not satisfactory, apply different supervised classification algorithms to find the one that classifies the data most accurately.
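The encoding, training and tuning steps above could look roughly like this with pandas and scikit-learn. It is a minimal sketch under stated assumptions, not the contents of `model.py`: the decision tree is only a stand-in for whichever classifier is chosen, the grid over `max_depth` is an arbitrary example, and the 16 rows are made up.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up sample with two numeric and two categorical features.
people = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55, 65, 36, 26, 58, 48, 43],
    "hours-per-week": [40, 50, 40, 40, 30, 30, 40, 32, 40, 10, 40, 45, 40, 35, 48, 50],
    "education": ["HS-grad", "HS-grad", "Bachelors", "Masters",
                  "HS-grad", "HS-grad", "Bachelors", "Masters",
                  "HS-grad", "HS-grad", "Bachelors", "Bachelors",
                  "HS-grad", "Masters", "Masters", "Bachelors"],
    "sex": ["Male", "Female"] * 8,
    "salary-class": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K",
                     "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Split each categorical variable into one indicator column per category.
X = pd.get_dummies(people.drop(columns="salary-class"), columns=["education", "sex"])
y = (people["salary-class"] == ">50K").astype(int)

# Hold out a test set; the model is fitted on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Tune one hyper-parameter with cross-validation on the training data.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    cv=2,
)
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out test data.
accuracy = accuracy_score(y_test, search.predict(X_test))
print("best parameters:", search.best_params_)
print("test accuracy:", accuracy)
```

Swapping `DecisionTreeClassifier` for another scikit-learn estimator, with a matching `param_grid`, is enough to compare different algorithms on the same split.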
The project aims to classify people into two groups: those who earn more than $50,000 a year and those who earn less. People are classified based on features such as age, workclass, education, marital-status, occupation, race, sex, hours-per-week and native-country.
The scripts should be run in the order given below. Every argument has a default value, so the arguments shown for each script only need to be supplied to override the defaults. By default, the raw data are in the data folder, and the processed data and images will be stored in the doc folder. The rendered documents are stored in the results folder. These locations can be changed using the script arguments.
- Rscript data_read.R --train=train_url --test=test_url
- Rscript data_processing.R --write=write_folder
- Rscript data_summary.R --read=read_path --write=write_path
- Rscript data_viz.R --read=read_path --write=write_path
- python3 model.py
- Rscript -e 'rmarkdown::render("src/report.Rmd", output_dir = "results")'
make all -> To run all the scripts.
make clean -> To delete all the files created.
Link to Docker Repo : income-prediction-with-census-data
Run the following commands in a terminal in the given order:
- Docker Pull Command : docker pull avinashkz/income-prediction-with-census-data
- docker run --rm -it -v <repo local path>:/home/income-prediction avinashkz/income-prediction-with-census-data /bin/bash
- cd /home/income-prediction
- make all
- make clean -> To delete all the files created
Alternatively, to run the project without Docker:
- Open income-prediction.Rproj.
- packrat::restore()
- conda env create -f environment.yml
- source activate income
- make all
- make clean -> To delete all the files created
- Tidyverse
- Optparse
- Numpy
- Pandas
- Matplotlib
- Scikit Learn
The report for the mini project can be viewed at Report.md.