GitHub - avinashkz/income-prediction: Adult Data Set from UCI Machine Learning Repository. Also known as "Census Income" dataset. It is used to predict whether a person's income exceeds $50K/yr based on census data.

Data Set

Adult Data Set from UCI Machine Learning Repository. Also known as "Census Income" dataset. It is used to predict whether a person's income exceeds $50K/yr based on census data.

Dataset Features
Number of Records:	48842
Number of Variables:	14
Attribute Characteristics:	Categorical, Integer
Missing Values:	Yes
Associated Tasks:	Classification
Date Collected:	1996-05-01

Introduction

In this project I will apply feature selection techniques and develop a model that classifies a person's salary class (greater than or less than $50K per year). The features provided for each person in the dataset are age, workclass, final-weight, education, education-number, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week and native-country.
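As a minimal sketch of how one record with these features might be parsed in Python: the column names follow the list above, while the `income` label name, the `parse_record` helper and the sample row are assumptions made for illustration. The row uses the comma-separated layout of the UCI `adult.data` file, where a missing value is written as `?`.

```python
import csv
import io

# Feature names as listed above; "income" is an assumed name for the label column.
COLUMNS = [
    "age", "workclass", "final-weight", "education", "education-number",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
INTEGER_COLUMNS = {
    "age", "final-weight", "education-number",
    "capital-gain", "capital-loss", "hours-per-week",
}


def parse_record(line):
    """Turn one comma-separated line into a dict; '?' becomes None."""
    fields = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    record = {}
    for name, raw in zip(COLUMNS, fields):
        value = raw.strip()
        if value == "?":
            record[name] = None          # missing value
        elif name in INTEGER_COLUMNS:
            record[name] = int(value)
        else:
            record[name] = value
    return record


# Illustrative row with a missing native-country.
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, ?, <=50K")
record = parse_record(sample)
print(record["age"], record["native-country"], record["income"])
```

Mapping `?` to `None` at read time keeps the missing values explicit, so a later step can decide whether to drop or impute them.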

Plan of Action

Create a script that reads the data from the UCI Machine Learning Repository and stores it locally in the data folder for later processing.

Read the data into a second script that wrangles it and uses data visualization to better understand the dataset and how the different variables are distributed. Also, identify the variables that best explain the problem.

Split the categorical variables into multiple indicator variables so that they can be used in a classification algorithm. Then select a classification algorithm, train it on the training data, and predict the salary class of everyone in the test dataset to evaluate the model's performance. Next, tune the hyper-parameters to improve that performance. If the results are still not satisfactory, apply other supervised classification algorithms to see whether one of them classifies the data more accurately.
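The encoding and evaluation steps of this plan can be sketched in plain Python. This is only an illustration under stated assumptions: the toy rows are made up, `one_hot` and `predict_1nn` are hypothetical helpers, and the 1-nearest-neighbour rule is a stand-in classifier, not the model used in this project.

```python
def one_hot(rows, column):
    """Replace one categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def predict_1nn(train_x, train_y, x):
    """Return the label of the training row closest to x (squared distance)."""
    def distance(a, b):
        return sum((a[k] - b[k]) ** 2 for k in a)
    best = min(range(len(train_x)), key=lambda i: distance(train_x[i], x))
    return train_y[best]


# Made-up toy rows, not taken from the dataset.
rows = [
    {"hours-per-week": 60, "workclass": "Private", "income": ">50K"},
    {"hours-per-week": 55, "workclass": "State-gov", "income": ">50K"},
    {"hours-per-week": 20, "workclass": "Private", "income": "<=50K"},
    {"hours-per-week": 25, "workclass": "State-gov", "income": "<=50K"},
    {"hours-per-week": 58, "workclass": "State-gov", "income": ">50K"},
    {"hours-per-week": 22, "workclass": "Private", "income": "<=50K"},
]

encoded = one_hot(rows, "workclass")
labels = [row.pop("income") for row in encoded]  # separate label from features

# Fit on the first four rows, evaluate on the held-out last two.
train_x, test_x = encoded[:4], encoded[4:]
train_y, test_y = labels[:4], labels[4:]

predictions = [predict_1nn(train_x, train_y, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(encoded[0], accuracy)
```

The point of the split is that accuracy is measured only on rows the classifier never saw while fitting. In practice, library routines such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` do the encoding step.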

Summary

The project aims to classify people into two groups: those who earn more than $50,000 a year and those who earn $50,000 or less. People are classified based on features such as age, workclass, education, marital-status, occupation, race, sex, hours-per-week and native-country.

How to Run Data Analysis

Run the scripts in the order given below. Every argument has a default value; the arguments each script reads are shown below in case they need to be overridden. By default, the raw data are stored in the data folder, the processed data and images in the doc folder, and the rendered documents in the results folder. These locations can be changed through the script arguments.

Run Scripts Individually

Rscript data_read.R --train=train_url --test=test_url
Rscript data_processing.R --write=write_folder
Rscript data_summary.R --read=read_path --write=write_path
Rscript data_viz.R --read=read_path --write=write_path
python3 model.py
Rscript -e 'rmarkdown::render("src/report.Rmd", output_dir = "results")'
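For illustration, the ordering above could also be driven from a small Python wrapper. This is a sketch only: `STEPS` and `run_pipeline` are hypothetical names, and `train_url`, `read_path` and the other values are the placeholders from the commands above, which must be replaced before a real run. The example ends with a dry run that prints the commands instead of executing them.

```python
import subprocess

# The commands listed above, in order, as argument lists.
# train_url, test_url, read_path, etc. are placeholders, not real values.
STEPS = [
    ["Rscript", "data_read.R", "--train=train_url", "--test=test_url"],
    ["Rscript", "data_processing.R", "--write=write_folder"],
    ["Rscript", "data_summary.R", "--read=read_path", "--write=write_path"],
    ["Rscript", "data_viz.R", "--read=read_path", "--write=write_path"],
    ["python3", "model.py"],
    ["Rscript", "-e",
     'rmarkdown::render("src/report.Rmd", output_dir = "results")'],
]


def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order; with subprocess.run, check=True stops at the first failure."""
    for argv in steps:
        runner(argv, check=True)


# Dry run: print each command rather than executing it.
run_pipeline(STEPS, runner=lambda argv, check: print(" ".join(argv)))
```

Calling `run_pipeline(STEPS)` without a custom `runner` would execute the commands for real and raise `subprocess.CalledProcessError` if any step exits with a non-zero status.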

Run Scripts using Makefile

make all -> For running all the files.

make clean -> To delete all the files created.

How to Run Project in Docker

Link to Docker Repo : income-prediction-with-census-data

Run the following commands in terminal in the given order:

Docker Pull Command : docker pull avinashkz/income-prediction-with-census-data
docker run --rm -it -v < repo local path >:/home/income-prediction avinashkz/income-prediction-with-census-data /bin/bash
cd /home/income-prediction
make all
make clean -> To delete all the files created

Run using Packrat and conda env

Open income-prediction.Rproj.
packrat::restore()
conda env create -f environment.yml
source activate income
make all
make clean -> To delete all the files created

Software Dependencies

R

Tidyverse
Optparse

Python

Numpy
Pandas
Matplotlib
Scikit Learn

Report

The report for the mini project can be viewed at Report.md.
