Mahaveer Rulaniya edited this page Sep 29, 2021 · 1 revision

👩‍👩‍👦Problem Statement Understanding

For any data science problem, we are given a problem statement along with a dataset.


🧐Attributes Description

Next, we need to understand the features (attributes) of the given dataset in order to learn more about the problem.

Example of attributes/features -

Variable Description

Variable          Definition
ID                Unique ID
Gender            Gender of the customer
Ever_Married      Marital status of the customer
Age               Age of the customer
Graduated         Is the customer a graduate?
Profession        Profession of the customer
Work_Experience   Work experience in years
Spending_Score    Spending score of the customer
Family_Size       Number of family members for the customer (including the customer)
Var_1             Anonymised category for the customer

📌Concepts and Techniques used in the Project


🪁Data Wrangling

Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time. This includes -

  • Basic Cleaning
  • Getting information about attributes
  • Data manipulation
  • Organising the data
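
The four steps above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the rows are made up, and only the column names (ID, Gender, Age) come from the attribute table.

```python
import pandas as pd

# Hypothetical raw data: a duplicated row, untidy text, numbers stored as strings.
df = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Gender": [" Male", "Female", "Female", "male "],
    "Age": ["22", "35", "35", "41"],
})

# Basic cleaning: drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Data manipulation: normalise text values and fix the column type.
df["Gender"] = df["Gender"].str.strip().str.title()
df["Age"] = df["Age"].astype(int)

# Organising the data: sort by ID and use it as the index.
df = df.sort_values("ID").set_index("ID")

# Getting information about attributes: one data type per column.
column_types = df.dtypes
print(column_types)
```

Calling `df.info()` or `df.describe()` at this point gives a fuller summary of the attributes.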

🪁Missing Values Treatment

If the missing data is present in a -

  • Continuous feature - Fill with the median or the mean, depending on the distribution of the feature (the median is more robust when the distribution is skewed or contains outliers)
  • Categorical feature - Fill with the mode of the column
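
A minimal pandas sketch of both rules, assuming made-up values for the Age and Profession columns from the attribute table:

```python
import pandas as pd

# Hypothetical sample with one missing value per column.
df = pd.DataFrame({
    "Age": [22.0, 35.0, None, 41.0, 90.0],
    "Profession": ["Artist", None, "Doctor", "Artist", "Engineer"],
})

# Continuous feature: the median is less affected by the extreme value 90
# than the mean would be.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical feature: mode() returns a Series because ties are possible,
# so we take its first entry.
df["Profession"] = df["Profession"].fillna(df["Profession"].mode()[0])

print(df)
```

Here the missing Age becomes 38.0 (the median of 22, 35, 41 and 90) and the missing Profession becomes "Artist", the most frequent value.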

🪁Exploratory Data Analysis


1. Variable Identification

  • Categorical
    • Ordinal
    • Nominal
  • Continuous

2. Univariate Analysis

  • For Categorical Variable -
    • Count how many rows of the dataset fall into each category of the variable
  • For Continuous Variable -
    • Find the Distribution of features using Histogram
    • Outlier detection using Box Plot
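
The numbers behind these plots can be computed without any plotting library. The sketch below uses made-up values for Gender and Age; the choice of 3 bins is arbitrary and only for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Age": [22, 35, 38, 41, 90],
})

# Categorical variable: number of rows per category.
gender_counts = df["Gender"].value_counts()

# Continuous variable: summary statistics (count, mean, std, quartiles, min, max).
age_summary = df["Age"].describe()

# Histogram counts: split Age into 3 equal-width bins and count rows per bin.
age_bins = pd.cut(df["Age"], bins=3).value_counts().sort_index()

print(gender_counts, age_summary, age_bins, sep="\n\n")
```

If matplotlib is installed, `df["Age"].plot.hist()` and `df["Age"].plot.box()` draw the corresponding histogram and box plot.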

3. Bi-Variate Analysis

  • Categorical - Continuous Variables ---> Bar Graph
  • Continuous - Continuous Variables ---> Scatter Plot to see relationship
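
A short sketch of the quantities these two plots visualise, again on made-up values. Using the per-category mean for the bar heights and the Pearson correlation for the scatter plot are our assumptions; the page does not say which summaries the project used.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Age": [22, 50, 30, 60, 26],
    "Work_Experience": [1, 9, 3, 12, 2],
})

# Categorical vs continuous: the mean per category is what the bars would show.
mean_age_by_gender = df.groupby("Gender")["Age"].mean()

# Continuous vs continuous: Pearson correlation summarises the linear trend
# that a scatter plot would reveal.
age_experience_corr = df["Age"].corr(df["Work_Experience"])

print(mean_age_by_gender)
print(round(age_experience_corr, 3))
```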

4. Outlier Detection

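  • Box plots are a common visual tool for outlier detection: points that lie more than 1.5 × IQR (interquartile range) beyond the quartiles are flagged as potential outliers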
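
The rule a box plot conventionally uses for its whiskers can also be applied directly. This is a minimal sketch on made-up ages; the 1.5 multiplier is the usual convention, not something stated by the project.

```python
import pandas as pd

# Hypothetical ages with one obviously extreme value.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 120], name="Age")

# Interquartile range: the spread of the middle 50% of the data.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(outliers.tolist())  # → [120]
```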

5. Missing value Treatment

  • Depending on the dataset, missing value treatment can be done either before or after the EDA process

🪁Data Preprocessing

1. Feature Engineering

  • After analysing the dataset, we add or remove attributes based on the insights gained so that our model performs well.

2. Feature Encoding

  • Since most machine learning models accept only numerical values, the categorical variables (those with the 'object' data type) are converted into numerical values depending on the type of variable
    • Ordinal Categorical variable - Label Encoding
    • Nominal Categorical Variable - One Hot Encoding
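
Both encodings can be sketched in pandas. The category values below (Low, Average, High for Spending_Score) and their ordering are assumptions made for illustration; the attribute table does not list the actual values.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Spending_Score": ["Low", "High", "Average", "Low"],
    "Profession": ["Artist", "Doctor", "Artist", "Engineer"],
})

# Ordinal variable: an explicit mapping keeps the assumed order Low < Average < High.
score_order = {"Low": 0, "Average": 1, "High": 2}
df["Spending_Score"] = df["Spending_Score"].map(score_order)

# Nominal variable: one 0/1 column per category, so no artificial order is introduced.
df = pd.get_dummies(df, columns=["Profession"], dtype=int)

print(df)
```

An explicit mapping is used for the ordinal column because an automatic label encoder typically numbers categories alphabetically, which would not match the intended order here.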

3. Feature Scaling

  • Features are scaled to improve the results of algorithms that are sensitive to feature magnitudes, for example through a distance or similarity function, such as -
    • K-Means Clustering algorithm
    • K-NN algorithm
    • Principal Component Analysis
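
Two common scaling schemes, written out by hand in pandas so the arithmetic is visible. The page does not say which scheme the project used, so both are shown as options on made-up values.

```python
import pandas as pd

# Hypothetical sample: Age spans tens of units, Family_Size only a few.
df = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Family_Size": [1, 2, 3, 4, 5],
})

# Standardisation (z-score): each column gets mean 0 and standard deviation 1.
standardized = (df - df.mean()) / df.std(ddof=0)

# Min-max scaling: each column is mapped onto the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

print(standardized)
print(min_max)
```

After either transformation, Age no longer dominates a distance computation simply because its raw values are larger than those of Family_Size.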

✂Machine Learning Model Building

Now we select the machine learning algorithm that gives the best predictions for the problem.