For any data science problem, we are given a problem statement along with a dataset.
We first need to understand the features (attributes) of the dataset in order to learn more about the problem.
Example of attributes/features:
Variable | Description
---|---
ID | Unique ID
Gender | Gender of the customer
Ever_Married | Marital status of the customer
Age | Age of the customer
Graduated | Is the customer a graduate?
Profession | Profession of the customer
Work_Experience | Work experience in years
Spending_Score | Spending score of the customer
Family_Size | Number of family members for the customer (including the customer)
Var_1 | Anonymised category for the customer
Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time. This includes:
- Basic cleaning
- Getting information about the attributes
- Data manipulation
- Organising the data
If missing data is present in:
- a continuous feature - fill it with the median or mean, depending on the distribution of that feature
- a categorical feature - fill it with the mode of the column
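A minimal sketch of this wrangling step, assuming the data is loaded into a pandas DataFrame from a file named `train.csv` (the file name is only a placeholder):

```python
import pandas as pd

# Placeholder file name; adjust to the actual dataset path.
df = pd.read_csv("train.csv")

# Basic information about the attributes.
df.info()                    # column dtypes and non-null counts
print(df.describe())         # summary statistics for numeric columns
print(df.isnull().sum())     # missing values per column

# Fill missing values: median/mean for continuous, mode for categorical.
for col in df.columns:
    if df[col].isnull().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            # Median is robust to skewed distributions; use the mean
            # if the feature is roughly symmetric.
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
```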
1. Variable Identification
   - Categorical
     - Ordinal
     - Nominal
   - Continuous
2. Univariate Analysis (see the EDA sketch after this list)
   - For a categorical variable:
     - count of each category present in the dataset
   - For a continuous variable:
     - find the distribution of the feature using a histogram
     - detect outliers using a box plot
3. Bi-Variate Analysis
   - Categorical vs. continuous variables ---> bar graph
   - Continuous vs. continuous variables ---> scatter plot to see the relationship
4. Outlier Detection
   - Box plots are a simple and effective statistical tool for spotting outliers
5. Missing Value Treatment
   - Depending on the dataset, missing value treatment can be done before or after the EDA process
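A short EDA sketch covering the univariate, bi-variate, and outlier steps above, reusing the `df` from the earlier snippet and the column names from the attribute table (matplotlib and seaborn are assumed to be available):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate, categorical: counts per category.
print(df["Profession"].value_counts())

# Univariate, continuous: distribution via histogram.
df["Age"].hist(bins=30)
plt.title("Age distribution")
plt.show()

# Outlier detection with a box plot.
sns.boxplot(x=df["Work_Experience"])
plt.show()

# Bi-variate, categorical vs. continuous: bar graph of group means.
df.groupby("Spending_Score")["Age"].mean().plot(kind="bar")
plt.ylabel("Mean age")
plt.show()

# Bi-variate, continuous vs. continuous: scatter plot.
df.plot.scatter(x="Age", y="Work_Experience")
plt.show()
```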
1. Feature Engineering
   - After analysing the dataset, and based on those insights, we add or remove attributes so that the model performs better.
2. Feature Encoding (both encodings appear in the sketch after this list)
   - Since machine learning models accept only numerical values, categorical variables with the 'object' data type are converted into numerical values, depending on the type of variable:
     - Ordinal categorical variable - Label Encoding
     - Nominal categorical variable - One-Hot Encoding
3. Feature Scaling
   - Features are scaled because some algorithms depend on a distance or similarity function and are sensitive to feature magnitudes, for example:
     - K-Means clustering algorithm
     - K-NN algorithm
     - Principal Component Analysis
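A sketch of encoding and scaling, continuing with the same `df`; the choice of which columns are treated as ordinal, nominal, or numeric follows the attribute table above and may need adjusting for the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label encoding for an ordinal categorical variable.
# Note: LabelEncoder assigns codes alphabetically; use an explicit mapping
# if the category order must be preserved exactly.
le = LabelEncoder()
df["Spending_Score"] = le.fit_transform(df["Spending_Score"])

# One-hot encoding for nominal categorical variables.
df = pd.get_dummies(
    df,
    columns=["Gender", "Ever_Married", "Graduated", "Profession", "Var_1"],
    drop_first=True,
)

# Standardise numeric features for distance-based algorithms
# (K-Means, K-NN, PCA).
scaler = StandardScaler()
num_cols = ["Age", "Work_Experience", "Family_Size"]
df[num_cols] = scaler.fit_transform(df[num_cols])
```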
Finally, we select the machine learning algorithm that gives the best predictions for the problem.
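As an illustration only, one simple way to compare candidate models is cross-validation; the target column name `Segmentation` below is a hypothetical placeholder and should be replaced with the actual label in the dataset:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical target column; replace with the actual label.
X = df.drop(columns=["ID", "Segmentation"])
y = df["Segmentation"]

# Compare candidate models with 5-fold cross-validation.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```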