From the given database (responses.csv), we find personality traits using this personality-prediction project.
The system predicts a person's personality traits from a basic survey. This can help human resources select the right candidate for a desired job profile, which in turn provides an expert workforce for the organization.
Factor analysis has been used in the study of human intelligence and human personality as a method for comparing the outcomes of (hopefully) objective tests, for constructing matrices that define the correlations between these outcomes, and for finding the factors behind these results. The field of psychology that measures human intelligence using quantitative testing in this way is known as psychometrics (psycho=mental, metrics=measurement).
- Offers a much more objective method of testing traits such as intelligence in humans
- Allows for a satisfactory comparison between the results of intelligence tests
- Provides support for theories that would be difficult to prove otherwise
The workflow:
1. Refine the data.
2. Prepare the data: choose the variables and build the correlation matrix.
3. Extract factors using a factor-analysis method such as EFA (exploratory factor analysis).
4. Decide the number of factors.
5. Compute the factor loadings and rotate them.
6. Report the appropriate number of factors.
Import all the libraries needed for this Python code:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Make a DataFrame using pandas and check its shape:
#Data
df = pd.read_csv("responses.csv")
df.shape
Out: (1010, 150)
So responses.csv contains 1010 rows and 150 columns, which means the data was collected by surveying 1010 individuals across 150 different preference fields:
- MUSIC PREFERENCES (19): columns 0:19
- MOVIE PREFERENCES (12): columns 19:31
- HOBBIES & INTERESTS (32): columns 31:63
- PHOBIAS (10): columns 63:73
- HEALTH HABITS (3): columns 73:76
- PERSONALITY TRAITS, VIEWS ON LIFE & OPINIONS (57): columns 76:133
- SPENDING HABITS (7): columns 133:140
- DEMOGRAPHICS (10): columns 140:150
We will take only PERSONALITY TRAITS, VIEWS ON LIFE & OPINIONS (57 columns, 76:133):
df = df.iloc[:, 76:133]
df.head(5)
Out:
Daily events | Prioritising workload | Writing notes | Workaholism | Thinking ahead | Final judgement | Reliability | Keeping promises | Loss of interest | Friends versus money | ... | Happiness in life | Energy levels | Small - big dogs | Personality | Finding lost valuables | Getting up | Interests or hobbies | Parents' advice | Questionnaires or polls | Internet usage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.0 | 2.0 | 5.0 | 4.0 | 2.0 | 5.0 | 4.0 | 4.0 | 1.0 | 3.0 | ... | 4.0 | 5.0 | 1.0 | 4.0 | 3.0 | 2.0 | 3.0 | 4.0 | 3.0 | few hours a day |
1 | 3.0 | 2.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 | 4.0 | 3.0 | 4.0 | ... | 4.0 | 3.0 | 5.0 | 3.0 | 4.0 | 5.0 | 3.0 | 2.0 | 3.0 | few hours a day |
2 | 1.0 | 2.0 | 5.0 | 3.0 | 5.0 | 3.0 | 4.0 | 5.0 | 1.0 | 5.0 | ... | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 | 3.0 | 1.0 | few hours a day |
3 | 4.0 | 4.0 | 4.0 | 5.0 | 3.0 | 1.0 | 3.0 | 4.0 | 5.0 | 2.0 | ... | 2.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | NaN | 2.0 | 4.0 | most of the day |
4 | 3.0 | 1.0 | 2.0 | 3.0 | 5.0 | 5.0 | 5.0 | 4.0 | 2.0 | 3.0 | ... | 3.0 | 5.0 | 3.0 | 3.0 | 2.0 | 4.0 | 3.0 | 3.0 | 3.0 | few hours a day |
5 rows × 57 columns
#Drop NAs
df = df.dropna()
#...............................................................................................
#Encode categorical data
from sklearn.preprocessing import LabelEncoder
df = df.apply(LabelEncoder().fit_transform)
df
The dropna() method removes any row containing null values from the DataFrame.
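A minimal illustration of what dropna() does, on a hypothetical toy frame (not the survey data):

```python
import pandas as pd
import numpy as np

# Toy frame: rows 1 and 2 each contain one missing value
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# dropna() keeps only rows with no NaN anywhere
clean = toy.dropna()
print(toy.shape, clean.shape)  # (3, 2) (1, 2)
```

This is also why the encoded output below has 864 rows instead of 1010: respondents with any missing answer were dropped.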
Why are we encoding the data?
The analysis requires all input and output variables to be numeric. This means that if our data contains categorical values, we must encode them as numbers before we can fit and evaluate a model.
There are two types of encoding:
Integer encoding
each unique label is mapped to an integer.
One-hot encoding
the column containing categorical data is split into as many columns as there are categories; each new column contains a "1" for rows belonging to that category and a "0" otherwise.
Before Encoding | After Encoding |
---|---|
Height | Height |
Tall | 0 |
Short | 1 |
Medium | 2 |
Medium | 2 |
Short | 1 |
Tall | 0 |
Here, we have used integer (label) encoding via scikit-learn's LabelEncoder, as in the table above.
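Both schemes from the table can be reproduced with plain pandas. A sketch (note: pd.factorize numbers labels in order of appearance, which matches the table; sklearn's LabelEncoder instead sorts labels alphabetically before numbering):

```python
import pandas as pd

heights = pd.Series(["Tall", "Short", "Medium", "Medium", "Short", "Tall"])

# Integer (label) encoding: each unique label -> one integer
codes, labels = pd.factorize(heights)
print(list(codes))   # [0, 1, 2, 2, 1, 0]

# One-hot encoding: one 0/1 column per category
onehot = pd.get_dummies(heights)
print(onehot.shape)  # (6, 3)
```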
Out:
Daily events | Prioritising workload | Writing notes | Workaholism | Thinking ahead | Final judgement | Reliability | Keeping promises | Loss of interest | Friends versus money | ... | Happiness in life | Energy levels | Small - big dogs | Personality | Finding lost valuables | Getting up | Interests or hobbies | Parents' advice | Questionnaires or polls | Internet usage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 4 | 3 | 1 | 4 | 3 | 3 | 0 | 2 | ... | 3 | 4 | 0 | 3 | 2 | 1 | 2 | 3 | 2 | 0 |
1 | 2 | 1 | 3 | 4 | 3 | 0 | 3 | 3 | 2 | 3 | ... | 3 | 2 | 4 | 2 | 3 | 4 | 2 | 1 | 2 | 0 |
2 | 0 | 1 | 4 | 2 | 4 | 2 | 3 | 4 | 0 | 4 | ... | 3 | 3 | 2 | 2 | 2 | 3 | 4 | 2 | 0 | 0 |
4 | 2 | 0 | 1 | 2 | 4 | 4 | 4 | 3 | 1 | 2 | ... | 2 | 4 | 2 | 2 | 1 | 3 | 2 | 2 | 2 | 0 |
5 | 1 | 1 | 2 | 2 | 2 | 0 | 2 | 3 | 2 | 1 | ... | 2 | 3 | 3 | 2 | 2 | 2 | 4 | 2 | 3 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1005 | 2 | 1 | 0 | 3 | 1 | 2 | 2 | 2 | 3 | 3 | ... | 3 | 2 | 2 | 2 | 3 | 4 | 3 | 3 | 2 | 0 |
1006 | 0 | 2 | 0 | 4 | 4 | 4 | 4 | 3 | 0 | 1 | ... | 3 | 3 | 2 | 4 | 2 | 0 | 2 | 3 | 2 | 1 |
1007 | 2 | 0 | 0 | 0 | 3 | 0 | 2 | 4 | 0 | 3 | ... | 2 | 0 | 2 | 1 | 2 | 4 | 0 | 3 | 4 | 2 |
1008 | 2 | 0 | 4 | 0 | 2 | 3 | 3 | 3 | 4 | 2 | ... | 2 | 1 | 1 | 3 | 0 | 4 | 2 | 2 | 2 | 2 |
1009 | 2 | 4 | 3 | 4 | 3 | 2 | 4 | 4 | 2 | 3 | ... | 3 | 1 | 2 | 3 | 0 | 1 | 1 | 2 | 4 | 0 |
864 rows × 57 columns
pip install factor_analyzer
Requirement already satisfied: factor_analyzer in c:\users\dell\anaconda3\lib\site-packages (0.3.2)
Requirement already satisfied: pandas in c:\users\dell\anaconda3\lib\site-packages (from factor_analyzer) (0.25.1)
Requirement already satisfied: scipy in c:\users\dell\anaconda3\lib\site-packages (from factor_analyzer) (1.3.1)
Requirement already satisfied: numpy in c:\users\dell\anaconda3\lib\site-packages (from factor_analyzer) (1.16.5)
Requirement already satisfied: scikit-learn in c:\users\dell\anaconda3\lib\site-packages (from factor_analyzer) (0.21.3)
Requirement already satisfied: pytz>=2017.2 in c:\users\dell\anaconda3\lib\site-packages (from pandas->factor_analyzer) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\users\dell\anaconda3\lib\site-packages (from pandas->factor_analyzer) (2.8.0)
Requirement already satisfied: joblib>=0.11 in c:\users\dell\anaconda3\lib\site-packages (from scikit-learn->factor_analyzer) (0.13.2)
Requirement already satisfied: six>=1.5 in c:\users\dell\anaconda3\lib\site-packages (from python-dateutil>=2.6.1->pandas->factor_analyzer) (1.12.0)
Note: you may need to restart the kernel to use updated packages.
Factor Analyzer
Factor analysis reduces a large number of variables to a smaller number of factors. factor_analyzer is a Python module that performs exploratory factor analysis with several optional rotations. It also includes a class to perform confirmatory factor analysis (CFA) with certain predefined techniques.
What is factor rotation?
Rotation minimizes the complexity of the factor loadings to make the structure simpler to interpret.
There are two types of rotation:
- Orthogonal rotation
constrains the factors to be uncorrelated. Although often favored, in many cases it is unrealistic to expect the factors to be uncorrelated, and forcing them to be uncorrelated makes it less likely that the rotation produces a solution with a simple structure. Methods include varimax, quartimax, and equamax. Varimax maximizes the sum of the variances of the squared loadings, which makes the structure simpler: for p variables with loadings λ_ij, it maximizes V = Σ_j [ (1/p) Σ_i λ_ij^4 − ((1/p) Σ_i λ_ij^2)^2 ].
- Oblique rotation
permits the factors to be correlated with one another, which often produces a solution with a simpler structure.
Here, we assume the factors are uncorrelated, so we use the orthogonal varimax rotation method.
Now we determine the number of factors using a scree plot. We could also use the eigenvalues directly, but that is more complex; with a scree plot the number of factors is easy to find.
#Try the model with all the variables
from factor_analyzer import FactorAnalyzer # pip install factor_analyzer
fa = FactorAnalyzer(rotation="varimax")
fa.fit(df)
# Check Eigenvalues
ev, v = fa.get_eigenvalues()
ev
# Create scree plot using matplotlib
plt.scatter(range(1,df.shape[1]+1),ev)
plt.plot(range(1,df.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()
Out: [scree plot: eigenvalues vs. number of factors]
How do we find the number of factors?
A scree plot shows the eigenvalues on the y-axis and the number of factors on the x-axis. It always displays a downward curve; the point where the slope of the curve clearly levels off (the "elbow") indicates the number of factors the analysis should generate.
As you can see, the most useful factors for explaining the data are the first 5 or 6, after which the eigenvalues fall off significantly.
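Besides eyeballing the elbow, the Kaiser criterion (retain factors whose correlation-matrix eigenvalue exceeds 1) can be applied numerically. A sketch on synthetic data standing in for the survey responses:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the encoded survey data (the real df is 864 x 57)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # make two columns correlate

# Eigenvalues of the correlation matrix; their sum equals the number of variables
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

# Kaiser criterion: keep factors with eigenvalue > 1
n_factors = int((eigvals > 1).sum())
print(n_factors)
```

On the real data, this count can be compared against the elbow read off the scree plot.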
We will fit the model with 5 Factors:
#Factor analysis with 5 factors
fa = FactorAnalyzer(5, rotation="varimax")
fa.fit(df)
AF = fa.loadings_
AF = pd.DataFrame(AF)
AF.index = df.columns
AF
Out:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Daily events | 0.250416 | 0.058953 | 0.206877 | 0.026094 | 0.028915 |
Prioritising workload | -0.012803 | -0.150045 | 0.555946 | 0.078913 | 0.128156 |
Writing notes | -0.006039 | -0.015927 | 0.420849 | 0.225307 | 0.261380 |
Workaholism | 0.069524 | 0.029275 | 0.527082 | 0.088573 | 0.032979 |
Thinking ahead | 0.023475 | 0.127909 | 0.530457 | 0.035213 | 0.055426 |
Final judgement | 0.046188 | 0.112493 | 0.119861 | 0.381338 | -0.039756 |
Reliability | 0.061028 | -0.102481 | 0.539373 | 0.073534 | -0.003491 |
Keeping promises | 0.053358 | -0.034661 | 0.420538 | 0.121450 | -0.033511 |
Loss of interest | 0.273777 | 0.226286 | 0.003524 | -0.149262 | 0.101882 |
Friends versus money | 0.021279 | -0.111839 | 0.022026 | 0.381357 | -0.045824 |
Funniness | 0.312861 | 0.131400 | -0.043014 | -0.018258 | -0.026083 |
Fake | 0.091188 | 0.469616 | -0.024535 | -0.191798 | 0.019356 |
Criminal damage | 0.154868 | 0.177732 | -0.112659 | -0.240721 | 0.266761 |
Decision making | -0.287128 | 0.102033 | 0.267415 | 0.129336 | 0.158694 |
Elections | 0.074306 | -0.015585 | 0.222003 | 0.131404 | -0.083563 |
Self-criticism | -0.016858 | 0.398420 | 0.229116 | 0.114144 | 0.069707 |
Judgment calls | 0.182082 | -0.010461 | 0.102263 | 0.035675 | 0.086474 |
Hypochondria | -0.040254 | 0.258913 | -0.034874 | 0.042981 | 0.213548 |
Empathy | -0.050152 | -0.073697 | 0.059441 | 0.324982 | 0.133754 |
Eating to survive | -0.010608 | 0.183045 | 0.003261 | -0.015131 | -0.018874 |
Giving | 0.082276 | -0.154549 | 0.112481 | 0.376723 | 0.234000 |
Compassion to animals | -0.083505 | -0.002767 | -0.010424 | 0.262183 | 0.192734 |
Borrowed stuff | -0.097017 | -0.023047 | 0.323253 | 0.171017 | 0.071189 |
Loneliness | -0.199197 | 0.542350 | -0.019272 | 0.045942 | 0.190369 |
Cheating in school | 0.216223 | -0.063183 | -0.384634 | -0.083940 | 0.208210 |
Health | -0.012267 | 0.027867 | 0.131645 | 0.184296 | 0.437826 |
Changing the past | -0.016622 | 0.482307 | -0.161320 | 0.073843 | 0.159231 |
God | 0.047894 | 0.032281 | 0.027136 | 0.453873 | -0.025963 |
Dreams | 0.207076 | -0.187723 | 0.078634 | 0.037709 | -0.124853 |
Charity | 0.163161 | 0.116834 | 0.156898 | 0.354953 | -0.067795 |
Number of friends | 0.514994 | -0.321738 | -0.086711 | 0.241070 | -0.006859 |
Punctuality | 0.004662 | 0.090531 | -0.143569 | 0.069648 | 0.078111 |
Lying | -0.095933 | -0.193370 | 0.001775 | 0.138092 | 0.006950 |
Waiting | 0.032019 | -0.067715 | -0.000820 | 0.075966 | -0.329606 |
New environment | 0.470076 | -0.129745 | -0.058912 | 0.005400 | -0.230743 |
Mood swings | -0.086477 | 0.353226 | -0.041005 | 0.031490 | 0.404388 |
Appearence and gestures | 0.227246 | -0.004762 | 0.105894 | 0.068825 | 0.303119 |
Socializing | 0.537811 | -0.096245 | -0.048127 | 0.135323 | -0.039204 |
Achievements | 0.252835 | 0.048658 | -0.042799 | -0.082401 | 0.111902 |
Responding to a serious letter | -0.126985 | 0.087976 | -0.026876 | 0.022940 | 0.013346 |
Children | 0.079877 | -0.134254 | 0.033040 | 0.440103 | 0.075663 |
Assertiveness | 0.353462 | -0.094372 | 0.002509 | -0.067185 | 0.044117 |
Getting angry | 0.051167 | 0.176922 | -0.086069 | -0.070837 | 0.532025 |
Knowing the right people | 0.478657 | 0.022868 | 0.113503 | -0.045359 | 0.227230 |
Public speaking | -0.385674 | 0.104662 | 0.069712 | 0.030447 | 0.190834 |
Unpopularity | -0.082146 | 0.229228 | 0.079173 | 0.241031 | -0.031212 |
Life struggles | -0.226293 | 0.057892 | -0.059615 | 0.384875 | 0.392060 |
Happiness in life | 0.288585 | -0.541050 | 0.158473 | 0.051235 | -0.064525 |
Energy levels | 0.499978 | -0.478860 | 0.037918 | 0.122773 | -0.025001 |
Small - big dogs | 0.206696 | 0.040211 | -0.143225 | -0.203991 | -0.131298 |
Personality | 0.259646 | -0.393197 | 0.064236 | 0.049013 | -0.056988 |
Finding lost valuables | -0.127907 | -0.011367 | 0.163354 | 0.391951 | -0.101749 |
Getting up | 0.012217 | 0.150551 | -0.312297 | 0.082580 | 0.121198 |
Interests or hobbies | 0.465627 | -0.253289 | 0.065015 | 0.144827 | -0.078694 |
Parents' advice | 0.022594 | -0.032871 | 0.243628 | 0.282252 | 0.113225 |
Questionnaires or polls | -0.045177 | 0.114865 | 0.154309 | 0.188501 | -0.032532 |
Internet usage | -0.046077 | 0.075435 | -0.007799 | -0.081575 | 0.048144 |
#Get Top variables for each Factor
F = AF.unstack()
F = pd.DataFrame(F).reset_index()
F = F.sort_values(['level_0',0], ascending=False).groupby('level_0').head(5) # Top 5
F = F.sort_values(by="level_0")
F.columns=["FACTOR","Variable","Varianza_Explica"]
F = F.reset_index().drop(["index"],axis=1)
F
Out:
FACTOR | Variable | Varianza_Explica | |
---|---|---|---|
0 | 0 | New environment | 0.470076 |
1 | 0 | Energy levels | 0.499978 |
2 | 0 | Number of friends | 0.514994 |
3 | 0 | Socializing | 0.537811 |
4 | 0 | Knowing the right people | 0.478657 |
5 | 1 | Mood swings | 0.353226 |
6 | 1 | Self-criticism | 0.398420 |
7 | 1 | Fake | 0.469616 |
8 | 1 | Changing the past | 0.482307 |
9 | 1 | Loneliness | 0.542350 |
10 | 2 | Writing notes | 0.420849 |
11 | 2 | Workaholism | 0.527082 |
12 | 2 | Thinking ahead | 0.530457 |
13 | 2 | Prioritising workload | 0.555946 |
14 | 2 | Reliability | 0.539373 |
15 | 3 | Friends versus money | 0.381357 |
16 | 3 | Life struggles | 0.384875 |
17 | 3 | Finding lost valuables | 0.391951 |
18 | 3 | Children | 0.440103 |
19 | 3 | God | 0.453873 |
20 | 4 | Appearence and gestures | 0.303119 |
21 | 4 | Life struggles | 0.392060 |
22 | 4 | Mood swings | 0.404388 |
23 | 4 | Health | 0.437826 |
24 | 4 | Getting angry | 0.532025 |
#Show the Top for each Factor
F = F.pivot(columns='FACTOR')["Variable"]
F.apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out:
FACTOR | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | New environment | Mood swings | Writing notes | Friends versus money | Appearence and gestures |
1 | Energy levels | Self-criticism | Workaholism | Life struggles | Life struggles |
2 | Number of friends | Fake | Thinking ahead | Finding lost valuables | Mood swings |
3 | Socializing | Changing the past | Prioritising workload | Children | Health |
4 | Knowing the right people | Loneliness | Reliability | God | Getting angry |
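An alternative to the unstack/sort/groupby pipeline above: taking nlargest per column gives the same top variables in one pass. A sketch on a hypothetical 3×2 loadings frame standing in for AF:

```python
import pandas as pd

# Hypothetical loadings (variables x factors), standing in for AF
AF = pd.DataFrame(
    {0: [0.51, 0.05, 0.47], 1: [0.10, 0.54, 0.02]},
    index=["Number of friends", "Loneliness", "New environment"],
)

# Top-k loading variables per factor
top = {f: AF[f].nlargest(2).index.tolist() for f in AF.columns}
print(top)
# {0: ['Number of friends', 'New environment'], 1: ['Loneliness', 'Number of friends']}
```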
FACTOR 1: Energy levels, Number of friends, Socializing...
Could be: Extraversion
FACTOR 2: Self-criticism, Fake, Loneliness...
Looks very similar to "Neuroticism"
Factor 3: Thinking ahead, Prioritising workload...
very similar to "Conscientiousness"
Factor 4: Children, God, Finding lost valuables
This factor could be something like "religious" or "conservative", and might correspond to low scores on "Openness" in the Big Five model.
Factor 5: Appearence and gestures, Mood swings
Hmm, it could be "Agreeableness". What do you think it represents?
The first three factors are very clear: Extraversion, Neuroticism, and Conscientiousness. The other two are not as clear, but this is still a very interesting approximation.
Maybe first running a PCA to remove highly correlated variables like "God" and "Final judgement" could help.
What do you think?