This project analyzes data from OkCupid profiles, focusing on demographic information, lifestyle choices, and sentiment analysis of user essay responses. The goal is to extract meaningful insights into user behaviors, profile completion rates, and trends in income and sentiment based on demographic and lifestyle factors. The analysis also identifies potential correlations between specific profile attributes and user-reported income and sentiment.
- Introduction
- Data Source
- Data Preprocessing
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Modeling
- Findings and Insights
- Conclusion
This project analyzes OkCupid profiles to uncover patterns in user preferences, behaviors, and the relationship between profile attributes such as job type, pet preferences, and income. It also examines the sentiment of user essays and how lifestyle choices relate to user sentiment and income levels. The OkCupid dataset was taken from Kaggle.
The dataset used in this project was sourced from public OkCupid profiles. The key fields in the dataset include:
- Mandatory Fields: Age, Status, Sex, Orientation
- Optional Fields with "Prefer Not to Say" Option: Body Type, Height, Languages, Ethnicity, Religion, Education, Employment, Sign, Drinks, Smoking, Drugs, Diet, Children, Pets
To prepare the dataset for analysis, the following steps were taken:
- Handling Missing Values:
  - Replace missing values in columns with "unknown".
  - Drop rows with missing values in the essay responses, as these are critical for sentiment analysis.
- Outlier Detection:
  - Remove outliers for age (greater than 69).
  - Replace missing values for height with the median value.
- Profile Completion Rates:
  - Calculate profile completion rates by counting the non-null fields for each profile.
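The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the column names `age` and `height`, and the `essay_cols` / `cat_cols` parameters are assumptions made for the example. The completion rate is computed as a share of non-null fields and is taken before any filling, so that filled-in placeholders do not count as completed fields.

```python
import pandas as pd


def preprocess(df, essay_cols, cat_cols, max_age=69):
    """Clean a profiles table (hypothetical helper; column names are assumed)."""
    out = df.copy()

    # Profile completion rate: share of non-null fields per profile,
    # measured on the raw data before any missing values are filled.
    out["completion_rate"] = df.notna().mean(axis=1)

    # Drop profiles with missing essay responses (needed for sentiment analysis).
    out = out.dropna(subset=essay_cols)

    # Remove age outliers (greater than max_age).
    out = out[out["age"] <= max_age].copy()

    # Fill missing heights with the median of the remaining profiles.
    out["height"] = out["height"].fillna(out["height"].median())

    # Replace missing values in the chosen categorical columns with "unknown".
    out[cat_cols] = out[cat_cols].fillna("unknown")

    return out.reset_index(drop=True)
```

A call might look like `preprocess(df, essay_cols=["essay0"], cat_cols=["diet", "pets"])`, where the column lists depend on how the dataset was loaded.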
The dataset had a significant number of missing values across various optional fields, such as income and offspring. The NaN values were replaced or filled as part of the data preprocessing.
Outliers in fields like age were detected and removed, and missing values in critical fields were replaced with appropriate values (e.g., median for height).
The dataset was analyzed to identify patterns in profile completion. Most users complete about 80% of their profiles, and key fields like income and offspring were frequently left blank.
The body type field was simplified into categories like overweight, curvy, athletic, and skinny to make analysis more robust.
Dietary preferences were grouped into categories such as omnivore, vegetarian, vegan, and other based on the user's response.
Job roles were categorized into broader groups such as Tech, Finance, Creative, and Blue Collar to reduce dimensionality and enhance interpretability.
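One way to implement this kind of grouping is a keyword lookup over the raw field text. The sketch below is an assumption about how it could be done, not the project's actual mapping: the keyword tables, the group labels, and the helper name `group_by_keyword` are all illustrative, and the raw strings in the dataset may differ.

```python
# Hypothetical keyword tables: the first group whose keyword appears in the
# raw value wins, so more specific groups should be listed first.
JOB_GROUPS = {
    "Tech": ("computer", "science", "engineering"),
    "Finance": ("banking", "financial"),
    "Creative": ("artistic", "musical", "writer"),
    "Blue Collar": ("construction", "craftsmanship", "transportation"),
}

DIET_GROUPS = {
    "vegan": ("vegan",),
    "vegetarian": ("vegetarian",),
    "omnivore": ("anything",),
}


def group_by_keyword(value, groups, default="other"):
    """Map a raw field value to a broad group; non-strings get the default."""
    if not isinstance(value, str):
        return default
    text = value.lower()
    for group, keywords in groups.items():
        if any(keyword in text for keyword in keywords):
            return group
    return default
```

With a pandas DataFrame, the helper could be applied column-wise, for example `df["job"].map(lambda v: group_by_keyword(v, JOB_GROUPS))`.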
The sentiment ratio of user essay responses was calculated using the Opinion Lexicon from NLTK. The ratio of positive to negative words in user essays was used to determine overall sentiment.
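A minimal sketch of such a ratio is shown below. The exact formula used in the project is not stated, so the add-one smoothing (which avoids division by zero for essays with no negative words), the simple regex tokenizer, and the function names are assumptions. The word sets are passed in as arguments; `load_opinion_lexicon` shows how they could be loaded from NLTK's `opinion_lexicon` corpus.

```python
import re

# Simple tokenizer (an assumption): lowercase words, allowing inner
# hyphens and apostrophes.
_TOKEN = re.compile(r"[a-z]+(?:[-'][a-z]+)*")


def sentiment_ratio(text, positive_words, negative_words):
    """Smoothed ratio of positive to negative word counts in a text."""
    tokens = _TOKEN.findall(text.lower())
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    # Add-one smoothing is an illustrative choice, not the project's formula.
    return (pos + 1) / (neg + 1)


def load_opinion_lexicon():
    """Load the positive and negative word sets from NLTK's Opinion Lexicon."""
    import nltk
    from nltk.corpus import opinion_lexicon

    nltk.download("opinion_lexicon", quiet=True)
    return set(opinion_lexicon.positive()), set(opinion_lexicon.negative())
```

Under this definition, values above 1 indicate more positive than negative words, and an essay with no lexicon words scores exactly 1.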
A Lasso Regression model was used to predict income levels based on features such as job type, drinking habits, pet preferences, and family planning choices. The model's coefficients indicated that lifestyle choices such as drinking habits and drug use were strongly associated with income.
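A Lasso fit of this kind could look roughly like the sketch below, using scikit-learn. The helper name `fit_income_lasso`, the `income` target column, the one-hot encoding via `pd.get_dummies`, and the `alpha` value are assumptions; the project's actual feature set and hyperparameters are not given here. Rows with unreported income are assumed to have been filtered out beforehand.

```python
import pandas as pd
from sklearn.linear_model import Lasso


def fit_income_lasso(df, feature_cols, target="income", alpha=1.0):
    """Fit a Lasso model and return it with coefficients sorted by magnitude."""
    # One-hot encode categorical features; numeric columns pass through unchanged.
    X = pd.get_dummies(df[feature_cols], dtype=float)
    y = df[target].astype(float)

    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)

    # Coefficients shrunk to exactly zero were dropped by the L1 penalty.
    coefs = pd.Series(model.coef_, index=X.columns)
    coefs = coefs.sort_values(key=abs, ascending=False)
    return model, coefs
```

The sign of each coefficient gives the direction of the association, and larger `alpha` values push more coefficients to zero, which is what makes Lasso useful for picking out a small set of income-related features.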
A Decision Tree Regressor was used to predict the sentiment ratio of user essays based on demographic and lifestyle features such as age, job, and drinking habits. The tree highlighted that age and income were strong predictors of sentiment.
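The tree-based model can be sketched in the same way with scikit-learn's `DecisionTreeRegressor`. As before, this is an illustration under assumptions: the helper name `fit_sentiment_tree`, the `sentiment_ratio` target column, and the `max_depth` setting are hypothetical choices for the example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def fit_sentiment_tree(df, feature_cols, target="sentiment_ratio",
                       max_depth=4, random_state=0):
    """Fit a regression tree and return it with ranked feature importances."""
    # One-hot encode categorical features; numeric ones (e.g. age) are kept as-is.
    X = pd.get_dummies(df[feature_cols], dtype=float)
    y = df[target].astype(float)

    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=random_state)
    tree.fit(X, y)

    # Impurity-based importances: higher means the feature drove more of the splits.
    importances = pd.Series(tree.feature_importances_, index=X.columns)
    return tree, importances.sort_values(ascending=False)
```

Inspecting the returned importances is one way to see which features, such as age or income, the tree relies on most when predicting sentiment.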
- Income Correlates: Retired individuals, people who dislike dogs, and individuals who want kids tend to have higher income levels. Additionally, drinking and drug use were positively correlated with income.
- Sentiment Analysis: Older users tend to have more positive essays, and men who drink frequently have higher positivity scores in their essays.
- Profile Completeness: Users tend to leave optional fields like diet, offspring, and religion blank, but generally fill in key mandatory fields like age and orientation.
- Profile Optimization: Encourage users to fill out more optional fields like offspring and religion, as these are critical for matching algorithms.
- Sentiment-Based Matching: Incorporate sentiment analysis in matching algorithms to pair users with similar sentiment scores. For users who haven't filled out essay responses, a score predicted from their demographic and lifestyle features could be used instead.
This project provides insights into OkCupid user behavior, including correlations between income, sentiment, and lifestyle choices. By leveraging these insights, OkCupid can improve its matching algorithms, enhance user experience, and potentially boost premium subscription sales.