NETFLIX MOVIES AND TV SHOWS CLUSTERING

Netflix is a subscription-based streaming service that gives its members access to a large library of movies and TV shows. With a catalog this size, users can struggle to find content that matches their preferences. To address this, Netflix uses data analysis and machine learning techniques such as clustering to group its content into similar categories. This project applies unsupervised machine learning algorithms to cluster Netflix movies and TV shows based on attributes such as genre, cast, and plot.

Data Analysis, Data Visualization, Feature Engineering, Natural Language Processing, Deep Learning, Statistics

Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, TensorFlow

Linear Regression, Lasso Regression, Ridge Regression, Elastic Net Regression, LightGBM, XGBoost

Unsupervised Learning, Individual Contribution

Project Overview

The Netflix Movies and TV Shows Clustering project aims to improve the user experience on Netflix by providing personalized content recommendations. It utilizes unsupervised machine learning techniques to group the platform's vast library of content into similar categories. By organizing the content library into clusters, Netflix can suggest titles that are more likely to match user interests, leading to increased user engagement and satisfaction.
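
As a rough illustration of the grouping step described above, the sketch below joins a few text attributes per title, vectorizes them with TF-IDF, and clusters the vectors with K-Means. The toy records, the column names (`listed_in`, `cast`, `description`), the choice of TF-IDF, and `n_clusters=2` are assumptions made for illustration, not the project's actual pipeline.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy catalog; the real dataset and its column names may differ.
catalog = pd.DataFrame({
    "title": ["Show A", "Show B", "Movie C", "Movie D"],
    "listed_in": ["Crime Dramas", "Crime Dramas", "Stand-Up Comedy", "Stand-Up Comedy"],
    "cast": ["Ana Ruiz", "Ben Cole", "Cara Diaz", "Dev Shah"],
    "description": [
        "A detective investigates a murder in a small town.",
        "A detective hunts a killer across the city.",
        "A comedian jokes about family life on stage.",
        "A comedian performs jokes about marriage on stage.",
    ],
})

# Join the text attributes into one string per title.
text = catalog[["listed_in", "cast", "description"]].agg(" ".join, axis=1)

# Turn the text into TF-IDF vectors, dropping common English stop words.
features = TfidfVectorizer(stop_words="english").fit_transform(text)

# Group the titles; n_clusters=2 only suits this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
catalog["cluster"] = kmeans.fit_predict(features)

print(catalog[["title", "cluster"]])
```

On real data the number of clusters would be chosen by comparing scores across several candidate values rather than fixed in advance.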

Key Findings

  • TV-MA is the most common rating, so a large share of the content on Netflix is aimed at mature audiences.
  • The United States is the country with the highest number of productions available on Netflix, followed by India and the United Kingdom.
  • Dramas, Comedies, and Documentaries are the most common genres of content on Netflix.
  • The correlation heatmap shows a moderate positive correlation between the duration of a movie and its release year.
  • A content-based recommender system was built using cosine similarity to make personalized recommendations to users based on the type of show they watched.
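
The cosine-similarity recommender from the last finding can be illustrated with a dependency-free sketch. The toy catalog, the `recommend` helper, and the plain word-count vectors are assumptions made for illustration; the project itself may vectorize text differently (for example with TF-IDF).

```python
import math
import re
from collections import Counter

# Hypothetical toy catalog: title -> descriptive text (genre and plot keywords).
CATALOG = {
    "Show A": "crime drama detective investigates murder",
    "Show B": "crime drama detective hunts killer",
    "Movie C": "stand-up comedy comedian jokes family",
    "Movie D": "romantic comedy couple wedding family",
}

def vectorize(text):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, best match first."""
    query = vectorize(CATALOG[title])
    scores = {
        other: cosine(query, vectorize(text))
        for other, text in CATALOG.items()
        if other != title
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("Show A", top_n=1))  # prints ['Show B']
```

"Show A" and "Show B" share three of their five words, giving a similarity of 3 / 5 = 0.6, while neither comedy shares any word with "Show A".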
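
| Model                   | Number of clusters | Silhouette Score | Calinski-Harabasz Score | Davies-Bouldin Score |
|-------------------------|--------------------|------------------|-------------------------|----------------------|
| K-Means Clustering      | 7                  | 0.00500          | 22.0021                 | 10.7600              |
| Hierarchical Clustering | 5                  | 0.00048          | 18.1425                 | 12.1666              |
| DBSCAN Clustering       | 17                 | -0.01480         | 2.8595                  | 1.4252               |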
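
All three scores in the table are available in scikit-learn. The sketch below computes them for the three models on synthetic data; the `make_blobs` data, the hyperparameters, and the `score_clustering` helper are assumptions made for illustration and will not reproduce the numbers above. Higher silhouette and Calinski-Harabasz scores and a lower Davies-Bouldin score indicate better-separated clusters.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the real feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

def score_clustering(X, labels):
    """Return the three scores, ignoring DBSCAN noise points (label -1)."""
    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:
        return None  # the scores are undefined for fewer than two clusters
    return {
        "n_clusters": n_clusters,
        "silhouette": silhouette_score(X[mask], labels[mask]),
        "calinski_harabasz": calinski_harabasz_score(X[mask], labels[mask]),
        "davies_bouldin": davies_bouldin_score(X[mask], labels[mask]),
    }

# Hyperparameters here are illustrative, not the project's tuned values.
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

results = {
    name: score_clustering(X, model.fit_predict(X))
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, scores)
```

Dropping noise points before scoring is one possible convention for DBSCAN; scoring noise as its own cluster is another, and the two can give noticeably different values.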

Tools and Skills

  • Python: Used for data analysis, preprocessing, and model building.
  • Pandas: Employed for data manipulation and analysis.
  • Matplotlib and Seaborn: Utilized for data visualization.
  • Scikit-learn: Used to implement clustering algorithms such as K-Means and Hierarchical Clustering, and PCA for dimensionality reduction.

Models Used

  • K-Means Clustering
  • Agglomerative Clustering
  • DBSCAN Clustering

Takeaways

  • Clustering helps Netflix provide personalized recommendations to users, improving user engagement.
  • Understanding user preferences through clustering enables Netflix to optimize content production and licensing decisions.
  • Unsupervised learning techniques are essential for analyzing large datasets and deriving meaningful insights without labeled data.

Acknowledgments

This project was completed as part of the Data Science Trainee program at AlmaBetter.
