The web-app for interacting with the Spotify recommendation system can be found here: https://spotify-podcast-clustering.onrender.com
Note: The intro animation is taken from: https://www.youtube.com/watch?v=cB8JW-uLuC4.
If plots or tables don't fit properly, press Ctrl - or Ctrl + to adjust the zoom level until the layout looks satisfactory.
Here is a short video to showcase the web-app:
spotify-web-app.mp4
Spotify has already developed several music-specific derived features for each track, specifically: (a) acousticness, (b) danceability, (c) energy, (d) instrumentalness, (e) speechiness, and (f) valence. All of these metrics lie in the range [0,1] and can be used to cluster users based on their musical taste. This project aims to construct similarly informative metrics for Spotify’s podcast data and to build a novel recommendation system based on these metrics.
Using Selenium and the Spotify API:
- fetch_top_podcast.py: scrape all top 50 podcasts from here for each genre.
- fetch_podcast_details.py: retrieve metadata, filtered for English podcasts, resulting in 818 podcasts.
- fetch_episode_details.py: scrape details for all episodes, giving us a total of 284,481 episodes.
The Python library nltk (Natural Language Toolkit) is used to clean and tokenize the episode descriptions. In summary, the following cleaning is done using clean_description.py:
- Text Normalization: accent removal, lowercasing, whitespace normalization.
- Sentence-Level Cleaning: contraction expansion, URL removal, promotional density check.
- Token-Level Cleaning: lemmatization, stopword removal, promotional keyword removal, character validation, length check, dictionary validation, special character removal.
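A few of the cleaning steps listed above can be sketched as follows. This is a minimal illustration only, not the contents of clean_description.py: it uses just the Python standard library instead of nltk so that it runs without corpus downloads. The names `normalize_text`, `clean_tokens`, `URL_PATTERN`, and the tiny `STOPWORDS` set are hypothetical. Lemmatization, contraction expansion, promotional filtering, and dictionary validation are left out.

```python
import re
import unicodedata

# Hypothetical URL matcher; the actual script may use a different pattern.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")
# Tiny illustrative stopword set; a real pipeline would use a full list (e.g. from nltk).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "for", "is", "at"}


def normalize_text(text):
    """Text normalization: accent removal, lowercasing, whitespace normalization."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", stripped.lower()).strip()


def clean_tokens(text, min_length=2):
    """Remove URLs, normalize, tokenize, then apply length and stopword filters."""
    text = URL_PATTERN.sub(" ", text)     # URL removal
    text = normalize_text(text)
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetic runs only
    return [t for t in tokens if len(t) >= min_length and t not in STOPWORDS]


print(clean_tokens("Café talks at https://example.com   about the NEWS!"))
```

With the toy input above, the URL, the accent, the punctuation, and the stopwords "at" and "the" are removed, leaving four lowercase tokens.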
The compute_metrics.py
script computes three metrics:
The NTFS metric measures the cosine similarity between two frequency vectors and is defined as:
Strengths: Robust for sparse vectors. Weakness: Assumes all tokens are equally important.
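To illustrate the cosine-similarity idea behind NTFS, here is a minimal sketch that assumes raw token counts as the frequency vectors. The function name `cosine_similarity` and the zero-vector convention are assumptions made for this example; the exact weighting in compute_metrics.py may differ.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the token-count vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty inputs
    return dot / (norm_a * norm_b)
```

Because only shared tokens enter the dot product, the computation stays cheap for sparse vectors, which is consistent with the strength noted above.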
The JTS metric measures the proportion of overlapping tokens.
Strengths: Simple and interpretable measure of overlap. Weakness: Sensitive to scaling.
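A token-overlap proportion of this kind can be sketched with a plain Jaccard index over token sets. This assumes JTS is Jaccard-style; the project's exact definition may be a weighted variant. The function name `jaccard_similarity` is hypothetical.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Share of distinct tokens that appear in both token lists."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here when both inputs are empty
    return len(set_a & set_b) / len(union)
```

For example, the token lists `["a", "b", "c"]` and `["b", "c", "d"]` share two of four distinct tokens.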
The third metric uses L1-normalized frequency vectors that emphasize token diversity.
Strength: Highlights diversity. Weakness: Assumes uniform importance across tokens.
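The L1 normalization step itself can be sketched as below. Only the normalization is shown, since how the normalized vectors are then compared is not spelled out here. The function name `l1_normalize` is hypothetical.

```python
from collections import Counter


def l1_normalize(tokens):
    """Map each token to its share of the total token count (shares sum to 1)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: count / total for token, count in counts.items()}
```

After normalization, a long description and a short one with the same token proportions produce the same vector, so the comparison reflects how tokens are distributed rather than how many there are.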
The resulting combined matrix (where each element in
where
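Suppose an arbitrary podcast is selected as the reference.
Next, we quantify dissimilarity by computing the Euclidean (2-norm) distance of every other podcast with respect to the reference podcast.
Finally, we sort by distance (lowest to highest) and report the closest podcasts as recommendations.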
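The distance-and-sort step can be sketched as follows, assuming each podcast is represented by a short vector of metric values. The function name `recommend`, the parameter `k`, and the toy vectors are made up for illustration and are not taken from the project's data.

```python
import math


def recommend(vectors, query, k=3):
    """Return the k podcasts closest to `query` by Euclidean distance."""
    target = vectors[query]
    distances = [
        (name, math.dist(target, vec))  # Euclidean 2-norm distance
        for name, vec in vectors.items()
        if name != query                # skip the reference podcast itself
    ]
    distances.sort(key=lambda pair: pair[1])  # lowest distance first
    return distances[:k]


# Toy metric vectors (made-up values).
vectors = {
    "podcast_a": [0.9, 0.1, 0.3],
    "podcast_b": [0.8, 0.2, 0.3],
    "podcast_c": [0.1, 0.9, 0.7],
    "podcast_d": [0.85, 0.15, 0.35],
}
print(recommend(vectors, "podcast_a", k=2))
```

In this toy example, `podcast_d` ranks first and `podcast_b` second, because their vectors lie closest to that of `podcast_a`.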