TEDXTALK-Prediction

The goal is to build a model that predicts the number of views for videos uploaded to the TEDx website.

-- Project Status: [Completed]

Project Summary :

Problem Statement :

Build a predictive model that estimates how many views a video uploaded to the TEDx website will receive.

About the Data :

We have data from previous TED talk events, with data points such as the duration of the talk, its topics, and the speaker's occupation, plus textual features such as the transcript, title, and description. Most importantly, it includes the target variable: the view count of each video. The data covers 4,005 TED talks.

Dataset info

Number of records: 4,005
Number of attributes: 19

Features information:

The dataset contains features like:

talk_id: Talk identification number provided by TED
title: Title of the talk
speaker_1: First speaker in TED's speaker list
all_speakers: Speakers in the talk
occupations: Occupations of the speakers
about_speakers: Blurb about each speaker
recorded_date: Date the talk was recorded
published_date: Date the talk was published to TED.com
event: Event or medium in which the talk was given
native_lang: Language the talk was given in
available_lang: All available languages (lang_code) for a talk
comments: Count of comments
duration: Duration in seconds
topics: Related tags or topics for the talk
related_talks: Related talks (key='talk_id',value='title')
url: URL of the talk
description: Description of the talk
transcript: Full transcript of the talk

Target Variable :

views: Count of views for each talk

Approach taken :

The task was divided into 2 main parts :

Statistical analysis of the dataset to discover relationships between each feature and the target variable, so that management can use these relationships to make better business decisions.
Creating a machine learning pipeline that can take in the data for any new video and predict how many views it will generate on a daily basis. The pipeline had to be kept modular so that it can be retrained often as new data is collected.
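
A modular pipeline of the kind described above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the choice of columns (duration, comments, native_lang), the preprocessing steps, the helper names build_pipeline and make_toy_talks, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative column choices taken from the feature list; the real
# project may use a different subset.
NUMERIC = ["duration", "comments"]
CATEGORICAL = ["native_lang"]


def build_pipeline(random_state=0):
    """Bundle preprocessing and the regressor so both are refit together."""
    preprocess = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestRegressor(n_estimators=50, random_state=random_state)),
    ])


def make_toy_talks(n=200, seed=0):
    """Synthetic stand-in for the TED dataset (hypothetical values)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "duration": rng.integers(180, 1500, size=n).astype(float),
        "comments": rng.integers(0, 500, size=n).astype(float),
        "native_lang": rng.choice(["en", "es", "fr"], size=n),
    })
    df.loc[::25, "comments"] = np.nan  # simulate missing values
    df["views"] = 1000 * df["comments"].fillna(0) + 50 * df["duration"]
    return df


talks = make_toy_talks()
pipeline = build_pipeline()
pipeline.fit(talks[NUMERIC + CATEGORICAL], talks["views"])

# Retraining on fresh data is just another call to fit(); prediction for a
# new talk needs only a one-row DataFrame with the same columns.
new_talk = pd.DataFrame(
    {"duration": [600.0], "comments": [120.0], "native_lang": ["de"]}
)
predicted_views = pipeline.predict(new_talk)
```

Keeping preprocessing inside the pipeline object means a retrain on newly collected data refits the imputer and encoder together with the model, which is one way to meet the modularity requirement.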

Project Workflow

Importing Libraries
Loading the dataset
Data Cleaning
EDA on features
Feature selection
Fitting the regression models
Hyperparameter tuning
Evaluation metrics of the model
Final selection of the model
Conclusion

Technical Details for ML :

We used several algorithms (Random Forest, XGBoost, and CatBoost), with RandomizedSearchCV for hyperparameter tuning. Comparing R2 scores, the Random Forest and XGBoost models performed best.
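
The tuning and scoring step could look roughly like the sketch below. It uses scikit-learn's RandomizedSearchCV with a Random Forest regressor and reports R2 on a held-out split. The search space, the number of iterations, and the synthetic regression data are assumptions for illustration; the notebook's actual settings are not stated here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the cleaned TED features (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search space; each iteration samples one combination.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled configurations
    scoring="r2",      # same metric used to compare the models
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Score the refit best model on data it has not seen during the search.
test_r2 = r2_score(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, round(test_r2, 3))
```

The same search object can wrap an XGBoost or CatBoost regressor with its own parameter grid, so the R2 scores of the candidates are compared on an identical split.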

Technical Insights from exploring the Data :

● For the ML pipeline, the XGBoost model performed the best.
● For the NLP pipeline, the Random Forest model performed the best.
● Feature engineering and feature extraction helped increase model performance.
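
As a rough idea of what an NLP pipeline with a Random Forest model can look like, the sketch below turns text into TF-IDF features and fits a regressor on them. TF-IDF is an assumption here, since the text above does not say which text representation was used, and the sample sentences and view counts are made up for the example.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up descriptions and view counts (illustration only).
texts = [
    "how robots and technology will change science",
    "a new approach to education in rural schools",
    "the biology of sleep and the human brain",
    "building a startup as a young entrepreneur",
    "why activists matter for climate policy",
    "teaching science with simple technology",
]
views = [900_000, 400_000, 650_000, 1_200_000, 800_000, 500_000]

nlp_pipeline = Pipeline([
    # Text is converted to a sparse TF-IDF matrix (assumed representation).
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
nlp_pipeline.fit(texts, views)

prediction = nlp_pipeline.predict(["a talk about science and robots"])
```

In the real project the input would be the transcript, title, or description column rather than these toy sentences.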

Conclusions : Insights from exploring the Data :

● Topics like Technology, Science, Education, and Biology attract more viewer attention than other topics.
● Entrepreneurs and Activists are the most engaging speakers.
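
A per-topic comparison like the one behind these insights can be sketched with pandas. The example assumes the topics column stores each talk's tags as a string-encoded Python list, which is an assumption about the CSV format, and it uses three made-up rows rather than real data.

```python
import ast

import pandas as pd

# Made-up rows; the topics format (string-encoded list) is an assumption.
talks = pd.DataFrame({
    "talk_id": [1, 2, 3],
    "topics": [
        "['technology', 'science']",
        "['education']",
        "['science', 'education']",
    ],
    "views": [1000, 400, 600],
})

# Parse each string into a real list, then give every topic its own row.
talks["topic"] = talks["topics"].apply(ast.literal_eval)
per_topic = (
    talks.explode("topic")
    .groupby("topic")["views"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(per_topic)
```

Reporting the count next to the mean helps avoid over-reading topics that appear in only a handful of talks.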

Python Libraries used

Data wrangling :

NumPy
Pandas

For Graphing :

Matplotlib
Seaborn

Machine learning :

Scikit-Learn
Scikit-Optimize (skopt)
XGBoost
CatBoost

Miscellaneous :

Google Colab tools

📜 Credits

< Sarvesh > | Data Scientist | Machine Learning Engineer | Deep Learning enthusiast

LinkedIn: Contact me for Data Science Project Collaborations

YouTube: Follow me for interesting AI/ML Projects

References:

https://scikit-learn.org/stable/

https://www.nltk.org/

https://catboost.ai/

https://xgboost.ai/

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://www.kdnuggets.com/2020/05/hyperparameter-optimization-machine-learning-models.html
