TU/e Bachelor Data Science 2021 • DBL Data Challenge Q4 • Group 17
- Loading in a big tweet dataset in json format
- Cleaning & filtering of tweets
- Relational database querying for optimal efficiency
- Categorizing and analyzing tweets
- Creating visualizations
This GitHub repo contains our project, consisting of Jupyter Notebooks that load and process a large tweet dataset. This guide walks you through the installation process.
The following software was used during the project and is required to run it:
- Python 3 - An interpreted, object-oriented, high-level programming language with dynamic semantics
- Jupyter Notebook - An open source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text
- PostgreSQL - An open source relational database management system
Please follow this tutorial by the YouTuber Jinu Jawad M, in which he explains how to set up a database with PostgreSQL and how to create a user with a password.
When setting this up, please ensure that the parameters (database name, username, password) match the ones displayed below.
database_host = "localhost"
database_name = "dbl_data_challenge"
database_user = "admin"
database_pass = "vZtbqKNXGz27cQCH"
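With the database configured, the notebooks can connect to it through Psycopg2. A minimal sketch of such a connection is shown below, using the parameters above; the `connect` helper name is illustrative, and `psycopg2` is imported inside the function so the sketch reads even without the adapter installed.

```python
# Database parameters, matching the values above.
database_host = "localhost"
database_name = "dbl_data_challenge"
database_user = "admin"
database_pass = "vZtbqKNXGz27cQCH"

def connect():
    """Return an open connection to the project database."""
    import psycopg2  # PostgreSQL adapter; must be installed first
    return psycopg2.connect(
        host=database_host,
        dbname=database_name,
        user=database_user,
        password=database_pass,
    )

if __name__ == "__main__":
    # Quick sanity check that the credentials work.
    with connect() as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            print(cur.fetchone()[0])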
The Jupyter Notebooks use a number of publicly available Python libraries to work properly:
- Psycopg2 - PostgreSQL Database Adapter
- Pandas - Powerful data structures for data analysis, time series, and statistics
- Numpy - The fundamental package for array computing with Python
- Sklearn - A set of Python modules for machine learning and data mining
- Seaborn - Statistical data visualization
- Nltk - Natural Language Toolkit
- Tensorflow - TensorFlow is an end-to-end open source platform for machine learning
- Emoji - Emoji for Python
- Matplotlib - Plotting package
- Json - Encoding basic Python object hierarchies
- Os - A portable way of using operating system dependent functionality
- Time - Provides various time-related functions
- Re - Provides regular expression matching operations similar to those found in Perl
- Statistics - Provides functions for calculating mathematical statistics of numeric data
- Strings - Strings for humans
- Wordcloud - Create a word cloud
- Warnings - Alert the user of some condition in a program
More information related to these libraries will be given in the Installation section down below.
To download the exact Twitter tweet dataset that was used to conduct the research, visit this link. The dataset consists of 500+ files (roughly 30GB in total) with tweets in json format; every single line in a file contains the data for a separate tweet. Please place these json files in a folder called data, with the following structure:
dbl_data_challenge /
1. Extract Tweets.ipynb
2. Extract Conversations.ipynb
...
data /
airlinetweets1.json
airlinetweets2.json
...
If you want to use another dataset, please make sure that the files are in json format, that every line of each file contains a single tweet, and that the data was obtained using Twitter API V1. Otherwise the data will very likely be incompatible with the current code.
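The line-per-tweet layout described above can be read with plain `json.loads`, one line at a time. Below is a small sketch of such a reader (the helper name and file path are illustrative, not part of the notebooks); it skips blank or malformed lines rather than failing on them.

```python
import json

def iter_tweets(lines):
    """Yield one tweet dict per valid JSON line; skip blank/malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON, skip this line

# Usage with a real file (path is illustrative):
# with open("data/airlinetweets1.json", encoding="utf-8") as f:
#     for tweet in iter_tweets(f):
#         ...

# Small in-memory example: two valid lines, one blank, one malformed.
sample = ['{"id": 1}', '', 'not json', '{"id": 2}']
tweets = list(iter_tweets(sample))
print(len(tweets))  # 2
```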
This project requires the software mentioned in Software and Setting up the database to be installed. The libraries mentioned in Libraries are also mandatory; however, these will be automatically installed and imported within the Jupyter Notebooks. The dataset that was used can be found and downloaded in the Dataset section; also make sure to follow the instructions mentioned there to ensure that the code runs properly.
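The automatic installation inside the notebooks can be sketched as follows: try to import a module and, if that fails, install it with pip through the running interpreter before importing again. This is a generic pattern, not the exact cell from the notebooks, and the package names are illustrative.

```python
import importlib
import subprocess
import sys

def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it via pip first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install into the same interpreter that runs the notebook.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        return importlib.import_module(module_name)

# Example: the stdlib json module is already present, so no install happens.
json_module = ensure_package("json")
```

Note that for Psycopg2 the pip package name differs from the module name (`pip_name="psycopg2-binary"` is a common choice), which is why the helper takes both arguments.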
Please run the Jupyter Notebooks in the following order:
1. Extract Tweets.ipynb
2. Extract Conversations.ipynb
3. Extract Replies.ipynb
4. Extract Root Groups.ipynb
5. Extract ABA Groups.ipynb
6. Sentiment Analysis.ipynb
7. Create Visualizations.ipynb