TU/e Bachelor Data Science 2021 • DBL Data Challenge Q4 • Group 17
- Loading in a big tweet dataset in json format
- Cleaning & filtering of tweets
- Relational database querying for optimal efficiency
- Categorizing and analyzing tweets
- Creating visualizations
This GitHub repo contains our project, consisting of Jupyter Notebooks that load and process a large tweet dataset. This guide walks you through the installation process.
The following software was used during the project and is required to run it:
- Python 3 - An interpreted, object-oriented, high-level programming language with dynamic semantics
- Jupyter Notebook - An open source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text
- PostgreSQL - An open source relational database management system
Please follow this tutorial by the YouTuber Jinu Jawad M, in which he explains how to set up a database with PostgreSQL and how to create a user with a password.
When setting this up, please ensure that the parameters (database name, username, password) match the ones displayed below.
database_host = "localhost"
database_name = "dbl_data_challenge"
database_user = "admin"
database_pass = "vZtbqKNXGz27cQCH"
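With the database configured, the notebooks can connect to it through Psycopg2. A minimal sketch of such a connection is shown below, using the parameters above; the `connect` helper name is illustrative, and `psycopg2` is imported inside the function so the sketch reads even without the adapter installed.

```python
# Database parameters, matching the values above.
database_host = "localhost"
database_name = "dbl_data_challenge"
database_user = "admin"
database_pass = "vZtbqKNXGz27cQCH"

def connect():
    """Return an open connection to the project database."""
    import psycopg2  # PostgreSQL adapter; must be installed first
    return psycopg2.connect(
        host=database_host,
        dbname=database_name,
        user=database_user,
        password=database_pass,
    )

if __name__ == "__main__":
    # Quick sanity check that the credentials work.
    with connect() as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            print(cur.fetchone()[0])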
The Jupyter Notebooks use a number of publicly available Python libraries to work properly:
- Psycopg2 - PostgreSQL Database Adapter
- Pandas - Powerful data structures for data analysis, time series, and statistics
- Numpy - The fundamental package for array computing with Python
- Sklearn - A set of Python modules for machine learning and data mining
- Seaborn - Statistical data visualization
- Nltk - Natural Language Toolkit
- Tensorflow - TensorFlow is an end-to-end open source platform for machine learning
- Emoji - Emoji for Python
- Matplotlib - Plotting package
- Json - Encoding basic Python object hierarchies
- Os - A portable way of using operating system dependent functionality
- Time - Provides various time-related functions
- Re - Provides regular expression matching operations similar to those found in Perl
- Statistics - Provides functions for calculating mathematical statistics of numeric data
- Strings - Strings for humans
- Wordcloud - Create a word cloud
- Warnings - Alert the user of some condition in a program
More information related to these libraries will be given in the Installation section down below.
To download the exact Twitter tweet dataset that was used to conduct the research, visit this link. The dataset consists of 500+ files (roughly 30GB in total) with tweets in json format; every single line in a file contains the data for a separate tweet. Please place these json files in a folder called data, with the following structure:
dbl_data_challenge /
1. Extract Tweets.ipynb
2. Extract Conversations.ipynb
...
data /
airlinetweets1.json
airlinetweets2.json
...
If you want to use another dataset, please make sure that the files are in json format, that every line of each file contains a single tweet, and that the data was obtained using Twitter API V1. Otherwise the data will very likely be incompatible with the current code.
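The line-per-tweet layout described above can be read with plain `json.loads`, one line at a time. Below is a small sketch of such a reader (the helper name and file path are illustrative, not part of the notebooks); it skips blank or malformed lines rather than failing on them.

```python
import json

def iter_tweets(lines):
    """Yield one tweet dict per valid JSON line; skip blank/malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON, skip this line

# Usage with a real file (path is illustrative):
# with open("data/airlinetweets1.json", encoding="utf-8") as f:
#     for tweet in iter_tweets(f):
#         ...

# Small in-memory example: two valid lines, one blank, one malformed.
sample = ['{"id": 1}', '', 'not json', '{"id": 2}']
tweets = list(iter_tweets(sample))
print(len(tweets))  # 2
```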
This project requires the software mentioned in Software and Setting up the database to be installed. The libraries mentioned in Libraries are also mandatory; however, these will be automatically installed and imported within the Jupyter Notebooks. The dataset that was used can be found and downloaded in the Dataset section; also make sure to follow the instructions mentioned there to ensure that the code runs properly.
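The automatic installation inside the notebooks can be sketched as follows: try to import a module and, if that fails, install it with pip through the running interpreter before importing again. This is a generic pattern, not the exact cell from the notebooks, and the package names are illustrative.

```python
import importlib
import subprocess
import sys

def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it via pip first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install into the same interpreter that runs the notebook.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        return importlib.import_module(module_name)

# Example: the stdlib json module is already present, so no install happens.
json_module = ensure_package("json")
```

Note that for Psycopg2 the pip package name differs from the module name (`pip_name="psycopg2-binary"` is a common choice), which is why the helper takes both arguments.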
Please run the Jupyter Notebooks in the following order:
1. Extract Tweets.ipynb
2. Extract Conversations.ipynb
3. Extract Replies.ipynb
4. Extract Root Groups.ipynb
5. Extract ABA Groups.ipynb
6. Sentiment Analysis.ipynb
7. Create Visualizations.ipynb