📚 tsundoku is a Python toolkit to analyze Twitter data.

ElsevierSoftwareX/SOFTX-D-24-00276


📚 tsundoku

tsundoku is a Python toolkit to analyze X/Twitter data, following the methodology published in:

Graells-Garrido, E., Baeza-Yates, R., & Lalmas, M. (2020, July). Every colour you are: Stance prediction and turnaround in controversial issues. In 12th ACM Conference on Web Science (pp. 174-183).

Development Setup

We use Mamba to install all the necessary packages. First, install Mamba if you have not already:

# Install Mamba if you haven't already
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh

After installing Mamba:

# Clone repository
git clone http://github.com/PLUMAS-research/tsundoku

# Move into folder
cd tsundoku

# Create the environment and install dependencies
make mamba-create-env

# Activate the environment 
mamba activate tsundoku

# make the tsundoku module available in your environment
make install-package

Optionally, you can analyze the data generated by tsundoku in a Jupyter environment. In that case, you will need to install a kernel:

# install kernel for use within Jupyter
make install-kernel

If you want to test your installation, you may execute:

python -m tsundoku.utils.test

Environment Configuration

Create an .env file in the root of this repository with the following structure:

TSUNDOKU_PROJECT_PATH=./softwarex
JSON_TWEET_PATH=./test_data/sample_public
TWEET_PATH=./test_data/parquet_public

This is the meaning of each option:

  • TSUNDOKU_PROJECT_PATH: path to your project configuration (this is explained below).
  • JSON_TWEET_PATH: directory where the source tweets are stored. The code assumes that the tweets were crawled with the Streaming API and stored in JSON format, one tweet per line, in gzip-compressed files. In particular, each file is assumed to contain 10 minutes of tweets. The system also assumes that the tweets were pre-processed by flattening their structure. These files may relate to any project.
  • TWEET_PATH: folder where the system stores tweets in Apache Parquet format.
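As a rough illustration of the input format described above (gzip-compressed files with one flattened JSON tweet per line), the following sketch reads such a file using only the Python standard library. The file name and the `id` and `text` fields are hypothetical placeholders, not the schema that tsundoku expects.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_tweets(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny sample file so the sketch runs on its own.
# The field names below are illustrative assumptions.
sample = [
    {"id": 1, "text": "first tweet"},
    {"id": 2, "text": "second tweet"},
]
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.json.gz"
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for tweet in sample:
        handle.write(json.dumps(tweet) + "\n")

tweets = list(read_tweets(sample_path))
print(len(tweets))
```

Reading line by line keeps memory use flat, which matters when each file holds 10 minutes of streamed tweets.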

Files present in test_data/sample_public have already been flattened. They are in JSON format, one tweet per line. For analysis, you will need to convert those files to Parquet. You can do so with the following command:

python -m tsundoku.data.parse_json_to_parquet 20220501

Project Configuration

The TSUNDOKU_PROJECT_PATH folder defines a project. It contains the following files and folders:

  • config.toml: project configuration.
  • groups/*.toml: classifier configuration for several groups of users. The groups are arbitrary, so you can define your own. The only mandatory one is called relevant.toml.
  • experiments.toml: experiment definition and classifier hyper-parameters. Experiments enable analysis in different periods (for instance, first and second round of a presidential election).
  • keywords.txt (optional): set of keywords to filter tweets. For instance, presidential candidate names, relevant hashtags, etc.
  • stopwords.txt (optional): list of stop words.

Please see the example in the softwarex folder, which contains a full project that uses the data in test_data.
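Before running any command, it can help to check that a project folder has the expected layout. The sketch below is not part of tsundoku. It assumes that every entry in the list above that is not marked as optional is required, and the `missing_entries` helper is a hypothetical name.

```python
import tempfile
from pathlib import Path

# Entries assumed to be required: those not marked "(optional)" in this README.
REQUIRED = ["config.toml", "experiments.toml", "groups/relevant.toml"]


def missing_entries(project_path):
    """Return the required entries that are absent from project_path."""
    root = Path(project_path)
    return [name for name in REQUIRED if not (root / name).exists()]


# Demonstration on a throwaway folder (not a real tsundoku project).
demo = Path(tempfile.mkdtemp())
(demo / "groups").mkdir()
(demo / "config.toml").write_text("")
(demo / "groups" / "relevant.toml").write_text("")

missing = missing_entries(demo)
print(missing)  # experiments.toml was deliberately left out of the demo folder
```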

In config.toml you will need to configure at least the following attribute:

[project.settings]
data_path = "/home/USERNAME/path_to_project/data"

The data_path attribute states where the imported data will be stored after filtering with your specified keywords.

Data and Projects

tsundoku has three folders within the project data folder: raw, interim, and processed.

The raw folder contains a subfolder for each day you aim to analyze. The format is YYYY-MM-DD.

The following command imports a specific date from TWEET_PATH:

$ ./tsundoku-cli import_date 20220501

This imports that specific day into the project.
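Note that the command takes the date as YYYYMMDD, whereas the folders under raw are named YYYY-MM-DD. The following sketch converts between the two formats. The `raw_folder_for` helper is hypothetical, and the assumption that an imported day lands in `raw/YYYY-MM-DD` inside the data folder is inferred from the description above, not taken from the tsundoku source code.

```python
from datetime import datetime
from pathlib import Path


def raw_folder_for(date_arg, data_path):
    """Map a YYYYMMDD argument to the matching raw/YYYY-MM-DD folder."""
    # strptime also validates the input: an impossible date raises ValueError.
    day = datetime.strptime(date_arg, "%Y%m%d").date()
    return Path(data_path) / "raw" / day.isoformat()


folder = raw_folder_for("20220501", "data")
print(folder.as_posix())
```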

For every day of data you can compute features, such as document-term matrices:

$ ./tsundoku-cli compute_features 20220501

You may import multiple days using the --days n parameter (with n being an integer).

In experiments.toml, you define experiments such as:

[experiments]
[experiments.workers_day]
key = 'workers_day'
folder_start = '2022-05-01'
folder_end = '2022-05-01'
discussion_only = 1
discussion_directed = 0

In this case, there is a single experiment, with key workers_day. You can analyze it with the following commands:

  1. $ ./tsundoku-cli prepare_experiment workers_day: this command prepares the features for the specific experiment. For instance, an experiment has start and end dates, so it consolidates only the data between those dates.
  2. $ ./tsundoku-cli classify_users workers_day relevance: this command predicts whether a user profile is relevant or not (noise) for the experiment. It uses an XGB (XGBoost) classifier.
  3. $ ./tsundoku-cli classify_users workers_day stance: this command predicts groups within users. The sample configuration includes stance. You can define as many groups as you want. Note that for each group you must define categories in the corresponding .toml file. In this file, if a category is called noise, users who fall into that category will be discarded when consolidating results.
  4. $ ./tsundoku-cli consolidate_analysis workers_day stance: this command takes the result from the classification and consolidates the analysis with respect to interaction networks, vocabulary, and other features. It requires a reference group to base the analysis on (for instance, stance allows you to characterize the supporters of a political position).
  5. $ ./tsundoku-cli generate_report workers_day stance: this command generates a summary report for the workers_day experiment in HTML format, having stance as a reference group.
  6. $ ./tsundoku-cli open_report workers_day: this command opens a Web browser to display the corresponding report.
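When you run the same sequence for several experiments, it may be convenient to build the six commands above programmatically. The sketch below only assembles the argument lists and does not execute anything. The `pipeline_commands` helper is a hypothetical name and is not part of tsundoku.

```python
def pipeline_commands(experiment, group, cli="./tsundoku-cli"):
    """Return the six commands listed above as argument lists, in order."""
    return [
        [cli, "prepare_experiment", experiment],
        [cli, "classify_users", experiment, "relevance"],
        [cli, "classify_users", experiment, group],
        [cli, "consolidate_analysis", experiment, group],
        [cli, "generate_report", experiment, group],
        [cli, "open_report", experiment],
    ]


commands = pipeline_commands("workers_day", "stance")
for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` so that a failing step stops the sequence before later steps run on incomplete results.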

About the name

Tsundoku is a Japanese word (積ん読) that means "to pile books without reading them" (see more in Wikipedia). It is common to crawl data continuously and do nothing with them later. So, tsundoku provides a way to work with all those piled-up datasets (mainly in the form of tweets).
