Cascade Influence

This repository contains:

  • The scripts to estimate user influence from Twitter information cascades (i.e. Cas.In);
  • A small dataset of 20 cascades for testing Cas.In;
  • A hands-on tutorial to walk you through running Cas.In on real cascades.

Citation

The algorithm was introduced in the paper:

Rizoiu, M.-A., Graham, T., Zhang, R., Zhang, Y., Ackland, R., & Xie, L. (2018). #DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 U.S. Presidential Debate. In Proc. International AAAI Conference on Web and Social Media (ICWSM ’18) (pp. 1–10). Stanford, CA, USA.
A PDF with supplementary material is available on arXiv: https://arxiv.org/abs/1802.09808

Bibtex

@inproceedings{rizoiu2018debatenight,
    address = {Stanford, CA, USA},
    author = {Rizoiu, Marian-Andrei and Graham, Timothy and Zhang, Rui and Zhang, Yifei and Ackland, Robert and Xie, Lexing},
    booktitle = {International AAAI Conference on Web and Social Media (ICWSM '18)},
    title = {{{\#}DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 U.S. Presidential Debate}},
    url = {https://arxiv.org/abs/1802.09808},
    year = {2018}
}

License

Both dataset and code are distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, a copy of which can be obtained following this link. If you require a different license, please contact Yifei Zhang, Marian-Andrei Rizoiu or Lexing Xie.

How to run Cas.In in a terminal:

Required packages:

  • python3
  • numpy
  • pandas

Arguments of Cas.In:

--cascade_path : the path to the cascade file (see the file format described below).

--time_decay : the time decay coefficient (hyperparameter $r$ in the paper). Default: -0.000068

--save2csv : save the result to a CSV file. Default: False

Command:

cd scripts
python3 influence.py --cascade_path path/to/file

File format and toy dataset

Dataset

We provide a toy dataset -- dubbed SMH -- for testing Cas.In. It was collected in 2017 by following the Twitter handle of the Sydney Morning Herald newspaper (tweets and retweets mentioning SMH or linking to an article from SMH).

The data contains 20 cascades (one file per cascade). We anonymized the user_id (as per Twitter's ToS) by mapping the original values to a sequence from 0 to n, while preserving the identity of users across cascades.

The format of cascade files:

  • A csv file with 3 columns (time, magnitude, user_id), where each row is a tweet in the cascade:
    • time is the timestamp of the tweet: the first tweet is always at time zero, and each following retweet gives its offset in seconds from the initial tweet;
    • magnitude is the local influence of the user (here, the number of followers);
    • user_id is the id of the user emitting the tweet (here anonymized).
  • The rows in the file (i.e. the tweets) are sorted by timestamp.

For example:

time,magnitude,user_id 
0,4674,"0"
321,1327,"1"
339,976,"2"
383,477,"3"
699,1209,"4"
824,119,"5"
835,1408,"6"
1049,896,"7"
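
The format rules above can be checked before running Cas.In. The following is a minimal sketch using only the Python standard library; the helper name `check_cascade` is hypothetical and is not part of the Cas.In scripts. It checks the header, the number of fields per row, that the first tweet is at time zero, and that rows are sorted by time.

```python
import csv
import io


def check_cascade(csv_text):
    """Return a list of format problems found in a cascade CSV (empty if none).

    Hypothetical helper for illustration; not part of the Cas.In scripts.
    """
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    # Strip whitespace so a trailing space after "user_id" is tolerated.
    header = [name.strip() for name in next(reader, [])]
    if header != ["time", "magnitude", "user_id"]:
        problems.append("unexpected header: %r" % header)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        problems.append("cascade has no tweets")
        return problems
    if any(len(row) != 3 for row in rows):
        problems.append("every row must have exactly 3 fields")
        return problems
    times = [float(row[0]) for row in rows]
    if times[0] != 0:
        problems.append("first tweet must be at time 0")
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        problems.append("rows are not sorted by time")
    return problems


example = '''time,magnitude,user_id
0,4674,"0"
321,1327,"1"
339,976,"2"
383,477,"3"
'''
print(check_cascade(example))  # an empty list means no problems were found
```

To check a file on disk, read its contents with `open(path).read()` and pass the resulting string to the helper.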

Cascade influence tutorial

Next, we walk you through using Cas.In to estimate user influence, starting from a single cascade.

Preliminary

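First, load the packages required by Cas.In. The imports assume that Python is started from the scripts directory.

# Run from the scripts/ directory (e.g. after `cd scripts` in the terminal)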
import pandas as pd
import numpy as np
from casIn.user_influence import P,influence

Compute influence in one cascade

Read data

Load the first cascade in the SMH toy dataset:

cascade = pd.read_csv("../data/SMH/SMH-cascade-0.csv")
cascade.head()
   time  magnitude  user_id
0     0        991      419
1   127       1352      658
2  2149       2057      264
3  2465       1155     1016
4  2485       1917      790

Compute matrix P

We first need to compute the probabilities $p_{ij}$, where $p_{ij}$ is the probability that the $j^{th}$ tweet is a direct retweet of the $i^{th}$ tweet (see the paper for more details). This requires specifying the hyperparameter $r$, the time decay coefficient. Here we choose $r = -0.000068$.

p_ij = P(cascade,r = -0.000068)
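
To give an intuition for what such a matrix looks like, here is a minimal plain-Python sketch. It is not the implementation of `P()`; it assumes that the score of tweet $i$ being the direct parent of a later tweet $j$ is the magnitude of $i$ multiplied by an exponential time decay, $m_i e^{r (t_j - t_i)}$ with a negative $r$, normalized over all tweets that precede $j$. The function name `retweet_probabilities` is hypothetical; consult the paper and the scripts for the exact definition.

```python
import math


def retweet_probabilities(times, magnitudes, r=-0.000068):
    """Sketch: p[i][j] is the probability that tweet j directly retweets tweet i.

    ASSUMPTION (not taken from the Cas.In code): the score of i -> j is
    magnitudes[i] * exp(r * (times[j] - times[i])) for i < j, normalized
    over all earlier tweets. r is negative, so older tweets are discounted.
    Assumes at least one earlier tweet has a non-zero magnitude.
    """
    n = len(times)
    p = [[0.0] * n for _ in range(n)]
    for j in range(1, n):  # the first tweet has no parent
        scores = [
            magnitudes[i] * math.exp(r * (times[j] - times[i]))
            for i in range(j)
        ]
        total = sum(scores)
        for i in range(j):
            p[i][j] = scores[i] / total
    return p


# First rows of the example cascade from the file format section.
times = [0, 321, 339, 383]
magnitudes = [4674, 1327, 976, 477]
p = retweet_probabilities(times, magnitudes)
for row in p:
    print(["%.3f" % value for value in row])
```

Under this assumption the matrix is strictly upper triangular and every column except the first sums to one, because each retweet has exactly one direct parent among the earlier tweets.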

Compute user influence and matrix M

The function influence() returns an array with one influence value per tweet in the cascade (i.e. per row) and the matrix $M = [m_{ij}]$, where $m_{ij}$ is the influence of the $i^{th}$ tweet on the $j^{th}$ tweet (direct and indirect).

inf, m_ij = influence(p_ij)
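
The following plain-Python sketch illustrates how direct retweet probabilities can be propagated into direct-plus-indirect influence. It is not the implementation of `influence()`: it assumes the recursion $m_{ii} = 1$ and $m_{ij} = \sum_{k<j} m_{ik} p_{kj}$ for $i < j$, and takes the influence of tweet $i$ to be the sum of row $i$ of $M$. The function name `influence_from_p` and the small matrix `p` are hypothetical; see the paper for the exact definition.

```python
def influence_from_p(p):
    """Sketch: derive per-tweet influence and matrix M from probabilities p.

    ASSUMPTION (not taken from the Cas.In code): m[i][i] = 1 and, for i < j,
    m[i][j] = sum over k < j of m[i][k] * p[k][j]. The influence of tweet i
    is the sum of row i of M.
    """
    n = len(p)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0
        for j in range(i + 1, n):
            # m[i][k] is zero for k < i, so the sum can start at k = i.
            m[i][j] = sum(m[i][k] * p[k][j] for k in range(i, j))
    inf = [sum(row) for row in m]
    return inf, m


# Hypothetical 3-tweet cascade: tweet 1 surely retweets tweet 0, while
# tweet 2 retweets tweet 0 with probability 0.6 and tweet 1 with 0.4.
p = [
    [0.0, 1.0, 0.6],
    [0.0, 0.0, 0.4],
    [0.0, 0.0, 0.0],
]
inf, m = influence_from_p(p)
print(inf)
```

Under this assumed recursion the first tweet's influence equals the number of tweets in the cascade, since every later tweet descends from it, and the last tweet's influence is one.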

Link influence with user_id

Now, we add the computed user influence back to the pandas data structure.

cascade["influence"] = pd.Series(inf)
cascade.head()
   time  magnitude  user_id  influence
0     0        991      419  60.000000
1   127       1352      658  34.590370
2  2149       2057      264  29.656122
3  2465       1155     1016  13.535845
4  2485       1917      790  15.913873

Compute influence over multiple cascades

Load function

The function casIn() computes the influence in one cascade; it wraps all the steps described above.

from casIn.user_influence import casIn
influence = casIn(cascade_path="../data/SMH/SMH-cascade-0.csv",time_decay=-0.000068)
influence.head()
   time  magnitude  user_id  influence
0     0        991      419  60.000000
1   127       1352      658  34.590370
2  2149       2057      264  29.656122
3  2465       1155     1016  13.535845
4  2485       1917      790  15.913873

Load multiple cascades

The SMH toy dataset contains 20 cascades for testing out Cas.In. Let's load all of them:

cascades = []
for i in range(20):
    inf = casIn(cascade_path="../data/SMH/SMH-cascade-%d.csv" % i,time_decay=-0.000068)
    cascades.append(inf)
cascades = pd.concat(cascades)

Compute user influence in multiple cascades

The influence of a user is by definition the mean influence of the tweets they emit. We compute the user influence as follows:

result = cascades.groupby("user_id").agg({"influence" : "mean"})
result.sort_values("influence",ascending=False).head()
          influence
user_id
734      214.000000
1225     205.000000
755      190.554571
60       189.557461
581      141.033129
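
To make the definition explicit, here is a plain-Python sketch of the same aggregation: a user's influence is the mean influence of all the tweets they emit, across cascades. The `(user_id, influence)` pairs are illustrative values, not taken from the SMH dataset, and the helper name `mean_influence_per_user` is hypothetical.

```python
from collections import defaultdict

# Illustrative (user_id, tweet influence) pairs pooled from several cascades.
tweets = [(7, 10.0), (3, 2.0), (7, 4.0), (3, 1.0), (5, 6.0)]


def mean_influence_per_user(pairs):
    """Average the influence of all tweets emitted by each user."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, tweet_influence in pairs:
        totals[user_id] += tweet_influence
        counts[user_id] += 1
    return {user_id: totals[user_id] / counts[user_id] for user_id in totals}


user_influence = mean_influence_per_user(tweets)
# Rank users from most to least influential, as sort_values does above.
ranking = sorted(user_influence.items(), key=lambda item: item[1], reverse=True)
print(ranking)  # [(7, 7.0), (5, 6.0), (3, 1.5)]
```

Averaging rather than summing means that a user is not rewarded simply for tweeting often; a user with a single highly influential tweet can outrank a prolific one.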