Using Machine Learning to Spot a Twitter Bot

This project is for the Elevate course SCI 498 - 410: Online Social Network Analysis, taught by Dr. Aron Culotta at the Illinois Institute of Technology.

Here is the poster for our project.

Group Members:

Deyuan Chen, Southern University of Science and Technology
Chunjiang Li, University of Posts and Telecommunications
Chenguang Tang, University of Posts and Telecommunications

Background

What is a Twitter Bot?

Twitter bots are automated user accounts that interact with Twitter using an application programming interface (API). They can automatically perform actions such as tweeting, re-tweeting, or direct messaging other accounts.

Many bots perform useful functions, such as tweeting about earthquakes in real time as part of a warning system.

However, bots are also misused, for example to violate user privacy, send spam, or spread fake news.

We therefore studied how to detect Twitter bots so that malicious activity can be prevented. The same approach can also be applied to other social media platforms.

How can we detect a bot?

We list some typical characteristics of bots on Twitter:

The account primarily retweets content, rather than tweeting original content.
The account’s tweet frequency is higher than a human user could feasibly achieve.
The account may have a high number of followers and also be following a lot of accounts; conversely, some bot accounts are identifiable because they send a lot of tweets but only have a few followers.
Many bots tweet the same content as other users at roughly the same time.
Short replies to other tweets can also indicate automated behavior.
There is often no biography, or indeed a photo, associated with bot Twitter accounts.
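
Several of the characteristics above can be turned into numeric signals. The following is a minimal sketch, not the project's actual feature code: the function name `bot_signals` is hypothetical, and the dictionary keys are assumptions modeled on Twitter API v1.1 user and tweet objects. Detecting retweets by the "RT @" prefix is also a simplification.

```python
from datetime import datetime, timezone


def bot_signals(user, tweets, now):
    """Compute simple numeric signals for one account.

    `user` and `tweets` are plain dicts. Their keys are assumptions modeled
    on Twitter API v1.1 objects, not this project's actual schema.
    """
    n = len(tweets)
    # Simplification: treat any tweet starting with "RT @" as a retweet.
    retweets = sum(1 for t in tweets if t["text"].startswith("RT @"))
    account_age_days = max((now - user["created_at"]).days, 1)
    return {
        # Share of the sampled tweets that are retweets.
        "retweet_ratio": retweets / n if n else 0.0,
        # Lifetime tweet rate; very high values are hard for a human to reach.
        "tweets_per_day": user["statuses_count"] / account_age_days,
        # Many tweets but few followers shows up as a low ratio here.
        "followers_per_friend": user["followers_count"] / max(user["friends_count"], 1),
        # Short replies lower the average length.
        "avg_tweet_length": sum(len(t["text"]) for t in tweets) / n if n else 0.0,
        "has_biography": bool(user.get("description")),
        "has_photo": not user.get("default_profile_image", True),
    }


# Made-up account for illustration only.
example_user = {
    "created_at": datetime(2019, 1, 1, tzinfo=timezone.utc),
    "statuses_count": 36500,
    "followers_count": 12,
    "friends_count": 400,
    "description": "",
    "default_profile_image": True,
}
example_tweets = [
    {"text": "RT @news: breaking story"},
    {"text": "RT @news: another story"},
    {"text": "RT @promo: win a prize"},
    {"text": "ok"},
]
signals = bot_signals(
    example_user,
    example_tweets,
    now=datetime(2019, 4, 11, tzinfo=timezone.utc),
)
print(signals)
```

The sketch does not cover the "same content at the same time" characteristic, which would require comparing tweets across accounts.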

Overview Diagram

Data and Methods

Our training data size:

Class	Count
Bot	8841
Human	4585
Total	13426
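
Because the two classes are not balanced, it can help to know how well a trivial "always predict the larger class" rule would do before reading any classifier scores. The short calculation below uses only the counts from the table; the variable names are illustrative.

```python
# Class counts taken from the table above.
counts = {"Bot": 8841, "Human": 4585}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the larger class is right this often.
majority_label = max(counts, key=counts.get)
majority_baseline_accuracy = shares[majority_label]

print(total, majority_label, f"{majority_baseline_accuracy:.1%}")
```

Bots make up roughly two thirds of the training data, so a useful classifier should score clearly above that level.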

We considered three classifiers:

First, we extracted a few features from our data and fit a Logistic Regression classifier.
Then we estimated its accuracy with cross-validation and compared it with two additional classifiers: Multi-layer Perceptron and Random Forest.
Finally, we chose Logistic Regression because it scored highest of the three on F1, precision, and recall (0.91 each).
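
The comparison procedure can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's training script: the data is synthetic (generated with `make_classification`), and the hyperparameters, the scaling step, and the 5-fold setting are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption, not project data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Multi-layer Perceptron": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=500, random_state=0)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy for each candidate.
mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")

best = max(mean_accuracy, key=mean_accuracy.get)
print("Best on this synthetic data:", best)
```

Which model wins on synthetic data says nothing about the real dataset; the point is only the shape of the comparison loop.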

We mainly used two types of features to classify bots:

One type is the attributes of Twitter users, such as the follower count and whether the account is verified.
The other type is based on text analysis of user tweets. We used CountVectorizer to extract frequently used trigrams.

Classifiers and Features

We compared three classifiers:

Classifier	F1	Precision	Recall
Logistic Regression	0.91	0.91	0.91
Multi-layer Perceptron	0.84	0.84	0.84
Random Forest	0.89	0.89	0.89
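
For readers unfamiliar with the three metrics, the sketch below computes them from scratch for a single positive class. The labels are invented for the example, and the function name is hypothetical. The table above does not state whether its scores are per class or averaged over classes, so this is only a reminder of the definitions.

```python
def precision_recall_f1(y_true, y_pred, positive="bot"):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Invented labels: 3 true positives, 1 false positive, 1 false negative.
y_true = ["bot", "bot", "bot", "human", "human", "bot"]
y_pred = ["bot", "bot", "human", "human", "bot", "bot"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```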

The most predictive features of bots were:

Most common trigram: bots tend to tweet the same content.
Default profile: indicates that the user has not altered the background of their profile. A high percentage of bots use the default profile.
Statuses count: the number of tweets (including retweets) issued by the user. Because many bots are used for high-volume activity such as spreading fake news, bots tend to have a higher statuses count.

And the most predictive features of humans were:

verified
followers_count
tweets_avg_mentions
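
One common way to obtain rankings like these from a Logistic Regression model is to sort its coefficients: large positive weights push toward one class and large negative weights toward the other. The sketch below shows the idea on synthetic data. The feature values are randomly generated, the label encoding (1 = bot, 0 = human) is an assumption, and this is not necessarily how the lists above were produced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = np.array(
    ["default_profile", "statuses_count", "verified", "followers_count"]
)

# Synthetic data (assumption): label 1 = bot, 0 = human.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, len(feature_names)))
X[:, 0] += 2.0 * y  # make default_profile higher for bots
X[:, 2] -= 2.0 * y  # make verified higher for humans

model = LogisticRegression(max_iter=1000).fit(X, y)

# Sort features from most human-leaning (most negative weight)
# to most bot-leaning (most positive weight).
order = np.argsort(model.coef_[0])
top_human = feature_names[order[0]]
top_bot = feature_names[order[-1]]

for idx in order:
    print(f"{feature_names[idx]:>16}: {model.coef_[0][idx]:+.2f}")
```

Coefficients are only comparable across features when the features are on similar scales, so real attributes such as follower counts would usually be standardized first.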

Choose Optimal Parameters

Notes:

ngram (min_n, max_n): an n-gram is a contiguous sequence of n items from a text. All n-grams with min_n <= n <= max_n are extracted.
min_df: when building the vocabulary, terms with a document frequency strictly lower than the given threshold are ignored. Correspondingly, the max_df parameter ignores terms with a document frequency strictly higher than its threshold.

Results

@everyword has tweeted every word of the English language. It started in 2007 and tweeted every thirty minutes until 2014.

@_grammar_ detects tweets with improper grammar and then posts corrections.

@NetflixBot tweets a steady stream of videos that are newly available to stream on Netflix in the United States.

Conclusions

We observed that many bots tweet the same content as other users.

What’s more, there is often no biography, or indeed a photo, associated with bot Twitter accounts.

Interestingly, we also found that Twitter has brought in more stringent policies regarding automation on the platform.

Limitations

The features we chose are limited, and many of them depend on the content of the tweets.

Also, the training dataset is not large enough, so the parameters we chose may not be optimal.

Therefore, we plan to explore more features and train on more data to improve the classifier.

Related Work

Kudugunta, S., & Ferrara, E. (2018). Deep neural networks for bot detection. Information Sciences, 467, 312-322.
Z. Chu, S. Gianvecchio, H. Wang and S. Jajodia, "Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg?," in IEEE Transactions on Dependable and Secure Computing, vol. 9, no. 6, pp. 811-824, Nov.-Dec. 2012.
