A simple Python scraper and parser for Twitter pages, built on Beautiful Soup 4 and Selenium WebDriver. The parser outputs JSON; see details below.
pip install -r requirements.txt
In run.py:
- Define a list of Twitter handles
- Set the earliest date the scraper should go back to
- Define verbosity and output paths
python run.py
For each Twitter handle, the parser outputs two JSON files:
handle-stats.json with the keys:
- name
- url
- bioText
- following
- frequency: tweets per day
- followers
- location
- favorites: average received favorites per tweet
- tweets: total number of tweets
- joinDate
- retweets: average received retweets per tweet
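The derived values in the stats file (frequency, favorites, retweets) are averages. The sketch below shows one way such averages could be computed from per-tweet counts. It is not taken from the scraper itself: the `summarize` helper, the sample data, and the choice to divide by the number of days in the scraped window are all assumptions made for illustration.

```python
from datetime import date


def summarize(tweets, start, end):
    """Hypothetical helper: average favorites/retweets per tweet and tweets per day.

    Assumes each tweet is a dict with integer "favorites" and "retweets" values,
    and that frequency is measured over the window from start to end.
    """
    days = max((end - start).days, 1)  # avoid dividing by zero for a same-day window
    count = len(tweets)
    if count == 0:
        return {"frequency": 0.0, "favorites": 0.0, "retweets": 0.0}
    return {
        "frequency": float(count) / days,
        "favorites": float(sum(t["favorites"] for t in tweets)) / count,
        "retweets": float(sum(t["retweets"] for t in tweets)) / count,
    }


# Made-up sample data, not real scraper output.
sample = [
    {"favorites": 4, "retweets": 2},
    {"favorites": 10, "retweets": 0},
    {"favorites": 1, "retweets": 1},
]
summary = summarize(sample, date(2016, 1, 1), date(2016, 1, 4))
print(summary)
```

The values in the actual stats file may be computed over a different window, for example since joinDate, so treat this only as a reading aid for the key descriptions.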
handle-tweets.json with an entry for each tweet containing the keys:
- origin: whether the entry is a tweet (T) or a retweet (RT)
- text: full text including hashtags, mentions and links
- hashtags: a list of used hashtags
- retweets: number of received retweets
- favorites: number of received favorites
- time: time of tweet
- mentions: list of @ mentions
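As a rough illustration of how the per-tweet keys can be combined, the sketch below counts the most used hashtags while skipping retweets. It assumes that origin holds the strings "T" or "RT" and that hashtags is a list of strings; the `top_hashtags` helper and the sample entries are hypothetical and not part of the scraper.

```python
from collections import Counter


def top_hashtags(tweets, n=3, include_retweets=False):
    """Hypothetical helper: return the n most common hashtags as (tag, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        # Skip retweets unless the caller explicitly asks for them.
        if tweet["origin"] == "RT" and not include_retweets:
            continue
        # Lowercase so that "Python" and "python" are counted together.
        counts.update(tag.lower() for tag in tweet["hashtags"])
    return counts.most_common(n)


# Made-up sample entries, not real scraper output.
sample = [
    {"origin": "T", "hashtags": ["python", "Scraping"]},
    {"origin": "RT", "hashtags": ["news"]},
    {"origin": "T", "hashtags": ["Python"]},
]
top = top_hashtags(sample)
print(top)
```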
To import the results as a Python dictionary use the jsonLoad function:
from tScrape import jsonLoad
results = jsonLoad(handle, path)
Use the standard dictionary methods to inspect the content:
results.keys()
results.values()
for key in results.keys():
    print(key, results[key])
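If importing tScrape is not convenient, the output files can presumably also be read with the standard json module. The sketch below is not the project's jsonLoad function. The `load_output` helper is hypothetical, and it assumes the files are named after the pattern handle-stats.json and handle-tweets.json with the actual handle substituted in.

```python
import json
import os
import tempfile


def load_output(handle, path, kind="stats"):
    """Hypothetical helper: load <handle>-<kind>.json from path as Python data."""
    filename = "{}-{}.json".format(handle, kind)
    with open(os.path.join(path, filename)) as f:
        return json.load(f)


# Round-trip demo using a temporary directory and made-up data.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "example-stats.json"), "w") as f:
    json.dump({"name": "Example", "followers": 10}, f)

stats = load_output("example", demo_dir)
print(sorted(stats.keys()))
```

Passing kind="tweets" would read the per-tweet file instead, under the same naming assumption.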