No longer scraping past the first page. #1
Could you please share the hashtag you were trying to scrape?
I will try to reproduce this on my computer.
Thanks
Sure thing. I was trying to scrape the hashtag #festabbb on Twitter, which is trending in my country. It has about 18k retweets as of now. I've tried different hashtags, but it still scrapes only the first 20 tweets, i.e. the first page.
I just cloned the repo and ran the crawler for that hashtag. Did you include the # in the query? Did you increase the crawler speed? Please try lowering the settings.
I see. In that case, the problem must be on my end. And no, I didn't put the #, and I used all the default settings.
Yes, I am on Linux, but that shouldn't be an issue. I will test on my Windows machine when I am home and get back to you. Could you try again with reduced concurrency? I have noticed that sometimes Twitter limits results for the same hashtag when I run the crawler twice.
Yes, I just tried lowering concurrency to 4, and after that I tried increasing the download delay from 3 to 300 (is it in milliseconds or seconds?). By the way, I just tried setting ROBOTSTXT_OBEY to false, and surprisingly it seems to be working now! I'm not sure if I was supposed to set it to false from the beginning, but still.
Download delay is in seconds. Please set it to 0 or 1 (probably 0). Yes, setting ROBOTSTXT_OBEY to false is a great idea. Please let me know the crawl result. Thanks.
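For reference, a minimal sketch of how those settings would look in a standard Scrapy settings.py (the file name and the surrounding defaults are assumptions on my part; the project may organize them differently):

```python
# settings.py (hypothetical excerpt; adjust to the project's actual settings file)

# Ignore robots.txt. Twitter's robots.txt disallows the search pages,
# which is why obeying it can stop the crawl after the first page.
ROBOTSTXT_OBEY = False

# Delay between requests, in seconds (not milliseconds).
DOWNLOAD_DELAY = 0

# Number of requests Scrapy keeps in flight at the same time.
CONCURRENT_REQUESTS = 2
```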
Just finished crawling that hashtag once again, with 2 for download delay and 5 for concurrent requests. I'm trying to be gentle with Twitter's servers so that they won't get mad and ban my IP. It stopped after 621 tweets. That's way better than the 20 tweets I was getting previously, but still very far from the 20k tweets there seem to be. I think I'll just mess with the settings until I find the sweet spot, now that I know for sure it works, and I'll let you know.
From what I have experienced, the most data I could pull from Twitter for a single hashtag was around 5k. It turns out that if you browse Twitter manually and keep scrolling, they stop showing tweets after a certain number (which seems to differ by hashtag). I verified this manually, so you could use a large number of related hashtags, or crawl again every few days. If the tag is popular, say it gets 1,000 tweets per day, then you can end up with 10-20k in a week. The best settings I found are concurrency=2 and download_delay=0.
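If it helps, here is a rough sketch of queuing several related hashtags in one run with Scrapy's CrawlerProcess. The spider name ("tweets") and the "query" argument are guesses about this project's spider, not something confirmed in this thread, so swap in whatever the actual spider expects:

```python
# Run this from inside the Scrapy project directory so get_project_settings()
# picks up the project's settings (ROBOTSTXT_OBEY, DOWNLOAD_DELAY, etc.).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# #festabbb comes from this thread; the other tags are hypothetical examples.
hashtags = ["festabbb", "festa", "bbb"]

process = CrawlerProcess(get_project_settings())
for tag in hashtags:
    # "tweets" and the "query" keyword are assumptions about the spider's
    # registered name and constructor argument; replace them with the real ones.
    process.crawl("tweets", query=tag)
process.start()  # blocks until all queued crawls finish
```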
Hello my friend. I read about your tool on Medium and I must say it's very good; I've been in love with it. However, I came across a small problem which I just couldn't solve on my own. After trying to scrape a single hashtag, it's only scraping the first 20 tweets, apparently because it's not able to fetch the next page. I'm using it on Windows 7, with Python properly set up and all its dependencies installed. I suspect it might be due to some update on Twitter's end, but I'm not sure. Any help?
Thanks in advance.