No longer scraping past the first page. #1

Open
sparremberger opened this issue Feb 26, 2020 · 9 comments

Comments

@sparremberger

Hello my friend. I read about your tool on Medium and I must say it's very good; I've been in love with it. However, I came across a small problem that I couldn't solve on my own. When I scrape a single hashtag, it only scrapes the first 20 tweets, apparently because it can't fetch the next page. I'm using it on Windows 7, with Python properly set up along with all its dependencies. I suspect it might be due to some update on Twitter's end, but I'm not sure. Any help?
Thanks in advance.

@superryeti
Owner

superryeti commented Feb 27, 2020 via email

@sparremberger
Author

Sure thing. I was trying to scrape the hashtag #festabbb on Twitter, which is trending in my country. It has about 18k retweets as of now. I've tried different hashtags, but it still scrapes only the first 20 tweets, i.e. the first page.

@superryeti
Owner

I just cloned the repo and ran the crawler for festabbb. It seems to be pulling data without any issues.
It pulled 699 items (I had concurrent requests set to 5).

Did you increase the crawler speed? Please try lowering the settings.
Did you put #festabbb in the input? (Note that you don't need the # sign; just put festabbb.)

@sparremberger
Author

sparremberger commented Feb 27, 2020

I see. In that case, the problem must be on my end. And no, I didn't put the #, and I used all the default settings.
Are you using Linux? I'll try running Linux on a virtual machine and see if it works; Python is usually tricky on Windows. Thanks a lot!

@superryeti
Owner

superryeti commented Feb 27, 2020

Yes, I am on Linux. But that shouldn't be an issue.

I will test on my Windows machine when I am home and get back to you.

Could you try again with reduced concurrency?

I have noticed that Twitter sometimes limits results for the same hashtag when I run the crawler twice.

@sparremberger
Author

Yes, I just tried lowering concurrency to 4, and after that I tried increasing download delay from 3 to 300 (is it in milliseconds or seconds?)

By the way, I just tried setting ROBOTSTXT_OBEY to false, and surprisingly it seems to be working now! I'm not sure if I was supposed to set it to false since the beginning, but still.

@superryeti
Owner

superryeti commented Feb 27, 2020

The download delay is in seconds. Please set it to 0 or 1 (probably 0).

Yes. Setting ROBOTSTXT_OBEY to false is a great idea.
The crawler might be obeying some new rules in Twitter's robots.txt.
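
For anyone who wants to check this kind of thing themselves, here is a minimal sketch of how a robots.txt rule can block a crawler from a given URL, using Python's standard `urllib.robotparser`. The rules, the `mybot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and chosen only for illustration; they are not Twitter's actual robots.txt or this crawler's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Twitter's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A path under /search is disallowed by the sample rules above.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "mybot", "https://example.com/search?q=festabbb"))
    # Any other path is not covered by a Disallow rule.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "mybot", "https://example.com/home"))
```

If the pagination URLs fall under a disallowed path, a crawler that obeys robots.txt would fetch the first page and then stop, which would match the symptom described in this thread.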

Please let me know the crawl result.

Thanks

@sparremberger
Author

Just finished crawling over that hashtag once again, with 2 for download delay and 5 for concurrent requests. I'm trying to be gentle with Twitter servers so that they won't get mad and ban my IP.

It stopped after 621 tweets. That's way better than the 20 tweets I was getting previously, but still very far from the 20k tweets there seem to be. Now that I know for sure that it works, I think I'll just keep adjusting the settings until I find the sweet spot, and I'll let you know.

@superryeti
Owner

From what I have experienced, the most data I could pull from Twitter for a single hashtag was around 5k.

It turns out that if you browse Twitter manually and keep scrolling, it stops showing tweets after a certain number (which seems to differ by hashtag). I verified this manually.

So you could use a large number of related hashtags, or crawl frequently, say every few days. If the tag is popular and gets, say, 1000 tweets per day, you can end up with 10-20k in a week.
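
Repeated crawls and related hashtags will return overlapping tweets, so the results need to be merged without duplicates. A rough sketch of that step follows; the `merge_crawls` helper and the `tweet_id` field name are assumptions made for illustration and may not match the fields this crawler actually outputs.

```python
from typing import Dict, Iterable, List


def merge_crawls(*crawls: Iterable[Dict], key: str = "tweet_id") -> List[Dict]:
    """Merge several crawl results, keeping the first copy of each tweet.

    `key` is the (assumed) name of the field that uniquely identifies a tweet.
    Items that lack that field are skipped, since they cannot be de-duplicated.
    """
    seen = set()
    merged = []
    for crawl in crawls:
        for item in crawl:
            item_id = item.get(key)
            if item_id is None or item_id in seen:
                continue
            seen.add(item_id)
            merged.append(item)
    return merged


if __name__ == "__main__":
    # Made-up sample data standing in for two crawls of the same hashtag.
    monday = [{"tweet_id": "1", "text": "a"}, {"tweet_id": "2", "text": "b"}]
    thursday = [{"tweet_id": "2", "text": "b"}, {"tweet_id": "3", "text": "c"}]
    print(len(merge_crawls(monday, thursday)))
```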

The best settings I found are concurrency=2 and download_delay=0.
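
Put together with the ROBOTSTXT_OBEY change from earlier in the thread, the configuration might look roughly like the fragment below. It assumes the project uses Scrapy's standard setting names (`CONCURRENT_REQUESTS`, `DOWNLOAD_DELAY`) in its settings file; only `ROBOTSTXT_OBEY` is named explicitly in this thread, so check the repository's own settings before copying it.

```python
# Sketch of a Scrapy settings fragment using the values discussed in this thread.
# CONCURRENT_REQUESTS and DOWNLOAD_DELAY are Scrapy's standard setting names;
# that this project reads them under these names is an assumption.

ROBOTSTXT_OBEY = False   # do not filter requests against robots.txt
CONCURRENT_REQUESTS = 2  # number of requests in flight at once
DOWNLOAD_DELAY = 0       # delay between requests, in seconds
```

A higher DOWNLOAD_DELAY (such as the 2 seconds tried above) is gentler on the server at the cost of a slower crawl.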
