
What's the designed stop situation? & Question about the shelve #11

Open
Chin-I opened this issue Feb 2, 2016 · 1 comment
Chin-I commented Feb 2, 2016

Hello lordnahor. Thanks for sharing the crawler on GitHub.

Recently I tried using http://www.ics.uci.edu/ as the seed to crawl.
The first time, I crawled for 10 hours and got a Persistent.shelve of about 150 MB.
The second time, it stopped at 300 MB.
But I wonder: is there a designed "stop" condition, such as the frontier running out of urls, or did the crawler just stop accidentally?

One more question, to double-check: can I read "text" from the shelve file?
Because when I execute

import shelve
d = shelve.open("Persistent.shelve.db")
print "Persistent.shelve.db", d
What I get is just
'http://www.ics.uci.edu/grad/courses/listing.php': (True, 3), 'http://www.ics.uci.edu/prospective/ko/degrees/business-information-management': (False, 4), 'http://hombao.ics.uci.edu?s=opportunities': (False, 4), 'http://asterix.ics.uci.edu/talks.html': (False, 4),

Thanks for your answer.

rohan-achar (Member) commented

There are many reasons that the crawler can stop:
The default case is when the workers have not received a url from the frontier for config.FrontierTimeOut seconds (see the sketch below).
It can also be that the frontier has no more urls left to crawl (and the crawl is complete).

In any case, the crawler will stop with a dump of what is left in the Frontier. If that set was empty, then there were no urls left in the frontier to crawl. If that set was not empty, something else killed the crawler, and it can be resumed on restart.
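
Here is a minimal sketch of that default timeout stop, assuming a worker that pulls urls from a shared queue. The names are illustrative, not the project's actual code: FRONTIER_TIMEOUT stands in for config.FrontierTimeOut, and download is a stub for the real fetch step.

    from queue import Empty, Queue

    # Illustrative stand-in for config.FrontierTimeOut (seconds); the real
    # name and value live in the project's config.
    FRONTIER_TIMEOUT = 60

    def download(url):
        # Stub for the real fetch-and-extract step.
        print("fetching %s" % url)

    def worker(frontier):
        # Pull urls until the frontier has been silent for FRONTIER_TIMEOUT
        # seconds, then stop: either the crawl is complete or something stalled.
        while True:
            try:
                url = frontier.get(timeout=FRONTIER_TIMEOUT)
            except Empty:
                break
            download(url)

    frontier = Queue()
    frontier.put("http://www.ics.uci.edu/")
    worker(frontier)  # fetches one url, then stops FRONTIER_TIMEOUT seconds later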

Not sure what you mean by reading "text" from the shelve. The method you used to access the shelve file is the right way to do it.
The Persistent file is a dictionary of url : (downloaded, depth).
The urls that have downloaded set to False will be added to the Frontier on restart of the crawler (that is how the resume function works).
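
For example, here is a minimal sketch of listing the urls that the resume step would re-add. The filename matches the snippet above; the exact name on disk can vary with the dbm backend shelve picked.

    import shelve

    # Keys are urls; values are (downloaded, depth) tuples, as shown in
    # the output pasted above.
    d = shelve.open("Persistent.shelve.db")

    for url, (downloaded, depth) in d.items():
        if not downloaded:
            # Not downloaded yet: this url goes back into the Frontier on resume.
            print("%s (depth %d)" % (url, depth))

    d.close()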
