Crawler is a simple multi-threaded web crawler that fetches URLs using BFS (breadth-first search) and writes crawl results to the console and a log file as the crawl proceeds. It starts with a given set of seed URLs and keeps crawling until the user presses Ctrl+C or the number of crawled pages reaches the specified count. This implementation only takes URLs from <a href> tags and only processes absolute links.
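The "absolute links from <a href> tags only" rule could be implemented as below. This is a hypothetical sketch, not the actual code from crawler.py; it uses only the standard library (`html.parser`, `urllib.parse`), and the class and function names are illustrative.

```python
# Hypothetical sketch: collect only absolute http(s) URLs found in the
# href attribute of <a> tags, which is all this crawler follows.
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags that are absolute http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Relative links (no scheme) are skipped, per the crawler's rules.
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

In a BFS crawl, the links returned here would be appended to the frontier queue, so pages are visited level by level from the seed URLs.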
Crawler accepts one positional argument and two optional arguments. Run 'python crawler.py -h' for details.
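A CLI like the one described above could be declared with `argparse` roughly as follows. This is a sketch, not the actual crawler.py source; the long option names are assumptions, but the `-c`/`-w` flags and the defaults (100 pages, 20 workers) match the usage examples below.

```python
# Hypothetical sketch of the crawler's command line: one positional
# argument (one or more seed URLs) and two optional arguments.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Simple multi-threaded BFS web crawler")
    parser.add_argument("urls", nargs="+",
                        help="one or more seed URLs to start crawling from")
    parser.add_argument("-c", "--count", type=int, default=100,
                        help="stop after crawling this many pages (default: 100)")
    parser.add_argument("-w", "--workers", type=int, default=20,
                        help="number of worker threads (default: 20)")
    return parser.parse_args(argv)
```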
- Run crawler with a single seed URL and 20 (default) worker threads. Stop after crawling 100 (default) URLs.
python crawler.py https://source.android.com/setup/start/build-numbers
- Run crawler with a single seed URL and 10 worker threads. Stop after crawling 150 URLs.
python crawler.py https://source.android.com/setup/start/build-numbers -c 150 -w 10
- Run crawler with two seed URLs and 30 worker threads. Stop after crawling 300 URLs.
python crawler.py https://source.android.com/setup/start/build-numbers https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors -c 300 -w 30
Press Ctrl+C at any time to stop the crawl before the page count is reached.
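The two stop conditions (Ctrl+C or reaching the page count) could be coordinated with a shared flag like the sketch below. This is illustrative, not the actual crawler.py implementation; the class and method names are assumptions.

```python
# Hypothetical sketch: a shared stop flag that worker threads poll.
# It is set either when the page limit is reached or when the main
# thread catches Ctrl+C (KeyboardInterrupt).
import threading


class CrawlState:
    """Tracks crawled-page count and signals workers to stop."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.lock = threading.Lock()
        self.stop = threading.Event()

    def record_page(self):
        # Called by a worker thread after each successfully crawled page.
        with self.lock:
            self.count += 1
            if self.count >= self.limit:
                self.stop.set()


def main_loop(state):
    try:
        # The main thread would normally wait here while workers run,
        # e.g. state.stop.wait() with a timeout in a loop.
        pass
    except KeyboardInterrupt:
        state.stop.set()  # Ctrl+C: tell all workers to wind down
```

Workers check `state.stop.is_set()` before dequeueing the next URL, so both stop conditions take effect within one iteration.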
Crawler outputs to both stdout and a log file. The output is formatted as below
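Writing each result to both stdout and a log file is typically done with two `logging` handlers sharing one formatter, as sketched below. This is an assumption about how crawler.py might do it, not its actual code; the log filename "crawler.log" and the format string are illustrative.

```python
# Hypothetical sketch: one logger, two handlers, so every crawl result
# appears on stdout and in the log file with the same formatting.
import logging
import sys


def make_logger(logfile="crawler.log"):
    logger = logging.getLogger("crawler")
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(sys.stdout),
                    logging.FileHandler(logfile)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

A worker would then emit one line per fetched page, e.g. `logger.info("crawled %s", url)`.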
Requirements: Python 3.6 or above (and pip) is installed and its location is in the system PATH
- Download crawler.py and buildcrawler.bat
- Run buildcrawler.bat to set up a virtualenv for the crawler: buildcrawler.bat
TBD
- Run activatecrawler.bat to activate the virtualenv (if it is not already active) - the virtualenv is activated automatically when buildcrawler.bat runs
- Run the crawler - see the examples in the Usage section
TBD