Crawler is a simple multi-threaded web crawler that fetches URLs using BFS (breadth-first search) and writes crawl results to the console and a log file as the crawl proceeds. It starts with a given set of seed URLs and keeps crawling until the user presses Ctrl+C or the number of crawled pages reaches the specified count. This implementation only takes URLs from <a href> tags and only processes absolute links.
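The "absolute links from <a href> tags only" rule could be implemented as below. This is a hypothetical sketch, not the actual code from crawler.py; it uses only the standard library (`html.parser`, `urllib.parse`), and the class and function names are illustrative.

```python
# Hypothetical sketch: collect only absolute http(s) URLs found in the
# href attribute of <a> tags, which is all this crawler follows.
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags that are absolute http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Relative links (no scheme) are skipped, per the crawler's rules.
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

In a BFS crawl, the links returned here would be appended to the frontier queue, so pages are visited level by level from the seed URLs.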
Crawler accepts one positional argument and two optional arguments. Run 'python crawler.py -h' for details.
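A CLI like the one described above could be declared with `argparse` roughly as follows. This is a sketch, not the actual crawler.py source; the long option names are assumptions, but the `-c`/`-w` flags and the defaults (100 pages, 20 workers) match the usage examples below.

```python
# Hypothetical sketch of the crawler's command line: one positional
# argument (one or more seed URLs) and two optional arguments.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Simple multi-threaded BFS web crawler")
    parser.add_argument("urls", nargs="+",
                        help="one or more seed URLs to start crawling from")
    parser.add_argument("-c", "--count", type=int, default=100,
                        help="stop after crawling this many pages (default: 100)")
    parser.add_argument("-w", "--workers", type=int, default=20,
                        help="number of worker threads (default: 20)")
    return parser.parse_args(argv)
```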
- Run crawler with a single seed URL and 20 (default) worker threads. Stop after crawling 100 (default) URLs.
python crawler.py https://source.android.com/setup/start/build-numbers
- Run crawler with a single seed URL and 10 worker threads. Stop after crawling 150 URLs.
python crawler.py https://source.android.com/setup/start/build-numbers -c 150 -w 10
- Run crawler with two seed URLs and 30 worker threads. Stop after crawling 300 URLs.
python crawler.py https://source.android.com/setup/start/build-numbers https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors -c 300 -w 30
Press Ctrl+C at any time to stop the crawl before the page count is reached.
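The two stop conditions (Ctrl+C or reaching the page count) could be coordinated with a shared flag like the sketch below. This is illustrative, not the actual crawler.py implementation; the class and method names are assumptions.

```python
# Hypothetical sketch: a shared stop flag that worker threads poll.
# It is set either when the page limit is reached or when the main
# thread catches Ctrl+C (KeyboardInterrupt).
import threading


class CrawlState:
    """Tracks crawled-page count and signals workers to stop."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.lock = threading.Lock()
        self.stop = threading.Event()

    def record_page(self):
        # Called by a worker thread after each successfully crawled page.
        with self.lock:
            self.count += 1
            if self.count >= self.limit:
                self.stop.set()


def main_loop(state):
    try:
        # The main thread would normally wait here while workers run,
        # e.g. state.stop.wait() with a timeout in a loop.
        pass
    except KeyboardInterrupt:
        state.stop.set()  # Ctrl+C: tell all workers to wind down
```

Workers check `state.stop.is_set()` before dequeueing the next URL, so both stop conditions take effect within one iteration.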
Crawler outputs to both stdout and a log file. The output is formatted as below
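Writing each result to both stdout and a log file is typically done with two `logging` handlers sharing one formatter, as sketched below. This is an assumption about how crawler.py might do it, not its actual code; the log filename "crawler.log" and the format string are illustrative.

```python
# Hypothetical sketch: one logger, two handlers, so every crawl result
# appears on stdout and in the log file with the same formatting.
import logging
import sys


def make_logger(logfile="crawler.log"):
    logger = logging.getLogger("crawler")
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(sys.stdout),
                    logging.FileHandler(logfile)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

A worker would then emit one line per fetched page, e.g. `logger.info("crawled %s", url)`.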
Requirements: Python 3.6 or above (and pip) is installed and its location is in the system PATH
- Download crawler.py and buildcrawler.bat
- Run buildcrawler.bat to set up a virtualenv for the crawler: buildcrawler.bat
TBD
- Run activatecrawler.bat to activate the virtualenv (if it is not already active) - the virtualenv is activated automatically when buildcrawler.bat runs
- Run the crawler - see the examples in the Usage section
TBD