SpiderStack is a simple command-line crawler based on the powerful Crawlee library.
The philosophy behind SpiderStack is to provide a simple and intuitive crawler that can be easily integrated into your workflow.
To install SpiderStack locally on your system (Linux/macOS), run the following commands:
- Clone repository:
git clone https://github.com/slowpoison/SpiderStack.git
- Navigate to project directory:
cd SpiderStack
- Install dependencies:
pnpm install
- Run the crawler against a starting URL:
pnpm run crawl <url>
SpiderStack provides a flexible command-line interface with various configuration options:
USAGE
  pnpm run crawl <startUrl> [options]
ARGUMENTS
  startUrl             Starting URL to crawl
OPTIONS
  -d, --max-depth      Maximum crawl depth (default: 3)
  -p, --max-pages      Maximum number of pages to crawl (default: 100)
  -c, --concurrency    Number of concurrent requests (default: 10)
  -t, --timeout        Navigation timeout in seconds (default: 30)
  -w, --wait-until     When to consider navigation finished: domcontentloaded, load, networkidle (default: domcontentloaded)
  -o, --output         Output dataset name (default: crawler-results)
  --follow-external    Follow links to external domains (default: false)
  --headless           Run browser in headless mode (default: true)
  --proxy              Proxy URL to use
  --user-agent         Custom user agent string
  -v, --verbose        Enable verbose logging
  -h, --help           Display help for command
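As an illustration of how the documented flags and defaults fit together, here is a minimal sketch of an argument parser. It is not SpiderStack's actual implementation: the `CrawlOptions` shape, the `parseCrawlArgs` function, and the validation rules are assumptions made for this example, and `-h, --help` is left out. The documentation does not say how `--headless` is switched off, so the sketch only handles the flag being present.

```typescript
// Hypothetical option shape: field names mirror the documented flags,
// not SpiderStack's actual source.
type WaitUntil = "domcontentloaded" | "load" | "networkidle";

interface CrawlOptions {
  startUrl: string;
  maxDepth: number;
  maxPages: number;
  concurrency: number;
  timeout: number; // seconds
  waitUntil: WaitUntil;
  output: string;
  followExternal: boolean;
  headless: boolean;
  verbose: boolean;
  proxy?: string;
  userAgent?: string;
}

// Defaults copied from the documented OPTIONS list.
const DEFAULTS: Omit<CrawlOptions, "startUrl"> = {
  maxDepth: 3,
  maxPages: 100,
  concurrency: 10,
  timeout: 30,
  waitUntil: "domcontentloaded",
  output: "crawler-results",
  followExternal: false,
  headless: true,
  verbose: false,
};

type NumericKey = "maxDepth" | "maxPages" | "concurrency" | "timeout";
type StringKey = "output" | "proxy" | "userAgent";

// Flags that take a numeric value.
const NUMERIC_FLAGS = new Map<string, NumericKey>([
  ["-d", "maxDepth"], ["--max-depth", "maxDepth"],
  ["-p", "maxPages"], ["--max-pages", "maxPages"],
  ["-c", "concurrency"], ["--concurrency", "concurrency"],
  ["-t", "timeout"], ["--timeout", "timeout"],
]);

// Flags that take a free-form string value.
const STRING_FLAGS = new Map<string, StringKey>([
  ["-o", "output"], ["--output", "output"],
  ["--proxy", "proxy"],
  ["--user-agent", "userAgent"],
]);

const WAIT_VALUES: WaitUntil[] = ["domcontentloaded", "load", "networkidle"];

function parseCrawlArgs(argv: string[]): CrawlOptions {
  const opts = { ...DEFAULTS };
  const queue = [...argv];
  let startUrl: string | undefined;

  while (queue.length > 0) {
    const arg = queue.shift() as string;
    const numericKey = NUMERIC_FLAGS.get(arg);
    const stringKey = STRING_FLAGS.get(arg);

    if (numericKey) {
      // A missing value becomes NaN and is rejected here.
      const value = Number(queue.shift());
      if (!Number.isInteger(value) || value < 0) {
        throw new Error(`${arg} expects a non-negative integer`);
      }
      opts[numericKey] = value;
    } else if (stringKey) {
      const value = queue.shift();
      if (value === undefined) throw new Error(`${arg} expects a value`);
      opts[stringKey] = value;
    } else if (arg === "-w" || arg === "--wait-until") {
      const value = queue.shift();
      const match = WAIT_VALUES.find((w) => w === value);
      if (!match) throw new Error(`${arg} expects one of: ${WAIT_VALUES.join(", ")}`);
      opts.waitUntil = match;
    } else if (arg === "--follow-external") {
      opts.followExternal = true;
    } else if (arg === "--headless") {
      opts.headless = true;
    } else if (arg === "-v" || arg === "--verbose") {
      opts.verbose = true;
    } else if (arg.startsWith("-")) {
      throw new Error(`Unknown option: ${arg}`);
    } else {
      // Anything that is not a flag is treated as the start URL.
      startUrl = arg;
    }
  }

  if (startUrl === undefined) throw new Error("startUrl is required");
  return { ...opts, startUrl };
}
```

For instance, `parseCrawlArgs(["https://example.com", "-d", "5"])` would yield a `maxDepth` of 5 while every other option keeps its documented default.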
EXAMPLES
  pnpm run crawl https://example.com
  pnpm run crawl https://example.com -d 5 -p 200
  pnpm run crawl https://example.com -c 20 -t 60
  pnpm run crawl https://example.com -o my-crawl-results
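The `--max-depth` and `--follow-external` options together decide which discovered links get crawled. The sketch below shows one way such a rule could be written. It is an assumption-laden illustration, not SpiderStack's code: `LinkPolicy` and `shouldFollow` are hypothetical names, the start URL is assumed to sit at depth 0, and "external" is simplified to mean "different hostname".

```typescript
// Hypothetical policy object built from --max-depth and --follow-external.
interface LinkPolicy {
  maxDepth: number;
  followExternal: boolean;
}

// Decide whether a link found on a page should be crawled.
// `depth` is the depth the link would be crawled at (start URL = 0, an assumption).
function shouldFollow(
  link: string,
  startUrl: string,
  depth: number,
  policy: LinkPolicy,
): boolean {
  if (depth > policy.maxDepth) return false;

  let target: URL;
  try {
    // Resolve relative links such as "/about" against the start URL.
    target = new URL(link, startUrl);
  } catch {
    return false; // unparseable link
  }

  // Skip mailto:, javascript:, and other non-web schemes.
  if (target.protocol !== "http:" && target.protocol !== "https:") return false;

  // Simplification: a link is "external" when its hostname differs from the start URL's.
  const sameHost = target.hostname === new URL(startUrl).hostname;
  return policy.followExternal || sameHost;
}
```

With the default policy (`maxDepth: 3`, `followExternal: false`), a relative link like `/about` found at depth 1 would be followed, while a link to another hostname or any link at depth 4 would be skipped.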
To contribute to SpiderStack:
- Fork the project repository on GitHub.
- Write tests where applicable and ensure your changes don't break existing functionality (see testing guide).
- Create a comprehensive commit message that includes what you did, why it was necessary, and how to revert if needed.
- Submit a pull request to our repository for review. If you would like to discuss a change first, open an issue on GitHub.
Run tests using the following command:
pnpm run test
SpiderStack is open-sourced under the MIT License. The complete license text can be found in the LICENSE file at the root of this project.
SpiderStack is built on Crawlee and written in TypeScript.
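To make the relationship to Crawlee concrete, the sketch below shows how a few of the CLI options might translate into Crawlee crawler options. This is an assumption, not a description of SpiderStack's source: the `CliLimits` and `CrawlerLimits` interfaces and the `toCrawlerLimits` function are hypothetical, and the use of a browser-based crawler such as `PlaywrightCrawler` is only inferred from the `--wait-until` and `--headless` flags. The target field names (`maxRequestsPerCrawl`, `maxConcurrency`, `navigationTimeoutSecs`, `headless`) are existing Crawlee options.

```typescript
// Subset of the documented CLI options (hypothetical shape).
interface CliLimits {
  maxPages: number;    // --max-pages
  concurrency: number; // --concurrency
  timeout: number;     // --timeout, in seconds
  headless: boolean;   // --headless
}

// Plain-object subset of the options accepted by Crawlee's browser crawlers.
interface CrawlerLimits {
  maxRequestsPerCrawl: number;
  maxConcurrency: number;
  navigationTimeoutSecs: number;
  headless: boolean;
}

// Translate CLI values into Crawlee option names (assumed mapping).
function toCrawlerLimits(cli: CliLimits): CrawlerLimits {
  return {
    maxRequestsPerCrawl: cli.maxPages,
    maxConcurrency: cli.concurrency,
    navigationTimeoutSecs: cli.timeout,
    headless: cli.headless,
  };
}
```

Under these assumptions, the resulting object could be spread into a crawler constructor, for example `new PlaywrightCrawler({ ...toCrawlerLimits(cli), requestHandler })`.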