Skip to content

Latest commit

 

History

History
35 lines (24 loc) · 1.37 KB

README.md

File metadata and controls

35 lines (24 loc) · 1.37 KB

Python-Sitemap

Simple script to crawl websites and create a sitemap and assets output files

Goal: Write a web crawler without using existing crawl frameworks. Given a URL, the crawler only visit HTML pages within the same domain and not follow external links (e.g. Facebook, Twitter).

Crawler output a site map, and for each page a list of assets (e.g. CSS, Images, Javascripts) and links between pages.

Warning : This script only works with Python3.5+

Setup Instructions

  • $ pip install aiohttp bs4

Parameters

-output_sitemap default='sitemap.txt' Sitemap file name
-output_assets  default='assets.txt'  Assets file name
-log_file       default='error.log'   Error log file name
-workers        default=Сpu сount     Count of process for processing (N)
-threads        default=10            Count of treads per process (M)
-max_visited    default=1000          Max count of visited urls, stops when reached
-timeout        default=60            Timeout for waiting for response in sec

Samples usage