A Python script that scans a web page of a given URL and validates the links on it.
If the page has a link to the same hostname as the URL given by the user, its destination page also scanned. It works like a web crawler, so potentially the whole site will be scanned. Be aware that this may take hours.
All broken links found are saved on broken-links-[date]-[time]-[random-ID].txt.
Chrome 75 (put the proper driver in drivers folder if you have another version)
Python 3 (it has been tested on Python 3.7)
Some additional Python modules (check the script)
The script scans pages using Selenium to account for links that may be injected via JavaScript.
Then, each found link is validate with Requests in a concurrent (multi-thread) fashion.
If you don't care about rendering JavaScript, the find_broken_links_req.py script doesn't use Selenium and thus is slightly faster.
A single-thread script is provided for benchmarking.
- Better exception handling
- Validate links in other tags besides
<href a=...>
(like<img src=...>
) - Timeout
- Link depth limit
- Proxy support
- NTLM support
- Selenium driver parallelization