crawling JavaScript
Heritrix collects script files that pages reference (for example, through <SCRIPT> elements). However, its discovery of URIs inside JavaScript code is limited.
Heritrix does not currently include a browser-equivalent JavaScript interpreter or page (DOM) model. Instead, it scans script code for strings that appear likely to be absolute or relative URIs, and treats these the same as other discovered outlinks. In many cases this finds valuable content; in others it causes requests for URIs that do not exist on the site. Usually this results in harmless 'not found' responses, but it can inconvenience sites, so this 'speculative' crawling of possible URIs can be turned off. This simple scanning will not find URIs that scripts compose dynamically from parts.
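The kind of string scanning described above can be illustrated with a short sketch. This is not Heritrix's actual extractor code: the function name `speculative_uris`, both regular expressions, and the example URLs are assumptions made for illustration only. The sketch pulls quoted string literals out of script source, keeps those that look like absolute or relative URIs, and resolves them against the page's URI.

```python
import re
from urllib.parse import urljoin

# Quoted string literals in script source (no embedded whitespace or quotes).
STRING_LITERAL = re.compile(r"""(["'])([^"'\s]+)\1""")

# Hypothetical heuristic for "looks like a URI": an http(s) URL,
# something containing a slash, or a name ending in a short extension.
LIKELY_URI = re.compile(
    r"^(https?://\S+"
    r"|[\w.~%-]*/[\w./~%?=&#-]*"
    r"|[\w~%-]+\.[A-Za-z]{2,5}(\?\S*)?)$"
)

def speculative_uris(script_source, base_uri):
    """Return candidate URIs found in string literals, resolved against base_uri."""
    found = []
    for match in STRING_LITERAL.finditer(script_source):
        candidate = match.group(2)
        if LIKELY_URI.match(candidate):
            found.append(urljoin(base_uri, candidate))
    return found

if __name__ == "__main__":
    sample = '''
    var logo = "/images/logo.png";
    var ratio = "3/4";
    var next = "page" + n + ".html";
    '''
    for uri in speculative_uris(sample, "http://example.com/dir/index.html"):
        print(uri)
```

In the sample, the literal `"/images/logo.png"` is a genuine link, `"3/4"` is a false positive that would produce a request for a page that does not exist, and the URI assembled from `"page" + n + ".html"` is missed entirely, because neither fragment looks like a URI on its own.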
Heritrix may better simulate browsers to discover more links in the future. Two previous student 'Google Summer of Code' projects suggest possible approaches. The Browser Monkeys project remote-controls a Firefox instance to observe its pattern of follow-up requests, and thus gains the benefits of Firefox's own JavaScript/DOM implementation. The JavaScript Cloaking Detection project embedded the open-source Rhino JavaScript engine and the Lobobrowser mock browser into Heritrix to discover links created by predictable JavaScript actions. Neither of these techniques is yet integrated into an official Heritrix release or scheduled for a specific future release.
Even with better JavaScript link discovery, highly dynamic sites that involve repeated user-specific browser-to-server interaction (such as after a user login, or with 'AJAX' background operations that refresh only part of a page) will not be meaningfully harvestable with current automated techniques.