Avoiding Too Much Dynamic Content
Suppose you only want to crawl pages from a particular host (http://www.foo.org/), and you also want to avoid crawling too many pages of a dynamically generated calendar. Assume the calendar is accessed by passing a year, month, and day to the calendar directory, for example a URI such as http://www.foo.org/calendar?year=2006&month=2&day=1.
When you create the crawl job you will specify a single seed: http://www.foo.org/. By default, your new crawl job will use the DecideRuleSequence, which contains a default set of DecideRules. One of the default rules is the SurtPrefixedDecideRule, which tells Heritrix to accept any URIs that match the seed URI's SURT prefix: http://(org,foo,www,)/. If the URI http://foo.org/ is encountered, it will be rejected, since its SURT prefix, http://(org,foo,), does not match the seed's SURT prefix. To allow both foo.org and www.foo.org to be captured, you could add two seeds: http://www.foo.org/ and http://foo.org/. To allow every subdomain of foo.org to be crawled, you can add the seed http://foo.org (note the absence of a trailing slash).
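For illustration, here is a minimal sketch of how the seed and the SurtPrefixedDecideRule might appear in crawler-beans.cxml. The bean ids and layout follow the default profile, but check them against your own job configuration.

```xml
<!-- Seed list: a single seed limits the SURT prefix to www.foo.org -->
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
          http://www.foo.org/
        </value>
      </property>
    </bean>
  </property>
</bean>

<!-- The default scope derives its accept decisions from the seeds' SURT prefixes -->
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <property name="seedsAsSurtPrefixes" value="true" />
</bean>
```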
Delete the TransclusionDecideRule, since this rule has the potential to lead Heritrix onto another host. For example, if a URI returns a 301 (moved permanently) or 302 (found) response code and redirects to a URI on a different host than the seeds, the TransclusionDecideRule would cause Heritrix to accept that URI. Removing this rule keeps Heritrix from straying off the www.foo.org host.
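In the default profile, the TransclusionDecideRule appears as one entry in the scope's rule list; removing (or commenting out) that entry disables it. A hedged sketch, assuming the default bean layout:

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- ... other default rules ... -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule" />
      <!-- TransclusionDecideRule removed so redirects cannot pull in other hosts:
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule" />
      -->
      <!-- ... remaining default rules ... -->
    </list>
  </property>
</bean>
```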
The PathologicalPathDecideRule and the TooManyPathSegmentsDecideRule allow Heritrix to avoid some types of crawler traps. The TooManyHopsDecideRule keeps Heritrix from straying too far from the seed, so the calendar won't trap Heritrix in an infinite loop. By default the maximum number of hops is 20, but this value can be changed by editing the crawler-beans.cxml file.
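For instance, the hop limit can be tightened where the rule is declared in the scope's rule list (a sketch; the default value is 20):

```xml
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <!-- reject URIs more than 10 link hops from the seed (default is 20) -->
  <property name="maxHops" value="10" />
</bean>
```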
Alternatively, you can add the MatchesFilePatternDecideRule. Set usePresetPattern to CUSTOM and set the regexp to something like: .*foo\.org(?!/calendar).*|.*foo\.org/calendar\?year=200[56].* This accepts any foo.org URI that is not under /calendar, plus calendar URIs for 2005 and 2006 only.
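As a rough sketch, the rule could be added to the scope like this. The property names (usePresetPattern, regexp) follow the description above, but they may differ between Heritrix versions, so verify them against the DecideRule's documented properties before relying on them:

```xml
<bean class="org.archive.modules.deciderules.MatchesFilePatternDecideRule">
  <!-- property names follow the text above; confirm them for your Heritrix version -->
  <property name="usePresetPattern" value="CUSTOM" />
  <property name="regexp"
            value=".*foo\.org(?!/calendar).*|.*foo\.org/calendar\?year=200[56].*" />
</bean>
```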