Deduping (Duplication Reduction)
Starting in release 1.12.0, a number of Processors can cooperate to carry forward URI content history between crawls (see org.archive.crawler.processor.recrawl package JavaDocs). This reduces the amount of duplicate material downloaded or stored in later crawls.
Heritrix 1.x does not support running the same crawl more than once, so one crawl must be configured to store duplication-reduction data and another crawl configured to load it, e.g. this excerpt from testing for HER-1627:
Storing crawl:
- add FetchHistory and PersistLog processors after FetchHTTP:
  - org.archive.crawler.processor.recrawl.FetchHistoryProcessor
  - org.archive.crawler.processor.recrawl.PersistLogProcessor

Loading crawl:
- add PersistLoad processor after PreconditionEnforcer, before FetchDNS:
  - org.archive.crawler.processor.recrawl.PersistLoadProcessor
- add FetchHistory processor after FetchHTTP:
  - org.archive.crawler.processor.recrawl.FetchHistoryProcessor
- preload-source: ${HERITRIX_HOME}/jobs/${JOB}/logs/persistlog.txtser.gz
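In H1 this wiring lives in the job's order.xml (normally edited through the web UI's Modules tab rather than by hand). A rough sketch of the storing-crawl fragment, assuming the usual order.xml `map`/`newObject` layout; the map name and surrounding structure here are from memory and only illustrative:

```xml
<!-- Sketch only: storing crawl adds the two recrawl processors
     immediately after FetchHTTP in the fetch chain.
     Map name and layout are assumptions, not copied from a real job. -->
<map name="fetch-processors">
  <newObject name="HTTP"
             class="org.archive.crawler.fetcher.FetchHTTP"/>
  <newObject name="FetchHistory"
             class="org.archive.crawler.processor.recrawl.FetchHistoryProcessor"/>
  <newObject name="PersistLog"
             class="org.archive.crawler.processor.recrawl.PersistLogProcessor"/>
</map>
```

The loading crawl would instead insert PersistLoadProcessor between PreconditionEnforcer and FetchDNS and point its preload-source at the earlier crawl's persistlog.txtser.gz, as listed above.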
Heritrix 3.x allows running the same crawl repeatedly, but it likewise requires one configuration for the crawl run that stores deduplication data and a different configuration for the crawl run that loads it, as described in Duplication Reduction Processors. The same model as H1 is followed, except that configuration is done in the Spring-world crawler beans CXML (crawler-beans.cxml).
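In H3 the equivalent processors live under the org.archive.modules.recrawl package and are declared as Spring beans in crawler-beans.cxml. A minimal sketch, assuming the stock H3 chain layout; the bean ids, the `preloadSource` property, and the path shown are illustrative assumptions, not a verbatim working job:

```xml
<!-- Sketch: recrawl processors in crawler-beans.cxml. Bean ids and the
     preloadSource path are illustrative; adapt to your job's chains. -->

<!-- Storing crawl: record per-URI fetch history and persist it. -->
<bean id="fetchHistory"
      class="org.archive.modules.recrawl.FetchHistoryProcessor"/>
<bean id="persistStore"
      class="org.archive.modules.recrawl.PersistStoreProcessor"/>

<!-- Loading crawl: preload history from the earlier crawl's state
     (preloadSource is assumed here; point it at the prior job's data). -->
<bean id="persistLoad"
      class="org.archive.modules.recrawl.PersistLoadProcessor">
  <property name="preloadSource" value="/path/to/prior-job/state"/>
</bean>

<!-- Wire the history processor into the fetch chain after fetchHttp,
     mirroring the H1 ordering described above. -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="preconditions"/>
      <ref bean="fetchDns"/>
      <ref bean="fetchHttp"/>
      <ref bean="fetchHistory"/>
    </list>
  </property>
</bean>
```

As with H1, the store and load sides cannot run in the same configuration: one job writes the persist data, and a separately configured job preloads it.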