Deduping (Duplication Reduction)


Starting in release 1.12.0, a number of processors can cooperate to carry forward URI content history between crawls (see the org.archive.crawler.processor.recrawl package JavaDocs). This reduces the amount of duplicate material downloaded or stored in later crawls.

H1 dedupe configuration

Heritrix 1.x does not support running the same crawl job more than once, so one crawl must be configured to store duplication reduction data and a second crawl must be configured to load it. The following excerpt comes from testing for HER-1627:

configure persist STORE crawl

  • add FetchHistory and PersistLog processors after FetchHTTP (see the sketch below):
    org.archive.crawler.processor.recrawl.FetchHistoryProcessor
    org.archive.crawler.processor.recrawl.PersistLogProcessor
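
A minimal order.xml fragment for the STORE side might look like the following. This is an illustrative sketch, assuming the stock H1 fetch-processors map and newObject syntax; surrounding entries are abbreviated.

    <!-- order.xml (STORE crawl): only the relevant entries shown -->
    <map name="fetch-processors">
      <newObject name="DNS" class="org.archive.crawler.fetcher.FetchDNS"/>
      <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP"/>
      <!-- record per-URI fetch history (content digest, status, timestamps) -->
      <newObject name="FetchHistory"
          class="org.archive.crawler.processor.recrawl.FetchHistoryProcessor"/>
      <!-- persist that history to logs/persistlog.txtser.gz as the crawl runs -->
      <newObject name="PersistLog"
          class="org.archive.crawler.processor.recrawl.PersistLogProcessor"/>
    </map>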

configure persist LOAD crawl

  • after PreconditionEnforcer, before FetchDNS:
    org.archive.crawler.processor.recrawl.PersistLoadProcessor
  • after FetchHTTP:
    org.archive.crawler.processor.recrawl.FetchHistoryProcessor
  • preload-source, pointing at the STORE crawl's persist log (see the sketch below):
    ${HERITRIX_HOME}/jobs/${JOB}/logs/persistlog.txtser.gz
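
The LOAD side can be sketched the same way; again this is illustrative, assuming the stock pre-fetch-processors and fetch-processors maps. PersistLoadProcessor sits between PreconditionEnforcer and FetchDNS, and its preload-source points at the persist log written by the STORE crawl.

    <!-- order.xml (LOAD crawl): only the relevant entries shown -->
    <map name="pre-fetch-processors">
      <newObject name="Preselector" class="org.archive.crawler.prefetch.Preselector"/>
      <newObject name="Preprocessor" class="org.archive.crawler.prefetch.PreconditionEnforcer"/>
      <!-- load prior URI history before any fetching happens -->
      <newObject name="PersistLoad"
          class="org.archive.crawler.processor.recrawl.PersistLoadProcessor">
        <string name="preload-source">${HERITRIX_HOME}/jobs/${JOB}/logs/persistlog.txtser.gz</string>
      </newObject>
    </map>
    <map name="fetch-processors">
      <newObject name="DNS" class="org.archive.crawler.fetcher.FetchDNS"/>
      <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP"/>
      <!-- compare each fresh fetch against the loaded history to flag duplicates -->
      <newObject name="FetchHistory"
          class="org.archive.crawler.processor.recrawl.FetchHistoryProcessor"/>
    </map>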

H3 dedupe configuration

Heritrix 3.x allows running the same crawl job repeatedly, but still requires one configuration for the crawl run that stores deduplication data and another for the run that loads it, as described in Duplication Reduction Processors. The model is the same as for H1, except that configuration is done in the Spring-based crawler beans CXML file (crawler-beans.cxml), as sketched below.
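
A minimal crawler-beans.cxml sketch for the LOAD run follows. The bean ids (preselector, preconditions, fetchDns, fetchHttp, warcWriter, candidates, disposition) are the stock H3 names; the org.archive.modules.recrawl class names come from the H3 module tree, and the preloadSource value is illustrative. A STORE-only run would omit the persistLoadProcessor reference.

    <!-- crawler-beans.cxml: deduplication processor beans -->
    <bean id="persistLoadProcessor" class="org.archive.modules.recrawl.PersistLoadProcessor">
      <!-- illustrative path to the previous run's persisted history -->
      <property name="preloadSource" value="/path/to/previous-job/state"/>
    </bean>
    <bean id="fetchHistoryProcessor" class="org.archive.modules.recrawl.FetchHistoryProcessor"/>
    <bean id="persistStoreProcessor" class="org.archive.modules.recrawl.PersistStoreProcessor"/>

    <!-- fetch chain: load history before fetching, record it right after fetchHttp -->
    <bean id="fetchProcessors" class="org.archive.modules.FetchChain">
      <property name="processors">
        <list>
          <ref bean="preselector"/>
          <ref bean="preconditions"/>
          <ref bean="persistLoadProcessor"/>
          <ref bean="fetchDns"/>
          <ref bean="fetchHttp"/>
          <ref bean="fetchHistoryProcessor"/>
          <ref bean="extractorHttp"/>
          <ref bean="extractorHtml"/>
        </list>
      </property>
    </bean>

    <!-- disposition chain: persist the updated history after writing WARCs -->
    <bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
      <property name="processors">
        <list>
          <ref bean="warcWriter"/>
          <ref bean="persistStoreProcessor"/>
          <ref bean="candidates"/>
          <ref bean="disposition"/>
        </list>
      </property>
    </bean>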
