
Heritrix Installation

Alex Osborne edited this page May 8, 2020 · 7 revisions

A binary distribution of Heritrix can be downloaded from http://builds.archive.org/maven2/org/archive/heritrix/heritrix/. At the time of writing, the latest build is a 3.4.0 snapshot, and a direct link to the download is http://builds.archive.org/maven2/org/archive/heritrix/heritrix/3.4.0-SNAPSHOT/heritrix-3.4.0-20190828.200101-25-dist.tar.gz

Once downloaded, expand the archive. On Linux, macOS, and other Unix-like systems with tar installed, this command works:

tar -xzf heritrix-3.4.0-20190828.200101-25-dist.tar.gz
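If the download may be incomplete, or you want the files somewhere other than the current directory, the extraction step can be wrapped in a small shell function. This is only a sketch, assuming a POSIX-style shell and a tar that supports the -z and -C options (GNU tar and BSD tar both do). The function name extract_heritrix and its optional destination argument are hypothetical and are not part of the Heritrix distribution.

```shell
# Hypothetical helper, not shipped with Heritrix.
# Usage: extract_heritrix ARCHIVE [DEST_DIR]
extract_heritrix() {
    archive=$1
    dest=${2:-.}   # default to the current directory

    # Fail early if the archive was never downloaded.
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi

    # Listing the contents checks that the file is a readable
    # gzip-compressed tarball before anything is written to disk.
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "not a valid .tar.gz archive: $archive" >&2
        return 1
    fi

    # Create the destination if needed, then unpack into it.
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}
```

For example, `extract_heritrix heritrix-3.4.0-20190828.200101-25-dist.tar.gz` unpacks into the current directory, the same as the plain tar command, while a second argument unpacks elsewhere.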

You'll end up with a heritrix-{VERSION} directory. It contains the following subdirectories:

  • bin - contains shell scripts/batch files for launching Heritrix.
  • lib - contains the third-party .jar files the Heritrix application requires to run.
  • conf - contains various configuration files (such as the configuration for Java logging, and pristine versions of the bundled profiles).
  • jobs - contains bundled crawl profiles (collections of settings), and is the default location where operator-created jobs are stored.
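
To confirm that the archive unpacked completely, the four subdirectories listed above can be checked with a short shell function. This is a minimal sketch, assuming a POSIX-style shell; the function name check_heritrix_layout is hypothetical and is not provided by Heritrix.

```shell
# Hypothetical helper, not shipped with Heritrix.
# Usage: check_heritrix_layout HERITRIX_DIR
# Prints one "missing: NAME" line per absent subdirectory and
# returns non-zero if any of them is absent.
check_heritrix_layout() {
    root=$1
    status=0
    for sub in bin lib conf jobs; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $sub"
            status=1
        fi
    done
    return $status
}
```

Pass it the path of the heritrix-{VERSION} directory. It prints nothing and returns zero when all four subdirectories are present.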
