Configuring Jobs and Profiles
Creating a crawl job (see Creating a Job) or a profile (see Creating a Profile) is the first step in using Heritrix to crawl the Web. Configuring it is the second, more involved step. This section applies equally to crawl jobs and profiles.
Note
- To edit a running crawl, see Editing a Running Job.
Configuring a job or profile involves editing the crawler-beans.cxml file. This file is a Spring configuration file. The Spring framework is used to define the properties of jobs and profiles. Each job is defined by Spring "beans" that hold configuration data for the job. This section covers configuring common properties of a crawl job or profile and describes the various sections of the crawler-beans.cxml file.
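As a rough illustration of what such a bean looks like, the fragment below sketches a bean that holds crawl metadata. The id `metadata` matches the override keys used later on this page. The class name and the `autowire` attribute are what the default Heritrix 3 profile uses, as far as we know; treat them as assumptions and check them against your own crawler-beans.cxml. The property values are placeholders.

```xml
<!-- Sketch only: class name and autowire setting are assumed from the default profile -->
<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
  <!-- URL where site owners can find out who is operating the crawl -->
  <property name="operatorContactUrl" value="http://www.archive.org"/>
  <!-- Short name and free-text description of the job -->
  <property name="jobName" value="basic"/>
  <property name="description" value="Basic crawl starting with useful defaults"/>
</bean>
```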
The first section of the crawler-beans.cxml file allows the operator to override any simple bean property, such as metadata.operatorContactUrl. For example, you could set Heritrix to ignore cookies with the following configuration override.
Override Ignore Cookies Property
<!-- overrides from a text property list -->
<bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
# This Properties map is specified in the Java 'property list' text format
# http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29
# Every key has the form beanId.propertyName
metadata.operatorContactUrl=http://www.archive.org
metadata.jobName=basic
metadata.description=Basic crawl starting with useful defaults
# Assumes the HTTP fetcher bean has the id fetchHttp, as in the default profile
fetchHttp.ignoreCookies=true
##..more?..##
    </value>
  </property>
</bean>
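Override keys take the form beanId.propertyName, so metadata.jobName sets the jobName property of the bean whose id is metadata. The sketch below shows what an override key such as fetchHttp.ignoreCookies would correspond to if the property were set directly on the bean instead. It assumes the HTTP fetcher bean has the id `fetchHttp` and the class `org.archive.modules.fetcher.FetchHTTP`, as in the default Heritrix 3 profile; verify both against your own crawler-beans.cxml.

```xml
<!-- Sketch only: bean id and class are assumed from the default profile -->
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- Same effect as the override key fetchHttp.ignoreCookies=true -->
  <property name="ignoreCookies" value="true"/>
</bean>
```

Using the override list keeps all operator-specific settings together at the top of the file, while editing the bean directly keeps each setting next to the component it affects.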
For longer or more complicated overrides, the "longerOverrides" bean is available. It is used to override properties that have multiple values or that can be overridden with a bean. For example, you could configure multiple seeds with the following configuration.
Overriding seed values
<bean id="longerOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <props>
      <prop key="seeds.textSource.value">
# URLS HERE
http://www.myhost1.net
http://www.myhost2.net
http://www.myhost3.net/pictures
      </prop>
    </props>
  </property>
</bean>
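The key seeds.textSource.value is a nested property path: it reaches the value property of the object held in the textSource property of the bean with id seeds. A sketch of the equivalent inline definition follows, assuming the class names used by the default Heritrix 3 profile (`org.archive.modules.seeds.TextSeedModule` and `org.archive.spring.ConfigString`); check them against your own crawler-beans.cxml before relying on this.

```xml
<!-- Sketch only: class names are assumed from the default profile -->
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
    <bean class="org.archive.spring.ConfigString">
      <!-- One seed URL per line, same content as the override above -->
      <property name="value">
        <value>
# URLS HERE
http://www.myhost1.net
http://www.myhost2.net
http://www.myhost3.net/pictures
        </value>
      </property>
    </bean>
  </property>
</bean>
```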