-
Notifications
You must be signed in to change notification settings - Fork 762
Processor Settings
Alex Osborne edited this page Jul 4, 2018
·
2 revisions
Processors have settings that effect crawls. Each processor can be enabled or disabled. If disabled, the Processor will not be applied to any URI. The following processors have default settings configured by Heritrix.
- preferenceDepthHops - Number of hops (of any sort) from a seed up to which a URI has higher priority scheduling than any remaining seed. For example, if set to one, items one hop (link, embed, redirect, etc.) away from a seed will be scheduled with HIGH priority. If set to -1, no preferencing will occur, and a breadth-first search with seeds processed before discovered links will proceed. If set to zero, a purely depth-first search will proceed, with all discovered links processed before remaining seeds. Seed redirects are treated as one hop from a seed.
- preferenceEmbedHops - Number of embed hops (ERX) to bump to front of host queue.
- canonicalizationPolicy - Ordered list of URI canonicalization rules. Rules are applied in the order listed from top to bottom.
- queueAssignmentPolicy - Defines how to assign URIs to queues. Can assign by host, by ip, by SURT-ordered authority, by SURT-ordered authority truncated to a topmost-assignable domain, and into one of a fixed set of buckets (1k).
- uriPrecedencePolicy - Sets an integer precedence value on individual URIs when they are first submitted to a frontier for scheduling. A URI's precedence directly affects which URI queue it is placed in, but does not affect a queue's precedence relative to other queues unless a queue-precedence-policy that consults URI precedence values is chosen.
- costAssignmentPolicy - Calculates an integer 'cost' value for the given CrawlURI.
- recheckScope - Recheck if URI is in scope. This is meaningful if the scope is altered during a crawl. When URIs are added to queues they are checked against the scope. Setting this value to true forces the URI to be checked against the scope when it comes out of the queue, possibly after the scope is altered.
- blockAll- Block all URIs from being processed. This is most likely to be used in overrides to easily reject certain hosts from being processed.
- blockByRegex - Block all URIs matching the regular expression from being processed.
- allowByRegex - Allow only URIs matching the regular expression to be processed.
- ipValidityDurationSeconds - The minimum interval for which a dns-record will be considered valid (in seconds). If the record's DNS TTL is larger, that will be used instead.
-
robotsValidityDurationSeconds- The time in seconds that fetched
robots.txt information is considered valid. If the value is set to
'0', then the
robots.txt
information will never expire. -
calculateRobotsOnly - Whether to calculate the robot's status of
a URI, without actually applying any exclusions found. If true,
excluded URIs will only be annotated in the
crawl.log
, but still fetched.
- acceptNonDnsResolves - Whether or not to fall-back to InetAddress resolution if a DNS lookup fails. InetAddress resolution may use local 'hosts' files or other mechanisms.
- digestContent - Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm - The algorithm (for example MD5 or SHA-1) used to perform an on-the-fly digest hash of retrieved content-bodies.
- timeoutSeconds - This setting determines how long an HTTP request will wait for a resource to respond. This setting should be set to a high value.
- maxLengthBytes - This setting determines the maximum number of bytes to download per document. When the limit is reached the document will be truncated. By default this setting is a very large value (in the exabyte range) that will theoretically never be reached.
-
maxFetchKBSec - The maximum rate in
KB/sec
to use when fetching data from a server. The default of 0 means no maximum. - defaultEncoding - The character encoding to use for files that do not have one specified in the HTTP response headers. The default is ISO-8859 -1.
-
shouldFetchBodyRule- DecideRules applied after receipt of HTTP
response headers but before download of the HTTP body. If any rule
returns FALSE, the fetch is aborted. Prerequisites such as
robots.txt
are excluded from filtering, i.e. they cannot be midfetch aborted. -
soTimeoutMs - If the socket is unresponsive for this number of
milliseconds, the request is cancelled. Setting the value to zero
(no timeout) is not recommended as it could hang a thread on an
unresponsive server. This timeout is used to time out socket opens
and socket reads. Make sure this value is less than
timeoutSeconds
for optimal configuration. This ensures at least one retry read. -
sendIfModifiedSince - Send
If-Modified-Since
header, if previousLast-Modified
fetch history information is available in URI history. -
sendIfNoneMatch - Send
If-None-Match
header, if previousEtag
fetch history information is available in URI history. -
sendConnectionClose - Send
Connection: close
header with every request. -
sendReferer- The
Referer
header contains the location the crawler came from. This is the page the current URI was discovered in. TheReferer
is usually logged on the remote server and can be of assistance to webmasters trying to figure out how a crawler got to a particular area on a site. -
sendRange- Send the
Range
header when there is a limit on the retrieved document size. This is for politeness purposes. TheRange
header states
that only the first n bytes are of interest. It is only pertinent ifmaxLengthBytes
is greater than zero. Sending theRange
header results in a
206 Partial Content
status response, which is better than cutting the response mid-download. On rare occasion, sending theRange
header will
generate416 Request Range Not Satisfiable
response. - ignoreCookies - Disable cookie handling.
-
sslTrustLevel- The SSL certificate trust level. The range is
from the default
open
(trust all certs including expired, selfsigned, and those for which we do not have a CA) throughloose
(trust all valid certificates including selfsigned), normal (all valid certificates not including selfsigned) and strict (Cert is valid and DN must match servername). -
acceptHeaders - Accept Headers to include in each request. Each
must be the complete header, e.g.,
Accept-Language: en
. - httpBindAddress- Local IP address or hostname to use when making connections (binding sockets). When not specified, uses default local address(es).
- httpProxyHost - The proxy host ip address.
- httpProxyPort - The proxy port.
- digestContent - Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm - Specifies which algorithm (for example MD5 or SHA-1) is used to perform an on-the-fly digest hash of retrieved content-bodies.
- extractJavascript - If true, in-page Javascript is scanned for strings that appear to be URIs. This typically finds both valid and invalid URIs. Attempts to fetch the invalid URIs can generate webmaster concern over odd crawler behavior. Default is true.
-
extractValueAttributes- If true, strings that look like URIs
found in unusual places (such as form
VALUE
attributes) will be extracted. This typically finds both valid and invalid URIs. Attempts to fetch the invalid URIs may generate webmaster concerns over odd crawler behavior. Default is true. -
ignoreFormActionUrls - If true, URIs appearing as the
ACTION
attribute in HTML FORMs are ignored. Default is false. -
extractOnlyFormGets - If true, only
ACTION
URIs with aMETHOD
ofGET
(explicit or implied) are extracted. Default is true. -
treatFramesAsEmbedLinks- If true,
FRAME/IFRAME SRC
-links are treated as embedded resources (likeIMG
, 'E' hop-type). Otherwise they are treated as navigational links. Default is true. -
ignoreUnexpectedHtml - If true, URIs which end in typical
non-HTML extensions (such as
.gif
) will not be scanned as if it were HTML. Default is true. - maxElementLength - The maximum length of an HTML element.
- maxAttributeNameLength - The maximum length of an HTML attribute name.
- maxAttributeValueLength - The maximum length of an HTML attribute value.
- compress - If this setting is true, WARC file content will be compressed. Note that compression applies to each content item stored in the WARC.
- prefix - The prefix of the WARC filename.
- maxFileSizeBytes- This setting determines the maximum size in bytes for each WARC file. Once the WARC file reaches this size, no URIs will be written to it and another WARC file will be created to handle the remaining URIs. This setting is not a hard limit. If exceptionally large URIs are being downloaded, the WARC file may greatly exceed this limit. Content items stored in WARC files are never split between WARCs.
- storePaths- This setting is a list of paths into which WARC files will be written. If more than one path is specified, a round-robin approach will be used to choose the path. This setting is safe to change during a crawl.
- poolMaxActive- Heritrix maintains a pool of WARC files that are each ready to accept downloaded documents. This approach is used to prevent WARC writing from being a bottleneck in a multithreaded environment. This setting establishes the maximum number of such files to keep ready. The default value is five. For small crawls that you want to limit to a single WARC file, this setting should be set to one.
- maxWaitForIdleMs - controls how long a thread waits for an reusable writer before considering creating a new one. If creation isn't allowed, threads will wait indefinitely for a writer to become available.
- skipIdenticalDigests- Whether to skip the writing of a record when URI history information is available and indicates the prior fetch had an identical content digest. Default is false.
- maxTotalBytesToWrite - Total file bytes to write to disk. Once the size of all files on disk has exceeded this limit, this processor will stop the crawler. A value of zero means no upper limit.
-
directory - The directory in which the
storePaths
will be configured. -
writeRequests - Whether to write
request
type records. Default is true. -
writeMetadata - Whether to write
metadata
type records. Default is true. -
writeRevisitForIdenticalDigests- Whether to write
revisit
type records when a URI's history indicates the previous fetch had an identical content digest. Default is true. -
writeRevisitForNotModified - Whether to write
revisit
type records when a304-Not Modified
response is received. Default is true.
- seedsRedirectNewSeeds - If enabled, any URI found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed.
- delayFactor - How many multiples of the last fetch elapsed time to wait before recontacting the same server.
-
minDelayMs- The minimum time to wait after a request has been
completed before recontacting the same server. This value overrides
the
delayFactor
. -
respectCrawlDelayUpToSeconds - Whether to respect a
Crawl-Delay
(in seconds) provided by the site'srobots.txt
. -
maxDelayMs- The maximum amount of time to wait after a request
has been completed before recontacting the same server. This value
overrides the
delayFactor
. - maxPerHostBandwidthUsageKbSec - The maximum per-host bandwidth usage.
Structured Guides:
User Guide
- Introduction
- New Features in 3.0 and 3.1
- Your First Crawl
- Checkpointing
- Main Console Page
- Profiles
- Heritrix Output
- Common Heritrix Use Cases
- Jobs
- Configuring Jobs and Profiles
- Processing Chains
- Credentials
- Creating Jobs and Profiles
- Outside the User Interface
- A Quick Guide to Creating a Profile
- Job Page
- Frontier
- Spring Framework
- Multiple Machine Crawling
- Heritrix3 on Mac OS X
- Heritrix3 on Windows
- Responsible Crawling
- Adding URIs mid-crawl
- Politeness parameters
- BeanShell Script For Downloading Video
- crawl manifest
- JVM Options
- Frontier queue budgets
- BeanShell User Notes
- Facebook and Twitter Scroll-down
- Deduping (Duplication Reduction)
- Force speculative embed URIs into single queue.
- Heritrix3 Useful Scripts
- How-To Feed URLs in bulk to a crawler
- MatchesListRegexDecideRule vs NotMatchesListRegexDecideRule
- WARC (Web ARChive)
- When taking a snapshot Heritrix renames crawl.log
- YouTube
- H3 Dev Notes for Crawl Operators
- Development Notes
- Spring Crawl Configuration
- Build Box
- Potential Cleanup-Refactorings
- Future Directions Brainstorming
- Documentation Wishlist
- Web Spam Detection for Heritrix
- Style Guide
- HOWTO Ship a Heritrix Release
- Heritrix in Eclipse