
Crawler configuration

Péter Bencze edited this page May 31, 2019 · 7 revisions

Configure the crawler

A CrawlerConfiguration specifies the behavior of the crawler. These settings come into effect when the crawler is started. To create a configuration, use the CrawlerConfigurationBuilder class.

The default settings are the following:

  • Crawl strategy: Breadth-first
  • Duplicate request filter: Enabled
  • Offsite request filter: Disabled
  • Max crawl depth: None
  • Crawl delay strategy: Fixed, 0 seconds (no delay)

Add crawl seeds

Crawl seeds are the initial requests fed to the crawler. They are crawled first.

Always make sure that the URLs you feed to the crawler are valid and well-formed!

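Use the builder's addCrawlSeed method, for example addCrawlSeed(CrawlRequest.createDefault("http://example.com")).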

The crawl strategy defines the order in which requests are crawled. There are 2 strategies available:

  • Breadth-first: orders requests by the lowest crawl depth
  • Depth-first: orders requests by the highest crawl depth
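The two orderings can be pictured with a priority queue. The following is a minimal, self-contained sketch and not the library's implementation; the class `CrawlStrategySketch`, the `Candidate` type and the `crawlOrder` helper are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class CrawlStrategySketch {

    // Hypothetical stand-in for a queued request: a URL plus its crawl depth.
    static final class Candidate {
        final String url;
        final int depth;

        Candidate(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    // Breadth-first: the request with the lowest crawl depth is taken first.
    static final Comparator<Candidate> BREADTH_FIRST =
            Comparator.comparingInt((Candidate c) -> c.depth);

    // Depth-first: the request with the highest crawl depth is taken first.
    static final Comparator<Candidate> DEPTH_FIRST = BREADTH_FIRST.reversed();

    // Drains a queue ordered by the given strategy and returns the URLs in crawl order.
    static List<String> crawlOrder(List<Candidate> candidates, Comparator<Candidate> strategy) {
        PriorityQueue<Candidate> frontier = new PriorityQueue<>(strategy);
        frontier.addAll(candidates);
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty()) {
            order.add(frontier.poll().url);
        }
        return order;
    }
}
```

With candidates at depths 1, 3 and 2, the breadth-first comparator yields them in the order 1, 2, 3, and the depth-first comparator in the order 3, 2, 1. How requests of equal depth are ordered is left unspecified in this sketch.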

The duplicate request filter ensures that each URL is crawled only once.

Please note that if you disable it, your crawler could easily get into a crawl loop. It is recommended to leave it enabled.
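To illustrate why the filter prevents crawl loops, here is a minimal sketch of the idea using a set of already-seen URLs. It is an assumption-laden illustration, not the library's code: `DuplicateFilterSketch` and `shouldCrawl` are hypothetical names, and the sketch compares URLs as exact strings, whereas a real crawler may compare requests differently.

```java
import java.util.HashSet;
import java.util.Set;

class DuplicateFilterSketch {

    // URLs that have already been accepted for crawling.
    private final Set<String> seenUrls = new HashSet<>();

    // Returns true the first time a URL is offered and false on every later attempt.
    boolean shouldCrawl(String url) {
        // Set.add returns false when the element was already present.
        return seenUrls.add(url);
    }
}
```

If two pages link to each other, the second visit to either URL is rejected, so the crawl terminates instead of bouncing between them.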

The offsite request filter ensures that only URLs belonging to the allowed crawl domains (see below) are crawled.
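The kind of check such a filter performs can be sketched as a host comparison. This is only an illustration: `OffsiteFilterSketch` and `isAllowed` are hypothetical names, and treating subdomains of an allowed domain as allowed is an assumption of this sketch, not something this page states.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

class OffsiteFilterSketch {

    // Returns true if the URL's host is an allowed domain or (by assumption) one of its subdomains.
    static boolean isAllowed(String url, List<String> allowedDomains) {
        String host = URI.create(url).getHost();
        if (host == null) {
            // No host component (for example a relative or opaque URI): treat as offsite.
            return false;
        }
        String normalizedHost = host.toLowerCase(Locale.ROOT);
        for (String domain : allowedDomains) {
            String normalizedDomain = domain.toLowerCase(Locale.ROOT);
            // The leading dot prevents "notexample.com" from matching "example.com".
            if (normalizedHost.equals(normalizedDomain)
                    || normalizedHost.endsWith("." + normalizedDomain)) {
                return true;
            }
        }
        return false;
    }
}
```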

Add allowed crawl domains

This setting takes effect only when the offsite request filter is enabled.

Allowed crawl domains represent internet domains in which crawling is permitted. You should specify all the domains explicitly.

Use the builder's addAllowedCrawlDomain method, for example addAllowedCrawlDomain("example.com").

Crawl delay strategies define how the delay between each request is determined.

There are 3 strategies available:

  • Fixed: the delay is constant and equals the duration specified in the configuration
  • Adaptive: the delay corresponds to the page loading time if it falls within the specified range; otherwise the minimum or maximum duration is used
  • Random: the delay is a random duration between the specified minimum and maximum
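The three strategies listed above can be expressed as three small functions. This is a sketch under stated assumptions rather than the library's implementation: `CrawlDelaySketch` and its method names are hypothetical, durations are in milliseconds, and the random variant treats both bounds as inclusive, which is an assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

class CrawlDelaySketch {

    // Fixed: always the configured duration.
    static long fixedDelay(long fixedMillis) {
        return fixedMillis;
    }

    // Adaptive: the page loading time, clamped into [minMillis, maxMillis].
    static long adaptiveDelay(long pageLoadMillis, long minMillis, long maxMillis) {
        return Math.max(minMillis, Math.min(maxMillis, pageLoadMillis));
    }

    // Random: a value drawn uniformly between the minimum and maximum (requires minMillis < maxMillis).
    static long randomDelay(long minMillis, long maxMillis) {
        // The upper bound of nextLong is exclusive, so add 1 to make the maximum reachable.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }
}
```

For example, with a range of 1000 to 60000 milliseconds, a page that loads in 500 ms gives an adaptive delay of 1000 ms, a page that loads in 2500 ms gives 2500 ms, and a page that loads in 90000 ms gives 60000 ms.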

The fixed crawl delay duration takes effect only when the fixed crawl delay strategy is used.

0 means no delay.

The minimum crawl delay duration takes effect only when the adaptive or random crawl delay strategy is used.

It must be less than the maximum and cannot be negative.

The default minimum is 1 second.

The maximum crawl delay duration takes effect only when the adaptive or random crawl delay strategy is used.

It must be greater than the minimum.

The default maximum is 1 minute.
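The constraints on the minimum and maximum durations can be summarized in a small validation helper. This is a hedged sketch: `DelayRangeSketch` and `validate` are hypothetical names, durations are expressed in milliseconds, and how the library itself reports an invalid range is not covered by this page.

```java
class DelayRangeSketch {

    // Defaults stated on this page: 1 second minimum, 1 minute maximum.
    static final long DEFAULT_MIN_MILLIS = 1_000L;
    static final long DEFAULT_MAX_MILLIS = 60_000L;

    // Throws IllegalArgumentException if the range violates the documented constraints.
    static void validate(long minMillis, long maxMillis) {
        if (minMillis < 0) {
            throw new IllegalArgumentException("The minimum duration cannot be negative.");
        }
        if (minMillis >= maxMillis) {
            throw new IllegalArgumentException("The minimum duration must be less than the maximum.");
        }
    }
}
```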

For the maximum crawl depth, 0 means no limit.


Example:

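CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        // Restrict crawling to the allowed crawl domains added below
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        // The first request to be crawled
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        // Delay follows the page loading time, bounded by the minimum and maximum durations
        .setCrawlDelayStrategy(CrawlDelayStrategy.ADAPTIVE)
        .build();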