Crawler configuration
A CrawlerConfiguration specifies the behavior of the crawler. These settings come into effect when the crawler is started. To create a configuration, use the CrawlerConfigurationBuilder class.
The default settings are the following:
- Crawl strategy: Breadth-first
- Duplicate request filter: Enabled
- Offsite request filter: Disabled
- Max crawl depth: None
- Crawl delay strategy: Fixed, 0 seconds (no delay)
Crawl seeds are the first requests fed to the crawler. They are crawled before any other request.
Always make sure that the URLs you feed to the crawler are valid and well-formed!
Use:
- addCrawlSeed: to add a single crawl seed
- addCrawlSeeds: to add a list of crawl seeds
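One way to check that a seed URL is well-formed before handing it to the crawler is sketched below. The class `SeedUrlCheck` and its method are hypothetical helpers built on `java.net.URI`; they are not part of the crawler's API, and the rule used here (absolute http or https URL with a host) is an assumption about what "valid" should mean.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the crawler library.
class SeedUrlCheck {

    // Returns true if the string parses as an absolute http(s) URI that has a host.
    static boolean isWellFormed(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Illegal characters or other syntax problems
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("http://example.com"));
        System.out.println(isWellFormed("example.com/page"));
    }
}
```

Only URLs that pass such a check would then be wrapped in a crawl request and added as seeds.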
There are 2 crawl strategies available:
- Breadth-first: orders requests by the lowest crawl depth
- Depth-first: orders requests by the highest crawl depth
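The difference between the two orderings can be illustrated with a priority queue keyed on crawl depth. This is only a sketch of the idea, not the crawler's internal implementation; `CrawlOrderSketch` and `Candidate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: shows how ordering by depth yields the two strategies.
class CrawlOrderSketch {

    // Hypothetical stand-in for a queued request: a URL plus its crawl depth.
    static final class Candidate {
        final String url;
        final int depth;

        Candidate(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    // Returns the URLs in the order in which they would be dequeued.
    static List<String> drain(List<Candidate> candidates, boolean breadthFirst) {
        Comparator<Candidate> byDepth = Comparator.comparingInt(c -> c.depth);
        // Breadth-first: lowest depth first. Depth-first: highest depth first.
        PriorityQueue<Candidate> queue =
                new PriorityQueue<>(breadthFirst ? byDepth : byDepth.reversed());
        queue.addAll(candidates);

        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().url);
        }
        return order;
    }
}
```

Requests with equal depth come out in no particular order in this sketch, because `PriorityQueue` does not guarantee stable ordering for ties.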
The duplicate request filter makes sure that each URL is crawled only once.
Please note that if you disable it, your crawler could easily get into a crawl loop. It is recommended to leave it enabled.
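Conceptually, a duplicate filter only needs to remember which URLs it has already seen. The sketch below shows that idea with a `HashSet`; `DuplicateFilterSketch` is a hypothetical class and does not reflect how the crawler implements the filter (for instance, it compares URLs as plain strings without any normalization).

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: remembers seen URLs so each is accepted once.
class DuplicateFilterSketch {

    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a URL is offered and false on every later attempt.
    boolean shouldCrawl(String url) {
        // Set.add returns false when the element was already present
        return seen.add(url);
    }
}
```

Without such a check, two pages that link to each other would be requested again and again, which is the crawl loop mentioned above.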
The offsite request filter makes sure that only URLs belonging to one of the allowed crawl domains (see below) are crawled.
Allowed crawl domains are the internet domains in which crawling is permitted. This setting takes effect only when the offsite request filter is enabled. You should specify all the domains explicitly.
Use:
- addAllowedCrawlDomain: to specify a single allowed crawl domain
- addAllowedCrawlDomains: to specify a list of allowed crawl domains
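The check performed by an offsite filter can be pictured as a comparison between the host of a URL and the allowed domains. The following is a rough sketch under the assumption that subdomains of an allowed domain also count as allowed; that assumption and the class `OffsiteFilterSketch` are illustrative and may differ from the crawler's actual behavior.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Illustrative only: host-based domain matching.
class OffsiteFilterSketch {

    // Returns true if the URL's host equals an allowed domain or is a subdomain of one.
    static boolean isAllowed(String url, Set<String> allowedDomains) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : allowedDomains) {
            // The leading dot prevents "notexample.com" from matching "example.com"
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }
}
```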
Crawl delay strategies define how the delay between each request is determined.
There are 3 strategies available:
- Fixed: the delay is constant and equals the duration specified in the configuration
- Adaptive: the delay corresponds to the page loading time if it falls within the specified range; otherwise, the minimum or maximum duration is used
- Random: the delay is chosen randomly between the specified minimum and maximum durations
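The three strategies described above can be sketched as simple functions over durations in milliseconds. This is an illustration of the described rules rather than the crawler's own code; `CrawlDelaySketch` is a hypothetical class, and treating the random range as inclusive on both ends is an assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: one function per delay strategy, all values in milliseconds.
class CrawlDelaySketch {

    // Fixed: always the configured duration.
    static long fixedDelay(long fixedMillis) {
        return fixedMillis;
    }

    // Adaptive: the page load time, clamped to the [min, max] range.
    static long adaptiveDelay(long pageLoadMillis, long minMillis, long maxMillis) {
        return Math.max(minMillis, Math.min(maxMillis, pageLoadMillis));
    }

    // Random: a uniformly chosen value between min and max (both inclusive here).
    static long randomDelay(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }
}
```

With a minimum of 1 second and a maximum of 1 minute, a page that loads in 500 ms gives an adaptive delay of 1000 ms, while a page that takes 2 minutes to load gives 60000 ms.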
The fixed crawl delay duration takes effect only when the fixed crawl delay strategy is used.
A value of 0 means no delay.
The minimum crawl delay duration takes effect only when the adaptive or random crawl delay strategy is used.
The duration should be less than the maximum and cannot be negative.
The default minimum is 1 second.
The maximum crawl delay duration takes effect only when the adaptive or random crawl delay strategy is used.
The duration should be greater than the minimum.
The default maximum is 1 minute.
For the maximum crawl depth, 0 means no limit.
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .setCrawlDelayStrategy(CrawlDelayStrategy.ADAPTIVE)
        .build();