The Scout-Crawler is a system based on StormCrawler that completely crawls a specified list of domains. It hands the crawled web pages to other components via RabbitMQ and sends a notification when a domain has finished crawling.
These instructions will help you set up and run the system.
- Java 1.8
- Storm 1.2.1
- Elasticsearch 6
- Kibana or Grafana (optional to see metrics)
- RabbitMQ (if you want to store the crawled pages somewhere)
The following guide on running the topology assumes that you have created the following files in the project path
Create a list of URLs you want to inject into Elasticsearch to be crawled
https://books.toscrape.com
https://quotes.toscrape.com
We also track an identifier of the domain through the architecture, which can be specified here as a metadata attribute. The tracked attribute has to be separated from the URL with a tab
https://books.toscrape.com shopId=1234
https://quotes.toscrape.com shopId=54321
For easy development and deployment it is handy to use a properties file whose values are substituted into the configuration. Here are some suggested parameters which can be replaced and kept separate for production and development
es.address: http://localhost:9200/
fetcher.threads: 10
crawler.memory: 2048
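As a minimal sketch of how this substitution could look (assuming es-crawler.flux uses Flux's ${...} placeholders, which are filled in from the file passed with --filter; the exact config keys depend on your flux file):
# hypothetical excerpt from es-crawler.flux
config:
  es.status.addresses: "${es.address}"
  fetcher.threads.number: ${fetcher.threads}
  worker.heap.memory.mb: ${crawler.memory}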
This file is currently used to store all access credentials and is also referenced in the es-crawler.flux file under resources. Here we store the access data for RabbitMQ
# The configuration for our RabbitMQ
config:
rabbitmq.host: "some.host.de"
rabbitmq.port: 5672
rabbitmq.username: "administrator"
rabbitmq.password: "abc"
# necessary, otherwise a NullPointerException is thrown
rabbitmq.ha.hosts: ""
A tutorial on how to install Storm can be found here. The following suggestions assume that you use Ubuntu 16.04. For comfortable use, add Storm to your PATH
export PATH=$PATH:/usr/local/storm/bin
To make sure all services run smoothly, create systemd unit files for them in /etc/systemd/system/
[Unit]
Description=Runs the storm nimbus service
[Service]
Type=simple
User=storm
Group=storm
ExecStart=/usr/local/storm/bin/storm nimbus
Restart=on-failure
SyslogIdentifier=storm-nimbus
[Install]
WantedBy=multi-user.target
[Unit]
Description=Runs the storm supervisor
[Service]
Type=simple
User=storm
Group=storm
ExecStart=/usr/local/storm/bin/storm supervisor
Restart=on-failure
SyslogIdentifier=storm-supervisor
[Install]
WantedBy=multi-user.target
[Unit]
Description=Runs the storm ui service
[Service]
Type=simple
User=storm
Group=storm
ExecStart=/usr/local/storm/bin/storm ui
Restart=on-failure
SyslogIdentifier=storm-ui
[Install]
WantedBy=multi-user.target
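Assuming the three unit files above are saved as storm-nimbus.service, storm-supervisor.service and storm-ui.service, they can be enabled and started as follows:
sudo systemctl daemon-reload
sudo systemctl enable storm-nimbus storm-supervisor storm-ui
sudo systemctl start storm-nimbus storm-supervisor storm-ui
# verify that a service came up
systemctl status storm-nimbus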
A good introduction to StormCrawler in general can be found here. To get the crawler running
- Clone the repository
- Change into the project directory and build the jar with
mvn clean package
storm jar target/spikeStormCrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-injector.flux --filter prod.properties
The seed URLs will be read from the seeds.txt in the folder and injected into the status index of Elasticsearch.
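To verify that the injection worked, you can query the status index directly; this assumes Elasticsearch runs on localhost:9200 and the index is named status:
curl "http://localhost:9200/status/_search?q=status:DISCOVERED&size=5&pretty"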
NOTE: Storm has to be running; this includes
- Supervisor
- Nimbus
- UI
storm jar target/spikeStormCrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote es-crawler.flux --filter prod.properties
This will deploy the crawler topology to the Storm cluster, where it can be monitored with the Storm UI.
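You can also check on the command line that the topology is running (the topology name is defined in the flux file):
storm list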
NOTE: The current topology requires RabbitMQ to be running with the specified exchange name and routing key; the crawled pages will be stored there. Instead of storing them in RabbitMQ, you can also use the BPFileBolt to store them as files.
The current topology is set up to crawl a specified list of domains one time. It uses Elasticsearch as storage for the URLs, which have either been DISCOVERED or FETCHED. The crawled pages will be stored in RabbitMQ. The crawler also notices within 30 seconds when a domain has been fully crawled and sends a message with the shopId of the finished domain to another queue.
The crawler can be configured in the following files
This is the file where you configure the topology of the crawler. It states where the configuration files can be found, which spouts and bolts are used, and how they communicate via streams. Instead of the RabbitMQ bolt, which writes the web pages to a queue, you could use the BPFileBolt, which would write them to a file.
This topology is used to inject the specified URLs into Elasticsearch which will then be used for crawling. Injection during the crawling process should not pose a problem as Elasticsearch sorts all URLs by nextFetchDate and groups them into queues according to their host.
This is the main configuration file of the crawler. Here you can set all important parameters, such as the crawl delay per page, the memory given to the crawler, or which metadata (like the shopId) should be carried through the topology. Learn more about the configuration on this wiki page of StormCrawler
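For illustration, a few of the parameters mentioned above could look like this in the crawler configuration; the values are placeholders, and metadata.transfer is the standard StormCrawler setting for carrying metadata such as the shopId to outlinks:
config:
  # delay in seconds between requests to the same host
  fetcher.server.delay: 1.0
  # metadata keys transferred from a page to its outlinks
  metadata.transfer:
    - shopId
  # memory per worker in MB
  topology.worker.max.heap.size.mb: 2048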
This configuration file is the configuration for Elasticsearch. Here you can find the parameters which specify, for example, which index should be used for the metrics or the status.
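A minimal sketch of such settings; the addresses and index names below are only examples:
config:
  es.status.addresses: "http://localhost:9200"
  es.status.index.name: "status"
  es.metrics.addresses: "http://localhost:9200"
  es.metrics.index.name: "metrics"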
A good inspiration for how StormCrawler can be scaled for massive crawls can be found in this benchmark of StormCrawler.
To get an impression of how StormCrawler is running, you can look at the following places
- Storm UI - If you installed Storm and kept the standard configuration, you can access the Storm UI on port 8080, where statistics about Storm can be found
- Kibana Dashboard - If you followed the tutorial, you will have configured Elasticsearch and Kibana properly. You can then find the dashboard under the category Dashboard. Here you will find current stats about the fetching process.
- Grafana - An even better visualization of the crawling process with more metrics is available with Grafana. See the provided dashboard of StormCrawler
Storm writes logs to the logs directory inside the Storm installation directory. The most important logs are:
- nimbus.log This will tell you, for example, when the crawler is started, stopped or halted
- worker.log In the folder worker-artifacts you can find a worker.log in the directory corresponding to the current topology. Here you can see which pages are fetched and parsed, how the crawler interacts with Elasticsearch, and which exceptions are thrown.
Instead of injecting URLs manually into the topology, an improvement would be to program a spout which emits URLs into Elasticsearch after a specified time.
There could also be a feature to control the crawler via another component. This could either be done from outside by a Spring REST service which controls Elasticsearch directly, or by a component inside Storm such as a bolt (see the rest-express framework). This component could implement features like
- Aborting the crawling for specified domains by deleting all URLs of the domain in Elasticsearch (see the sketch after this list)
- Starting the injection of specified domains
Also see this for advantages and disadvantages of the two approaches.
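As a sketch of the first feature, aborting the crawl of a domain could be done with Elasticsearch's delete-by-query API; the index and field names below are assumptions based on the setup described above:
# delete all URLs of the shop with shopId 1234 from the status index
curl -s -X POST -H 'Content-Type: application/json' \
  "http://localhost:9200/status/_delete_by_query" -d '
{
  "query": { "term": { "metadata.shopId": "1234" } }
}'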
Use Selenium / headless Chrome for problematic web shops which need to execute JavaScript on the page
StormCrawler already includes the option to use a RemoteDriver which controls, for example, a Google Chrome browser. For this, use the chromedriver tool and see the old branch or crawler-default.yaml on how to activate the feature.
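A rough sketch of how activating it could look in the crawler configuration; check the crawler-default.yaml of your StormCrawler version for the exact keys, and note that the chromedriver address below is only an assumption:
config:
  # use the Selenium-based protocol implementation instead of the default HTTP client
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  # address(es) of the remote driver, e.g. a locally running chromedriver
  selenium.addresses:
    - "http://localhost:9515"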
When you are crawling a web shop and are only interested in the articles, it might be wise to skip non-article pages like the FAQ or the blog of the web shop. To do this, you could find a URL pattern which reliably matches the unnecessary pages. Such a URL filter pattern can be added in src/main/resources/urlfiltersCustom.json to skip those pages.
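A sketch of what such a filter entry might look like, assuming urlfiltersCustom.json follows the standard StormCrawler urlfilters.json layout; the referenced regex file name is hypothetical and would contain lines such as -/faq/ and -/blog/ to exclude those paths, followed by +. to accept everything else:
{
  "com.digitalpebble.stormcrawler.filtering.URLFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter",
      "name": "RegexURLFilter",
      "params": {
        "regexFilterFile": "skip-non-article-pages.txt"
      }
    }
  ]
}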
For example, in viking.de/robots.txt you can see the following rule, which is not yet supported:
# only visit between 23:00 and 06:45 UTC
Visit-time: 2300-0645
Currently the component BPFinishedShopBolt keeps an in-memory HashSet of domains for which a notification has already been sent. If the crawler is restarted during the process, it will send out notifications again for all domains which have finished up to that point.
Currently we run an aggregation query which asks, per domain, for the number of URLs which have the status DISCOVERED. This is quite costly when Elasticsearch contains a lot of URLs. With the help of a custom bolt this could be made more efficient: the bolt could track which domains are currently being crawled and give a signal when a domain has not been fetched from for a specified time. Only then would the costly query be executed, which would be more efficient overall.
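For reference, such an aggregation could look roughly like the query below; the index and field names are assumptions based on a typical StormCrawler status mapping:
curl -s -H 'Content-Type: application/json' "http://localhost:9200/status/_search?size=0" -d '
{
  "query": { "term": { "status": "DISCOVERED" } },
  "aggs": {
    "per_domain": { "terms": { "field": "metadata.shopId" } }
  }
}'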
Currently, if you want to explore the Elasticsearch database, Kibana will not show URLs with the status FETCHED even though they are in Elasticsearch. This could also be a bug in the configuration used here, as this apparently appears on other setups as well.
It seems that the crawler is crawling in intervals of 5 to 10 minutes. Possible reasons for this problem could be
- In the current setup there is not enough Memory for sufficient fetcher threads --> a lot of scheduling
- Not enough shards from Elasticsearch are available to give sufficient URLs
- A bug in StormCrawler itself
- A low TTL value for URLs which have to be picked up and fetched (see here)
This project was done within a Bachelor project of the Hasso Plattner Institute in cooperation with Idealo Internet GmbH. Thanks to my team:
- Leonardo Hübscher
- Dmitrii Zhamanakov
- Tom Schwarzburg
- Daniel Lindner