Technical Abstract
This crawler is built on Scrapy, a Python framework for scraping specific data from websites.
The crawler is divided into three major modules:
- The AYpiItem - Defines what information is to be scraped from the pages. This module is commonly referred to as the item.
- The AYpiSpider - Commonly referred to as the spider. It opens each link as an HTTP request, parses the pages, and scrapes the specific data.
- The AYpiPipeline - Used for post-processing the scraped data. It is commonly referred to as the pipeline.
There is also a Settings module, which defines the various configuration settings of the project.
The crawler starts from a set of seed URLs defined inside the spider. The underlying framework iterates over these URLs, receives an HTTP response object for each, parses the HTML page, and builds a parse tree in memory.
Note: the set of seed URLs is known to have the 'foruri' attribute.
The spider then selects the relevant nodes from the tree, i.e. the attributes that define a renarration ("foruri" and "rec"), using XPath selectors.
The items scraped by the spider are then processed by the pipeline module, which stores each item as JSON in an external file that is used for searching.
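The post-processing step could look roughly like the following pipeline, which serialises each item to a line of JSON in an external file. The filename and the one-object-per-line layout are assumptions, not necessarily the project's actual output format:

```python
import json


class AYpiPipeline:
    """Write each scraped item out as one JSON object per line."""

    def open_spider(self, spider):
        # 'items.json' is a placeholder for the project's real
        # output file.
        self.file = open('items.json', 'w')

    def process_item(self, item, spider):
        # dict(item) also accepts a scrapy.Item instance.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

Once the pipeline is enabled in the Settings module, Scrapy calls these hooks automatically as the spider runs.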