
Technical Abstract

This crawler is built on Scrapy, a Python framework for extracting specific, structured data from websites.

High-level Design

The crawler is divided into three major modules:

  1. The AYpiItem - This module defines what information is to be scraped from the pages; it is commonly referred to as the item (a sketch of its definition follows this list).
  2. The AYpiSpider - This module, commonly referred to as the spider, fetches each link as an HTTP request, parses the resulting pages, and scrapes the specific data.
  3. The AYpiPipeline - This module handles post-processing of the scraped items; it is commonly referred to as the pipeline.
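
For illustration, here is a minimal sketch of what the item definition might look like, assuming the current Scrapy API (the original 2011-era project would have used an older interface); the fields mirror the 'foruri' and 'rec' attributes described below, though the real AYpiItem may declare more:

```python
import scrapy

class AYpiItem(scrapy.Item):
    # Attributes that define a renarration (see "Scraping the data" below):
    foruri = scrapy.Field()  # URL of the page being renarrated
    rec = scrapy.Field()     # the renarration record itself
```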

There is also a Settings module, which defines the various configuration settings of the project.

The crawler starts from a set of seed URLs defined inside the spider. The underlying framework then iterates over the seed URLs, receives an HTTP response object for each, parses the HTML page, and builds a parse tree in memory.

Note: The pages at the seed URLs are known to contain the 'foruri' attribute.
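
A hedged sketch of how the seed URLs and parse callback fit together in a Scrapy spider, again assuming the current API; the spider name and URL below are placeholders, not the project's actual values:

```python
import scrapy

class AYpiSpider(scrapy.Spider):
    name = "AYpi"
    # Seed URLs are hard-coded in the spider; placeholder shown here.
    start_urls = ["http://example.com/renarrated-page.html"]

    def parse(self, response):
        # By the time parse() is called, Scrapy has fetched the URL and
        # parsed the HTML; 'response' wraps the HTTP response and the tree.
        pass
```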

Scraping the data

The spider then selects the relevant nodes from the tree using XPath selectors, namely the nodes carrying the attributes that define a renarration: "foruri" and "rec".
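
As a sketch, the selection step inside the spider's parse() method (replacing the stub shown earlier) might look like the following, assuming the renarration attributes appear directly on HTML elements; the exact XPath expressions used by AYpiSpider may differ:

```python
    def parse(self, response):
        # Select every element carrying a 'foruri' attribute, then read
        # the renarration attributes off each matching node.
        for node in response.xpath("//*[@foruri]"):
            item = AYpiItem()
            item["foruri"] = node.xpath("@foruri").get()
            item["rec"] = node.xpath("@rec").get()
            yield item
```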

Post-processing

The items scraped by the spider are then processed by the pipeline module, which stores each item as JSON in an external file; this file is later used for searching.
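
A minimal sketch of such a pipeline, writing one JSON object per line; the output file name here is a placeholder, and the real AYpiPipeline may organize the file differently:

```python
import json

class AYpiPipeline:
    def open_spider(self, spider):
        self.file = open("items.json", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the scraped item as one JSON line.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

In Scrapy, such a pipeline is activated through the Settings module mentioned above (the ITEM_PIPELINES setting).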
