
Best used to obtain one "stream" of data at a time, without trying to combine data from different pages

scrapy runspider spider.py -o file.json

Scrapy shell

Display HTML source of the scraped page

print(response.text)

Fetch {url}

fetch('url')

Select elements with a CSS selector

# Returns a `SelectorList`
response.css('p')
# Retrieve full HTML elements
response.css('p').extract()

Retrieve only the text within the element. `extract_first()` returns `None` when nothing matches, whereas indexing the list from `extract()` with `[0]` raises an `IndexError`

response.css('p::text').extract()
response.css('p::text').extract_first()
response.css('p::text').extract()[0]

Get the href attribute value for an anchor tag

response.css('a').attrib['href']
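
An equivalent approach uses the `::attr()` CSS pseudo-selector, which returns a `SelectorList` and so composes with `extract_first()`:

response.css('a::attr(href)').extract_first()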

Launch the Scrapy shell and fetch $URL

scrapy shell $URL

Generate a default spider named {quotes} that will be restricted to {domain}

scrapy genspider quotes domain

Run a self-contained spider script

scrapy runspider scrapy1.py

Run a spider, saving scraped data to a JSON file

scrapy runspider spider.py -o items.json

parse() is the method which contains most of the logic of the spider, centered on the yield keyword. For multiple items, a structural basis for iteration must be found, and data is yielded for each iteration (see the sketch below)
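
A minimal sketch of such a spider, runnable with scrapy runspider as above; the quotes.toscrape.com URL and the div.quote, span.text, and small.author selectors are illustrative assumptions:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Each div.quote is the structural basis for iteration
        for quote in response.css('div.quote'):
            # One item is yielded per iteration
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }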

Pagination

Extract the URL from the next-page link using standard CSS selection techniques

Add the domain name to a relative link

response.urljoin(next_page_url)

Recursively call the parse method again on the next page

yield scrapy.Request(url=next_page_url, callback=self.parse)
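
Putting the pagination steps together, the parse method of the spider sketched above might look like this; the li.next a selector is an assumption about the page's markup:

def parse(self, response):
    # Yield one item per quote on the current page, as before
    for quote in response.css('div.quote'):
        yield {'text': quote.css('span.text::text').extract_first()}

    # Extract the relative URL of the next-page link, if any
    next_page_url = response.css('li.next a::attr(href)').extract_first()
    if next_page_url is not None:
        # Resolve the relative link and recurse into parse for the next page
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)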

Scrape detail pages

  • parse_details would be a spider method sibling to the main parse method
  • if a detail page has more information than the main page, the yield keyword should be in parse_details rather than parse (see the sketch below)

yield scrapy.Request(url={url}, callback=self.parse_details)
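
A sketch of the two sibling methods inside such a spider; the a.detail, h1, and div.description selectors are assumptions about the site's markup:

def parse(self, response):
    # Follow each link from the listing page to its detail page
    for href in response.css('a.detail::attr(href)').extract():
        yield scrapy.Request(url=response.urljoin(href), callback=self.parse_details)

def parse_details(self, response):
    # The detail page holds more information, so the item is yielded here
    yield {
        'title': response.css('h1::text').extract_first(),
        'description': response.css('div.description::text').extract_first(),
    }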