scrapy
Best used to obtain one "stream" of data at a time, rather than trying to obtain data from several different kinds of pages at once
scrapy runspider spider.py -o file.json
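For context, a minimal spider.py that the command above could run might look like the sketch below (the quotes.toscrape.com URL and the title selector are illustrative assumptions, not part of these notes):
# spider.py - minimal sketch; the URL and selector are illustrative assumptions
import scrapy

class TitleSpider(scrapy.Spider):
    name = 'title'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        # Each yielded dict becomes one record in file.json
        yield {'title': response.css('title::text').extract_first()}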
Display HTML source of the scraped page
print(response.text)
Get {URL}
fetch('url')
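A typical shell session combining the two commands above might look like this (the URL is an illustrative assumption):
# Inside the Scrapy shell
fetch('https://quotes.toscrape.com')          # download the page into `response`
print(response.text)                          # dump the full HTML source
response.css('title::text').extract_first()   # quick sanity check of a selector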
Select elements with a CSS selector
# Returns a `SelectorList`
response.css('p')
# Retrieve full HTML elements
response.css('p').extract()
# Retrieve only the text within the element
response.css('p::text').extract()
response.css('p::text').extract_first()
response.css('p::text').extract()[0]
Get the href attribute value for an anchor tag
response.css('a').attrib['href']
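To collect the href of every anchor on the page at once, the attr pseudo-selector can be combined with extract():
# All href values on the page, not just the first anchor
response.css('a::attr(href)').extract()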
Launch Scrapy shell and scrape $URL
scrapy shell $URL
Generate a default spider named {quotes} that is restricted to {domain}
scrapy genspider quotes domain
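The generated file looks roughly like the sketch below (the exact template varies by Scrapy version):
# quotes.py - approximate output of `scrapy genspider quotes domain`
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['domain']
    start_urls = ['http://domain/']

    def parse(self, response):
        pass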
Run a spider
scrapy runspider scrapy1.py
Run a spider, saving scraped data to a JSON file
scrapy runspider spider.py -o items.json
The parse method contains most of the logic of the spider, especially after the yield keyword. For multiple items, a structural basis for iteration must be found, and data is yielded on each iteration
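A sketch of that iteration pattern, assuming a page where each item sits in a div.quote container (the selectors are illustrative):
def parse(self, response):
    # The repeated container element is the structural basis for iteration
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
        }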
Extract URL from link using standard CSS selection techniques
Add the domain name to a relative link
response.urljoin(relative_url)  # relative_url is the href extracted above
Recursively call the parse method again on the next page
yield scrapy.Request(url=next_page_url, callback=self.parse)
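Putting the last three steps together, a parse method that follows pagination might look like this sketch (the li.next a selector is an assumption about the target page):
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {'text': quote.css('span.text::text').extract_first()}

    # Extract the relative href of the "next" link, make it absolute, then recurse
    next_page = response.css('li.next a::attr(href)').extract_first()
    if next_page is not None:
        next_page_url = response.urljoin(next_page)
        yield scrapy.Request(url=next_page_url, callback=self.parse)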
parse_details would be a spider method sibling to the main parse method. If a detail page contains more information than the main page, then the yield keyword should be in parse_details
yield scrapy.Request(url={url}, callback=self.parse_details)
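A sketch of that detail-page pattern (the div.item container, the h1/description selectors, and the field names are assumptions for illustration):
def parse(self, response):
    for item in response.css('div.item'):
        detail_href = item.css('a::attr(href)').extract_first()
        # Hand each detail page off to parse_details, which does the yielding
        yield scrapy.Request(url=response.urljoin(detail_href),
                             callback=self.parse_details)

def parse_details(self, response):
    yield {
        'title': response.css('h1::text').extract_first(),
        'description': response.css('div.description::text').extract_first(),
    }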
- argparse ?
- array ?
- asyncio ?
- bisect ?
- csv ?
- ctypes ?
- curses ?
- datetime ?
- functools ?
- getpass ?
- glob ?
- heapq ?
- http ?
- json ?
- logging ?
- optparse ?
- os ?
- pathlib ?
- platform ?
- pythonnet ?
- random ?
- socket ?
- subprocess ?
- sqlite3 ?
- sys ?
- termcolor ?
- threading ?
- trace ?
- typing ?
- unittest ?
- urllib ?
- venv ?
- weakref ?
- winrm ?