IRE 2024: Web scraping with Python

This repo contains materials for a half-day workshop at the IRE 2024 conference in Anaheim on using Python to scrape data from websites.

The session is scheduled for Sunday, June 23, from 9 a.m. to 12:30 p.m. in Orange County Ballroom 3.

First step

Open the cmd application. Copy and paste this text and hit enter:

cd Desktop\hands_on_classes\20240623-sunday-web-scraping-with-python-pre-registered-attendees-only && .\env\Scripts\activate && jupyter lab

Course outline

  • Do you really need to scrape this?
  • Process overview:
    • Fetch, parse, write data to file
    • Some best practices
      • Make sure you're comfortable that your scraping project is allowable (legally, ethically, etc.)
      • Don't DDoS your target server -- space out your requests
      • When feasible, save copies of pages locally, then scrape from those files
      • Rotate user-agent strings and other headers if necessary to avoid bot detection
  • Using your favorite browser's inspection tools to deconstruct the target page(s)
    • See if the data is delivered to the page via an undocumented API in a ready-to-use format, such as JSON (example 1, example 2) -- Postman or similar software is handy for testing out API calls
    • Is the HTML part of the actual page structure, or is it built on the fly when the page loads? (example)
    • Can you open the URL directly in an incognito window and get to the same content, or does the page require a specific state to deliver the content (via search navigation, etc.)? (example)
    • Are there URL query parameters that you can tweak to get different results? (example)
  • Choose tools that make the most sense for your target page(s) -- a few popular options:
  • Overview of our Python setup today
    • Activating the virtual environment
    • Jupyter notebooks
    • Running .py files from the command line
  • Projects in this repo:
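
As a rough illustration of the "fetch, parse, write data to file" process in the outline above, here is a minimal sketch that also follows two of the best practices listed there: it saves a local copy of each page and scrapes from that copy, and it pauses after every real request. It is not taken from the workshop notebooks. It sticks to the Python standard library so it runs without installing anything, and it assumes the target page holds its data in a simple, well-formed HTML table. The function names, the default user-agent string, and the two-second delay are all hypothetical choices made for the example.

```python
import csv
import time
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def fetch(url, cache_path, delay=2.0, user_agent="example-scraper/0.1"):
    """Return the page's HTML, downloading it only if no local copy exists.

    The user_agent and delay defaults are placeholders, not recommendations.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Scrape from the saved copy instead of hitting the server again.
        return cache_path.read_text(encoding="utf-8")
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    cache_path.write_text(html, encoding="utf-8")
    time.sleep(delay)  # pause after each real request to go easy on the server
    return html


def parse(html):
    """Turn an HTML string into a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def write(rows, csv_path):
    """Write the rows to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

With a real URL in hand, the three steps chain together as `write(parse(fetch(url, "page.html")), "data.csv")`, where both file names are placeholders. Because `fetch` checks for the saved copy first, rerunning the script while you debug the parsing step sends no further requests to the server.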

Other resources

Running this code at home

  • Install Python, if you haven't already (here's our guide)
  • Clone or download this repo
  • cd into the repo directory and install the requirements, preferably into a virtual environment using your tooling of choice: pip install -r requirements.txt
  • Run playwright install to download the browsers Playwright needs
  • Run jupyter lab to launch the notebook server