IRE 2024: Web scraping with Python

This repo contains materials for a half-day workshop at the IRE 2024 conference in Anaheim on using Python to scrape data from websites.

The session is scheduled for Sunday, June 23, from 9 a.m. to 12:30 p.m. in Orange County Ballroom 3.

Open the cmd application. Copy and paste this text and hit enter:

cd Desktop\hands_on_classes\20240623-sunday-web-scraping-with-python-pre-registered-attendees-only && .\env\Scripts\activate && jupyter lab

Do you really need to scrape this?
Process overview:
- Fetch, parse, write data to file
- Some best practices
  - Make sure you're comfortable that your scraping project is allowable (legally, ethically, etc.)
  - Don't DDoS your target server: pause between requests
  - When feasible, save copies of pages locally, then scrape from those files
  - Rotate user-agent strings and other headers if necessary to avoid bot detection
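The fetch, parse, write loop and the best practices above (throttling requests, scraping from saved copies) could be sketched roughly like this. It is a minimal illustration that sticks to the standard library so it runs without extra installs; on a real project you would more likely reach for requests and BeautifulSoup. The function names, the cache path, the one-second delay and the assumption that the data sits in a plain HTML table are all invented for the example.

```python
import csv
import time
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


def fetch_cached(url, cache_path, delay=1.0, headers=None, fetcher=None):
    """Return the page HTML, downloading only if no local copy exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Scrape from the saved copy instead of hitting the server again
        return cache_path.read_text(encoding="utf-8")

    if fetcher is None:
        # Default fetcher: a plain GET, optionally with custom headers
        def fetcher(target):
            request = urllib.request.Request(target, headers=headers or {})
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read().decode("utf-8")

    html = fetcher(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(html, encoding="utf-8")
    time.sleep(delay)  # pause after each real download to go easy on the server
    return html


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Parse an HTML string into a list of rows (lists of cell text)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def write_csv(rows, out_path):
    """Write the parsed rows to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

A typical run would chain the three steps: `write_csv(parse_rows(fetch_cached(url, "cache/page.html")), "data.csv")`. The optional `fetcher` argument exists only so the download step can be swapped out, for example to try the parser on a string of HTML without touching the network.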
Use your favorite browser's inspection tools to deconstruct the target page(s)
- See if the data is delivered to the page via an undocumented API in a ready-to-use format, such as JSON (example 1, example 2) -- Postman or similar software is handy for testing out API calls
- Is the HTML part of the actual page structure, or is it built on the fly when the page loads? (example)
- Can you open the URL directly in an incognito window and get to the same content, or does the page require a specific state to deliver the content (via search navigation, etc.)? (example)
- Are there URL query parameters that you can tweak to get different results? (example)
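Two of the checks above, tweaking query parameters and reading JSON from an undocumented API, can be illustrated with a short sketch. The URL, the `page` and `per_page` parameter names and the `results` key are hypothetical; a real site will use its own names, which you would find in the browser's network tab.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_params(url, **changes):
    """Return url with the given query parameters added or overridden."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    for key, value in changes.items():
        params[key] = [str(value)]
    query = urlencode(params, doseq=True)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def records_from_json(payload, key="results"):
    """Pull the list of records out of a JSON response body (hypothetical shape)."""
    data = json.loads(payload)
    return data.get(key, [])


# Hypothetical search URL copied from the browser's address bar
base = "https://example.com/search?q=fire&page=1"

# Build the URLs for the first three pages of results
page_urls = [with_params(base, page=n) for n in range(1, 4)]
```

If changing a parameter such as `page` in the browser returns a different slice of the data, a loop like the one above is often all the "navigation" a scraper needs.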
Choose tools that make the most sense for your target page(s) -- a few popular options:
- requests and BeautifulSoup
- playwright (optionally using BeautifulSoup for the HTML parsing)
- scrapy for larger spidering/crawling tasks
Overview of our Python setup today
- Activating the virtual environment
- Jupyter notebooks
- Running .py files from the command line
Projects in this repo:

Try GitHub Actions if you need to put your scraper on a timer (you could also drop your script on a remote server, such as DigitalOcean, PythonAnywhere or Heroku, with a crontab configuration).
Tipsheet on inspecting web elements
Tipsheet on saving HTML files before scraping them
Tipsheet with some miscellaneous scraping tips

Install Python, if you haven't already (here's our guide)
Clone or download this repo
cd into the repo directory and install the requirements, preferably into a virtual environment using your tooling of choice: pip install -r requirements.txt
Run playwright install to download the browser binaries that playwright drives
Run jupyter lab to launch the notebook server

Provide feedback