| Introduction | Setting Up | Coding Time | The Results | Optional Feature | Resources |
|---|---|---|---|---|---|
haha, get it, it's an actual table 🤭
Hey there party people 👋!
How is everyone? That's a rhetorical question, I can't hear you... Anywho, I have put together a fun little template repository to help introduce the topic of Web Scraping in Python🐍! We will go through the different requirements and installations to get started, how to lay out your code, the actual coding, common mistakes, etc, etc.
Ready... Set... GO!
Before we can get to the fun stuff there are a couple of things you need to have installed. Please take the time to carefully go through the installation processes❤️:
2. Selenium

This is the Python package that allows us to interact with the WebDriver from our code.
```bash
pip install selenium # or pip3 install selenium
```
3. ChromeDriver

We will be scraping using something called a ChromeDriver -- a type of WebDriver made specifically for Chrome. A WebDriver is an open source tool for automated testing of webapps across many browsers[^1].
- To download a ChromeDriver, first check what version of Google Chrome you are currently running.
- Then navigate here and click the download that matches the version number you just found.
- Finally, extract the chromedriver.exe file and save it in the same folder as your code.
4. SQLite3 (for optional feature)
If you're interested in storing the results of your scraping in a quick and simple database or csv file! (Heads up: `sqlite3` actually ships with Python's standard library, so you may already have everything you need.)

```bash
pip install pysqlite3 # or pip3 install pysqlite3
```
Alrighty, assuming that is all done, it's time to get coding! To start, go ahead and open the `scraper.py` file that came with this repository.
A breakdown of what each of the imports does:
```python
from selenium import webdriver # so we can instantiate a WebDriver
from selenium.webdriver.common.keys import Keys # lets us 'type' things in the browser (i.e. in the searchbar)
from selenium.webdriver.chrome.options import Options # so we can configure our WebDriver settings (e.g. how verbose it should be)
from selenium.webdriver.common.by import By # to let selenium find elements *by* different identifiers (e.g. by class)
import time # because sometimes we have to tell our program to wait a bit!
```
With that all sorted, let's set up our WebDriver. Most of the time the following code won't change from project to project, so don't feel bad about copy-pasting it whenever you need it!
```python
# SETTING UP BROWSER
#-----------------------
chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_experimental_option("detach", True)
chrome_options.add_argument("--log-level=3")
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(options=chrome_options)
browser.set_window_size(1080, 960)
```
The Breakdown

- `chrome_options = Options()` allows you to configure your WebDriver to suit your needs. There are a gazillion-and-one different option arguments you can add and experiment with.
- `chrome_options.add_argument("--headless")` makes sure that when you run the code, the actual Chrome browser doesn't pop up. Comment this out for now 🤓
- `chrome_options.add_experimental_option("detach", True)` helps make sure the browser we control doesn't close every time our program finishes running! This helps us see how far our program got/where an error is occurring/our victory!
- `chrome_options.add_experimental_option("prefs", prefs)` handles any Chrome notifications (e.g. Allow/Block permission boxes) that confuse our scraper😖
- `chrome_options.add_argument("--log-level=3")` only shows you the important warnings (thank me later): INFO = 0, WARNING = 1, LOG_ERROR = 2, LOG_FATAL = 3.
- `browser = webdriver.Chrome(options=chrome_options)` instantiates a ChromeDriver with the options we chose.
- `browser.set_window_size(1080, 960)` is just for funsies and, I think, pretty self-explanatory.
N.B. You do not have to call your WebDriver `browser`, this is just my personal preference. When you read things online, it will usually be called either `browser` or `driver`.
For the sake of this 'tutorial', we will be navigating to, and scraping from, Reddit, as it is (and should continue to be) legal to scrape from. Please always make sure you double-check the rules different sites have on web scraping before you start a project!
```python
# TO REDDIT WE GO
#-----------------------
reddit_url = "https://www.reddit.com"
browser.get(reddit_url)
```
But, quite frankly, it's not enough to just go to a website; we also want to be able to interact with it, right? Trick question: Right! Interacting with a website could mean:
- Clicking on a button
- Typing in a searchbar or comment section
- Pressing enter
- Scrolling
- etc
To keep things simple we'll just be focusing on the first three... but before we can do that we need to know how to find the elements we want to interact with. How do we find a button to click? Or a searchbar to type in? Check below!
There are several different ways to locate elements on a webpage using selenium. Here are the 4 methods I use most frequently:
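Here's a quick sketch of all four in action (the identifiers below are made up purely for illustration):

```python
element = browser.find_element(By.ID, "unique-button-id")             # by id
element = browser.find_element(By.NAME, "q")                          # by name
element = browser.find_element(By.CLASS_NAME, "post-title")           # by class name
element = browser.find_element(By.XPATH, "//button[@type='submit']")  # by xpath
```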
So, going back to our Reddit example: We have navigated to the Reddit website, but now we want to find the searchbar so we can look for a specific subreddit.
```python
# N.B. you tend to find that most searchbars' name is just 'q'
searchbar = browser.find_element(By.NAME, "q")
```
Again, however, there is more to be done! Finding an element is not the same as using that element. We can find a button but not necessarily use that button. Worry not though, using elements tends to be super easy! For our purposes, we will focus on:
- If we want to click on something (e.g. a button):

  ```python
  button = browser.find_element(By.ID, "some button id")
  button.click()
  ```
- If we want to type into something (e.g. a searchbar):

  ```python
  searchbar = browser.find_element(By.NAME, "q")
  searchbar.send_keys("this is something i want to type in the searchbar")
  # searchbar.click() # sometimes you need this👀
  searchbar.send_keys(Keys.RETURN) # presses 'Enter' (the same as clicking the search button)
  ```
Trust me, with the skills we just covered you are 90% of the way to launching your own scraper! Let's just put the final few pieces together. Here's the plan:
- Search for "Beans" in the searchbar on Reddit's homepage
- Click the
r/Beans
subreddit link - Get a list of all the post titles in the subreddit2
- Print it out to our terminal or insert optional feature here
Give those steps a try by yourself if you think you can. Step 3 is a little harder, so don't feel shy about taking a peek at my sample code below:
Steps One & Two
```python
def find_subreddit(subreddit):
    """Game Plan:
        - Navigates to Reddit
        - Searches for the subreddit
        - Clicks on link to subreddit

    Args:
        subreddit (str): the subreddit to be visited
    """
    # Navigate to reddit
    reddit_url = "https://www.reddit.com"
    browser.get(reddit_url)

    # Search for subreddit using searchbar
    searchbar = browser.find_element(By.NAME, "q")
    searchbar.send_keys(subreddit)
    searchbar.click()
    searchbar.send_keys(Keys.RETURN)

    # Click subreddit link (N.B. this class name comes straight from Reddit's
    # markup and may well have changed by the time you read this!)
    time.sleep(1)
    subreddit_link = browser.find_element(By.CLASS_NAME, "_1Nla8vW02K39sy0E826Iug")
    subreddit_link.click()
```
Step Three
```python
def get_titles():
    """Game Plan:
        - Choose how you want to find the title elements
            - e.g by class name, tag name, xpath, etc
        - Use browser.find_elements(.........)
        - Convert each element in the list into text

    Returns:
        titles (list): a list of titles of posts found in the subreddit
    """
    titles = []

    # Get titles in raw format
    raw_titles = browser.find_elements(By.CLASS_NAME, "_eYtD2XCVieq6emjKBH3m")

    # Convert titles (which are of type 'WebElement') into their text
    for title in raw_titles:
        titles.append(title.text)
    return titles
```
Step Four
```python
def display(titles_to_display):
    """Game Plan:
        - Display your results in a cute format <3

    Args:
        titles_to_display (list): the titles to be displayed cutely
    """
    titles_to_display = set(titles_to_display) # getting rid of duplicates!!
    random_ascii_art_from_the_internet = """
⠀⣰⡶⣶⣶⡶⠤⠶⢿⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣿⣿⣿⢻⣧⣀⠀⠀⣿⣿⣿⣏⠷⣦⣀⡀⠀⠀⠀⣀⣀⣀⣄⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⢿⣿⠙⠻⣿⣿⢶⣄⠙⠻⠟⠋⠀⠀⠈⣙⣿⠛⠛⢻⣹⣥⣿⣫⠼⠋⠙⠛⣦⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠉⠀⠀⠹⠏⠛⢿⣿⢦⣄⡀⠤⢤⣤⡀⠙⢠⡀⠈⠻⣦⣼⠇⠀⠀⠀⢸⡇⣿⠻⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⣇⠈⠉⢛⡟⠙⠃⠀⠘⣧⣀⣀⣈⣉⣀⠀⠀⠀⢠⡇⢸⣇⣈⢷⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠀⣠⣾⠃⠀⠀⠀⢰⡏⠁⠀⠀⠈⠙⢷⡄⠀⠈⠳⠞⠓⢮⡉⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⢀⣤⣴⣾⡿⠿⢿⣿⢿⣿⠟⠁⣀⣀⣠⡴⠋⠀⠀⠀⠀⠀⠀⠀⣷⠀⠀⠀⠀⠀⠀⠙⢻⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣰⣏⡿⠋⠁⢀⣠⢞⣡⠞⢁⣠⠞⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⡟⠀⠀⠀⠀⠀⣀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠠⢿⣿⠁⠀⢰⡿⠛⠋⢁⣴⠟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⣀⣀⡀⠀⣾⠉⠉⢻⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠘⠿⠞⠛⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⡾⠋⠀⠀⣯⠀⠉⣻⣯⡶⢲⡞⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣰⠞⠋⠀⠀⠀⠀⣸⠆⠠⣇⠀⠀⣾⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⠾⠋⠀⠀⠀⠀⠀⠀⠀⠈⠓⠢⣬⢻⣾⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⡴⠛⠁⠀⢀⣀⣀⢀⣀⠀⠀⠀⠀⠀⠀⣸⡿⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⠞⠉⠀⠀⠀⠀⠘⣇⠈⠉⠉⢳⡄⠀⠀⢀⡼⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠟⠁⠀⠀⠀⠀⠀⠀⣠⠾⢀⡾⢳⡀⢳⣄⡴⠛⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⣴⠟⠁⠀⠀⠀⠀⠀⠀⠀⠰⡏⠀⢿⡀⠈⣧⡾⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⣾⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠹⢦⣀⣿⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣠⣶⠶⢶⣶⡶⠦⣄⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣼⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⠟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣤⠶⠛⠉⠀⠙⠦⣄⠈⣹⡄⠀⠉⡽⠶⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢠⡟⠀⠀⠀⠀⠀⠀⢀⡖⠒⢦⣤⣰⡟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣤⠟⢳⣄⠀⠀⠀⠀⠀⣿⠀⠛⠛⠢⠞⠁⢀⣘⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⢀⣼⡇⢸⡖⣾⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣼⢯⣀⣸⠃⠀⠀⠀⠀⠀⣿⣠⠴⢦⣄⣀⡼⠋⠀⠘⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢸⠃⠀⠀⠀⠀⢠⠟⢿⣿⣩⣴⢿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣾⣁⠴⠟⠉⢳⡄⠀⠀⠀⣀⣈⠀⠀⠀⠈⠁⠀⠀⠀⠀⣿⠀⠀⠀⠀⠰⣶⣶⢤⣄⠀⠀⠀
⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠈⢧⣀⡭⠤⣿⢈⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⣟⠉⢁⡴⠒⠒⠚⢁⣤⠞⠋⠉⠉⠛⠳⣄⠀⠀⠀⣤⠖⢒⣿⠀⠀⠀⠀⠀⠀⠈⢧⡈⢳⡄⠀
⠀⠀⠀⠀⠀⢹⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠉⠙⣧⣄⠀⠀⠀⠀⢀⣠⡾⠋⠈⠉⠁⠀⠀⠀⣰⠟⠀⠀⠀⠀⠀⠀⠀⠈⢷⠀⠀⣸⣦⣿⡏⠀⠀⠀⠀⠀⠀⠀⠈⣷⠀⢿⡀
⠀⠀⠀⠀⠀⢸⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⡦⢸⡇⢹⡙⠓⣶⠚⠋⣿⠀⠀⠀⠀⠀⠀⠀⣼⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⢤⠟⢁⣛⡾⠁⠀⠀⠀⠀⠀⠀⠀⠀⣼⢳⠈⣧
⠀⠀⠀⠀⠀⠀⢿⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⢣⠀⣱⠀⣸⠀⣠⠟⠀⠀⠀⠀⠀⠀⣼⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠈⠉⠉⢹⡇⠀⠀⠀⠀⠀⠀⠀⠀⢠⣏⣘⣧⣿
⠀⠀⠀⠀⠀⠀⠈⢿⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣞⠀⣿⣋⠁⣸⠃⠀⠀⠀⠀⠀⠀⣴⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢿⡄⠀⢀⠼⣧⡀⠀⠀⠀⠀⠀⠀⣠⠟⠁⠉⢀⡏
⠀⠀⠀⠀⠀⠀⠀⠀⠹⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠁⠀⠀⠀⠀⠀⢀⣴⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⣷⡀⠘⠒⠚⠻⣶⣤⣤⡤⠶⣿⠁⠀⠀⢀⡿⠁
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⢶⣄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡞⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣄⠀⠀⠀⣧⡙⢻⡶⠚⠁⠀⢀⡴⠟⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠲⢤⣤⣤⣀⣀⣀⣀⣀⣤⣤⠴⠛⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠛⠶⢤⣤⣿⣾⣥⣤⠶⠛⠋⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⠲⣶⠒⠲⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣧⡀⠀⠀⠀⠀⠀⢰⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠀⠀⠀⠀⣠⠞⠂⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⢾⣿⠀⠀⢸⢻⠚⠀⠀⠀⢘⠀⠻⠀⠀⠀⣰⣧⣷⡄⠘⡶⠉⠁⠐⡆⠻⠂⠆⠒⠀⠎⣷⠀⠀⠀⠀⠀⠀⠀⠀⠀⡎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⢀⡀⠀⠀⡀⠀⠀⣹⣿⠤⠖⢁⠼⠀⠀⠠⠤⠠⠤⠯⡄⢀⡴⣃⠀⠀⠸⠒⠃⢰⡇⠠⠅⠘⠀⡇⠠⠆⠩⠿⠄⢿⠗⠦⠚⠀⢾⠟⠈⢧⡴⠄⠀⠠⣤⠀⠀⠀⠀⠀
"""
print(random_ascii_art_from_the_internet + "\n\n\n")
for idx, title in enumerate(titles_to_display):
print(f"{idx}: {title}\n")
Full Sample Code
```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

# Configuring Browser
#---------------------------------------------------------
chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_experimental_option("detach", True)
chrome_options.add_argument("--log-level=3")
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(options=chrome_options)
browser.set_window_size(1080, 960)

# FILL IN THE BLANKS
#---------------------------------------------------------
def find_subreddit(subreddit):
    """Game Plan:
        - Navigates to Reddit
        - Searches for the subreddit
        - Clicks on link to subreddit

    Args:
        subreddit (str): the subreddit to be visited
    """
    # Navigate to reddit
    reddit_url = "https://www.reddit.com"
    browser.get(reddit_url)

    # Search for subreddit using searchbar
    searchbar = browser.find_element(By.NAME, "q")
    searchbar.send_keys(subreddit)
    searchbar.click()
    searchbar.send_keys(Keys.RETURN)

    # Click subreddit link
    time.sleep(1)
    subreddit_link = browser.find_element(By.CLASS_NAME, "_1Nla8vW02K39sy0E826Iug")
    subreddit_link.click()

def get_titles():
    """Game Plan:
        - Choose how you want to find the title elements
            - e.g by class name, tag name, xpath, etc
        - Use browser.find_elements(.........)
        - Convert each element in the list into text

    Returns:
        titles (list): a list of titles of posts found in the subreddit
    """
    titles = []

    # Get titles in raw format
    raw_titles = browser.find_elements(By.CLASS_NAME, "_eYtD2XCVieq6emjKBH3m")

    # Convert titles (which are of type 'WebElement') into their text
    for title in raw_titles:
        titles.append(title.text)
    return titles

def display(titles_to_display):
    """Game Plan:
        - Display your results in a cute format <3

    Args:
        titles_to_display (list): the titles to be displayed cutely
    """
    titles_to_display = set(titles_to_display) # getting rid of duplicates!!
    random_ascii_art_from_the_internet = """
⠀⣰⡶⣶⣶⡶⠤⠶⢿⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣿⣿⣿⢻⣧⣀⠀⠀⣿⣿⣿⣏⠷⣦⣀⡀⠀⠀⠀⣀⣀⣀⣄⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⢿⣿⠙⠻⣿⣿⢶⣄⠙⠻⠟⠋⠀⠀⠈⣙⣿⠛⠛⢻⣹⣥⣿⣫⠼⠋⠙⠛⣦⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠉⠀⠀⠹⠏⠛⢿⣿⢦⣄⡀⠤⢤⣤⡀⠙⢠⡀⠈⠻⣦⣼⠇⠀⠀⠀⢸⡇⣿⠻⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⣇⠈⠉⢛⡟⠙⠃⠀⠘⣧⣀⣀⣈⣉⣀⠀⠀⠀⢠⡇⢸⣇⣈⢷⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠀⣠⣾⠃⠀⠀⠀⢰⡏⠁⠀⠀⠈⠙⢷⡄⠀⠈⠳⠞⠓⢮⡉⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⢀⣤⣴⣾⡿⠿⢿⣿⢿⣿⠟⠁⣀⣀⣠⡴⠋⠀⠀⠀⠀⠀⠀⠀⣷⠀⠀⠀⠀⠀⠀⠙⢻⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣰⣏⡿⠋⠁⢀⣠⢞⣡⠞⢁⣠⠞⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⡟⠀⠀⠀⠀⠀⣀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠠⢿⣿⠁⠀⢰⡿⠛⠋⢁⣴⠟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⣀⣀⡀⠀⣾⠉⠉⢻⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠘⠿⠞⠛⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⡾⠋⠀⠀⣯⠀⠉⣻⣯⡶⢲⡞⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣰⠞⠋⠀⠀⠀⠀⣸⠆⠠⣇⠀⠀⣾⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⠾⠋⠀⠀⠀⠀⠀⠀⠀⠈⠓⠢⣬⢻⣾⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⡴⠛⠁⠀⢀⣀⣀⢀⣀⠀⠀⠀⠀⠀⠀⣸⡿⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⠞⠉⠀⠀⠀⠀⠘⣇⠈⠉⠉⢳⡄⠀⠀⢀⡼⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠟⠁⠀⠀⠀⠀⠀⠀⣠⠾⢀⡾⢳⡀⢳⣄⡴⠛⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⣴⠟⠁⠀⠀⠀⠀⠀⠀⠀⠰⡏⠀⢿⡀⠈⣧⡾⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⣾⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠹⢦⣀⣿⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣠⣶⠶⢶⣶⡶⠦⣄⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣼⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⠟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣤⠶⠛⠉⠀⠙⠦⣄⠈⣹⡄⠀⠉⡽⠶⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢠⡟⠀⠀⠀⠀⠀⠀⢀⡖⠒⢦⣤⣰⡟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣤⠟⢳⣄⠀⠀⠀⠀⠀⣿⠀⠛⠛⠢⠞⠁⢀⣘⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⢀⣼⡇⢸⡖⣾⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣼⢯⣀⣸⠃⠀⠀⠀⠀⠀⣿⣠⠴⢦⣄⣀⡼⠋⠀⠘⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢸⠃⠀⠀⠀⠀⢠⠟⢿⣿⣩⣴⢿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣾⣁⠴⠟⠉⢳⡄⠀⠀⠀⣀⣈⠀⠀⠀⠈⠁⠀⠀⠀⠀⣿⠀⠀⠀⠀⠰⣶⣶⢤⣄⠀⠀⠀
⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠈⢧⣀⡭⠤⣿⢈⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⣟⠉⢁⡴⠒⠒⠚⢁⣤⠞⠋⠉⠉⠛⠳⣄⠀⠀⠀⣤⠖⢒⣿⠀⠀⠀⠀⠀⠀⠈⢧⡈⢳⡄⠀
⠀⠀⠀⠀⠀⢹⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠉⠙⣧⣄⠀⠀⠀⠀⢀⣠⡾⠋⠈⠉⠁⠀⠀⠀⣰⠟⠀⠀⠀⠀⠀⠀⠀⠈⢷⠀⠀⣸⣦⣿⡏⠀⠀⠀⠀⠀⠀⠀⠈⣷⠀⢿⡀
⠀⠀⠀⠀⠀⢸⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⡦⢸⡇⢹⡙⠓⣶⠚⠋⣿⠀⠀⠀⠀⠀⠀⠀⣼⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⢤⠟⢁⣛⡾⠁⠀⠀⠀⠀⠀⠀⠀⠀⣼⢳⠈⣧
⠀⠀⠀⠀⠀⠀⢿⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⢣⠀⣱⠀⣸⠀⣠⠟⠀⠀⠀⠀⠀⠀⣼⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠈⠉⠉⢹⡇⠀⠀⠀⠀⠀⠀⠀⠀⢠⣏⣘⣧⣿
⠀⠀⠀⠀⠀⠀⠈⢿⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣞⠀⣿⣋⠁⣸⠃⠀⠀⠀⠀⠀⠀⣴⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢿⡄⠀⢀⠼⣧⡀⠀⠀⠀⠀⠀⠀⣠⠟⠁⠉⢀⡏
⠀⠀⠀⠀⠀⠀⠀⠀⠹⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠁⠀⠀⠀⠀⠀⢀⣴⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⣷⡀⠘⠒⠚⠻⣶⣤⣤⡤⠶⣿⠁⠀⠀⢀⡿⠁
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⢶⣄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡞⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣄⠀⠀⠀⣧⡙⢻⡶⠚⠁⠀⢀⡴⠟⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠲⢤⣤⣤⣀⣀⣀⣀⣀⣤⣤⠴⠛⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠛⠶⢤⣤⣿⣾⣥⣤⠶⠛⠋⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⠲⣶⠒⠲⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣧⡀⠀⠀⠀⠀⠀⢰⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠀⠀⠀⠀⣠⠞⠂⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⢾⣿⠀⠀⢸⢻⠚⠀⠀⠀⢘⠀⠻⠀⠀⠀⣰⣧⣷⡄⠘⡶⠉⠁⠐⡆⠻⠂⠆⠒⠀⠎⣷⠀⠀⠀⠀⠀⠀⠀⠀⠀⡎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⢀⡀⠀⠀⡀⠀⠀⣹⣿⠤⠖⢁⠼⠀⠀⠠⠤⠠⠤⠯⡄⢀⡴⣃⠀⠀⠸⠒⠃⢰⡇⠠⠅⠘⠀⡇⠠⠆⠩⠿⠄⢿⠗⠦⠚⠀⢾⠟⠈⢧⡴⠄⠀⠠⣤⠀⠀⠀⠀⠀
"""
print(random_ascii_art_from_the_internet + "\n\n\n")
for idx, title in enumerate(titles_to_display):
print(f"{idx}: {title}\n")
# RUN IT
#---------------------------------------------------------
def run(subreddit):
"""Puts it all together!
N.B. Since our browser is a global variable we're not concerned
about having to pass it around function to function
Args:
subreddit (str): the subreddit we wish to scrape
"""
find_subreddit(subreddit)
time.sleep(2)
subreddit_titles = get_titles()
display(subreddit_titles)
# Uncomment when you're ready. Peer pressure is lame, so no rush <3
run("Beans")
Okay, so who actually waits till the very end to start running their code?? Don't be afraid to run your code even before it's fully functional, just to see what's going on.
```bash
python scraper.py # or python3 scraper.py
```
Your final product should (fingers crossed) look a bit like this:
🤔Which method of finding an element should I use?
id -> name -> class name -> xpath
Great question! There's a hierarchy of element identification that we typically follow when trying to locate an element, and an element's id takes the number one spot. Wherever possible, try to use an element's id as it is unique to it and only it! In cases where that is not possible, next try name, then class name, and only if nothing else works should you go to xpath.
The great thing about xpath is that it will almost always work... as long as the elements on the webpage do not move (i.e. they remain static). This makes it especially helpful for older, less responsive websites. However, many modern websites move their elements around, whether for responsiveness or sometimes even to fight back against bots! This is not to say that you should never use xpath, just that you should use it with due caution😅
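To make that concrete, here's a sketch of the difference (both selectors are made up for illustration):

```python
# Brittle: an absolute xpath breaks the moment the page layout shifts
button = browser.find_element(By.XPATH, "/html/body/div[1]/div[3]/div/button")

# Sturdier: a relative xpath anchored on an attribute you expect to stay put
button = browser.find_element(By.XPATH, "//button[@aria-label='Search']")
```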
🤔Why do I keep getting Chromedriver-related errors?
Trust me, I have been there and done that. This is usually because:
- You have not stored your chromedriver in the same folder as your code
- You accidentally downloaded the wrong chromedriver version
- Between now and the last time you ran your code a couple of days/weeks/months ago, your Google Chrome updated itself and your chromedriver version no longer matches...
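If keeping versions in sync gets old, one workaround is the third-party webdriver-manager package (`pip install webdriver-manager`), which fetches a matching driver for you. A minimal sketch:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Downloads (and caches) a chromedriver that matches your installed Chrome
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
```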
🤔Why can I not get past typing into the searchbar?
Hint, hint, NUDGE NUDGE NUDGE
```python
# searchbar.click() # sometimes you need this👀
```
🤔Why don't I see anything when my program is running in the terminal?
Did you uncomment `chrome_options.add_argument("--headless")` in your driver configurations👀? Tsk, tsk!
🤔Why do I keep getting "No Such Element Found" exceptions?
Trust me, this will not be the last time you come across these bad boys! There are typically two reasons why this happens:

1. The element you are trying to find has a super complex/weirdly formatted id, class name, name, etc. In this case, definitely try XPATH.
2. Your code is going faster than your browser!
   - What this means is that sometimes your code tries to move on to the next step (e.g. finding an element) when your browser hasn't even finished carrying out its current task (e.g. loading the page).
   - Here is when you can throw in a quick `time.sleep(1)` to make your code wait 1 second before trying to continue. Or, if you are up for the challenge, try using implicit or explicit wait times (see the sketch below).
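Here's a minimal sketch of an explicit wait, reusing the subreddit-link lookup from earlier:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Polls for up to 10 seconds until the element exists, then returns it;
# raises a TimeoutException if it never shows up
subreddit_link = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "_1Nla8vW02K39sy0E826Iug"))
)
subreddit_link.click()
```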
🤔What does "DevTools listening on ...yada yada yada..." mean?
That your chromedriver (aka browser) is up and running! We love to see it😏
If you are actually reading this section, you are a nerd and I deeply appreciate you for it ❤️.
Now, what's the fun of scraping all these titles if they're just printed into the terminal and then... nothing! What if you want to do something else with them outside your program? Or track changes over time? Or do something else fun? The answer is simple: store them in a file! I, personally, am a fan of a good ole' `.db` file.
N.B. - you can also choose to add the titles to the database as soon as they are found, instead of adding them all at the end. Your choice! Both come with pros and cons you can ask me about😝
Here is how you can create a db file to host all the titles you've found:
```python
import sqlite3 # add this to the imports
....
....

def add_to_database(db_file_name, titles):
    """
    Creates a new table in a .db file, if one doesn't already exist, to hold
    the information found in the subreddit.

    Args:
        db_file_name (str): the name of the database file to open
        titles (list of str): the list of titles to add to the database
    """
    # create/open the database file
    conn = sqlite3.connect(db_file_name)
    cursor = conn.cursor()
    createTable = """CREATE TABLE IF NOT EXISTS
        srinfo(id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)"""
    cursor.execute(createTable)

    # add to database
    for title in titles:
        cursor.execute("INSERT INTO srinfo (title) VALUES (?)", (title,))
    conn.commit()
    conn.close()
```
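And to prove the titles actually made it in, here's a quick way to read them back out (a sketch; `"beans.db"` stands in for whatever file name you passed to `add_to_database`):

```python
import sqlite3

conn = sqlite3.connect("beans.db")
for row_id, title in conn.execute("SELECT id, title FROM srinfo"):
    print(f"{row_id}: {title}")
conn.close()
```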
- Python Selenium Documentation
- 84 Popular Sites on the Internet that you can scrape
- Figuring out whether or not you can scrape a site
I have a pip package you can download if you're interested in doing more subreddit scraping without all the code! To install:
```bash
pip install sreddit
```
For usage and documentation, you can check out my source code😊.
Footnotes

[^1]: Source: https://chromedriver.chromium.org/
[^2]: I say 'all' very loosely. What I really mean is all the ones on the page you see before dynamic rendering kicks in and makes it a pain to scrape! So you'll probably get around 10 titles.