reddit crawler:

I'm crawling the Reddit website and want to store the scraped data in a database (PostgreSQL, maybe). In this project I also want to use anti-blocking scraping techniques: IP rotation, a real User-Agent, other request headers, random intervals between requests, and a Referer header (a quick sketch follows below).

P.S.: Docker is required.
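
A minimal sketch of those request-hardening techniques with requests might look like the following; the proxy addresses, header values, and delay range are placeholders, not the project's actual configuration:

```python
import random
import time

import requests

# Hypothetical proxy pool: replace with proxies you actually control.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]

# A real browser User-Agent plus a Referer makes requests look less bot-like.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://www.google.com/",
}

def fetch(url):
    proxy = random.choice(PROXIES)  # IP rotation: a random proxy per request
    response = requests.get(
        url,
        headers=HEADERS,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    time.sleep(random.uniform(2, 6))  # random interval between requests
    return response
```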

installation:

  1. clone the repo: `git clone https://github.com/cs-fedy/reddit-crawler`
  2. run `docker compose up -d` to start the database.
  3. install virtualenv using pip: `sudo pip install virtualenv`
  4. create a new virtualenv: `virtualenv venv`
  5. activate the virtualenv: `source venv/bin/activate`
  6. install the requirements: `pip install -r requirements.txt`
  7. run the script and enjoy: `python scraper.py`
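
Once the database container is up, connecting from Python with psycopg2 and python-dotenv could look roughly like this; the environment variable names and the posts table are illustrative assumptions, not necessarily what scraper.py actually uses:

```python
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # read key-value pairs from a .env file in the project root

# Hypothetical variable names: match them to your .env and docker-compose.yml.
conn = psycopg2.connect(
    host=os.getenv("DB_HOST", "localhost"),
    port=os.getenv("DB_PORT", "5432"),
    dbname=os.getenv("DB_NAME", "reddit"),
    user=os.getenv("DB_USER", "postgres"),
    password=os.getenv("DB_PASSWORD"),
)

# The connection context manager wraps the statements in a transaction.
with conn, conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS posts (
            id SERIAL PRIMARY KEY,
            title TEXT NOT NULL,
            url TEXT UNIQUE,
            scraped_at TIMESTAMP DEFAULT NOW()
        )
        """
    )
conn.close()
```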

used tools:

  1. selenium: primarily for automating web applications for testing purposes, but certainly not limited to that; boring web-based administration tasks can (and should) be automated as well.
  2. BeautifulSoup: a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree (see the sketch after this list).
  3. python-dotenv: adds .env support to your Django/Flask apps in development and deployment.
  4. psycopg2: Python-PostgreSQL database adapter.
  5. tabulate: pretty-print tabular data.
  6. requests: Python HTTP for Humans.
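
To show how requests and BeautifulSoup fit together (item 2 above), here is a small sketch that lists post titles from a subreddit; old.reddit.com and the CSS selectors are assumptions and may break whenever Reddit changes its markup:

```python
import requests
from bs4 import BeautifulSoup

# old.reddit.com serves mostly static HTML, which is easier to parse
# than the JavaScript-heavy redesign.
response = requests.get(
    "https://old.reddit.com/r/python/",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
soup = BeautifulSoup(response.text, "html.parser")

# Assumed selectors: each post sits in div.thing with its title in a.title.
for post in soup.select("div.thing"):
    title_tag = post.select_one("a.title")
    if title_tag:
        print(title_tag.get_text(strip=True), title_tag.get("href"))
```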

Scraping tips:

  1. Do not follow the same crawling pattern: incorporate some random clicks on the page, mouse movements, and other random actions that make the spider look like a human.
  2. Make requests through proxies and rotate them as needed: create a pool of IPs, pick a random one for each request, and spread your requests across multiple IPs (see: "How to send requests through a Proxy in Python 3 using Requests").
  3. Rotate User-Agents and the corresponding HTTP request headers between requests (see: "How to fake and rotate User Agents using Python 3").
  4. Use a headless browser like Pyppeteer, Selenium, or Playwright (a Selenium sketch follows this list).
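
For tips 1 and 4 together, a headless Selenium sketch with a few randomized mouse movements might look like this; the URL and the offsets are illustrative only:

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://old.reddit.com/r/python/")

    # Tip 1: a few random mouse movements and pauses to look less robotic.
    actions = ActionChains(driver)
    for _ in range(3):
        actions.move_by_offset(random.randint(5, 50), random.randint(5, 50))
        actions.pause(random.uniform(0.2, 1.0))
    actions.perform()

    time.sleep(random.uniform(2, 5))  # random interval before the next page
    print(driver.title)
finally:
    driver.quit()
```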

Author:

created at 🌙 with 💻 and ❤ by f0ody
