A web scraping project that extracts multimedia from Startup Decks (https://startupdecks.co/decks/) and stores it in your local directory.
This web scraping project demonstrates how information such as text, multimedia, and hyperlinks can be extracted from a webpage, regardless of whether the page is static or dynamic.
StartupDecks is the leading resource for startup pitch decks. See what leading companies like Airbnb, Facebook and more did to raise their rounds. Find out more at https://startupdecks.co/
This project is made using Python. The Python libraries it uses are listed below:
- Web Scraping: BeautifulSoup 4, Selenium
- Driver Auto-Installation: chromedriver-autoinstaller
- HTML Parsing: html5lib
- URL Calling and Requests: urllib3, requests
- Encodings: webencodings, charset-normalizer
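As a rough illustration of how these libraries fit together, the sketch below parses an HTML string with BeautifulSoup and collects absolute image URLs. The function name extract_image_urls, the sample markup, and the example.com URLs are assumptions made for illustration; they are not taken from this project's code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url, parser="html5lib"):
    """Return absolute URLs for every <img> tag that has a src attribute.

    Hypothetical helper, not part of this repository.
    """
    soup = BeautifulSoup(html, parser)
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if src:  # skip <img> tags without a src attribute
            urls.append(urljoin(base_url, src))
    return urls


# Made-up markup standing in for a downloaded page.
sample_html = """
<html><body>
  <img src="/slides/one.png">
  <img src="https://example.com/slides/two.png">
  <img alt="no source">
</body></html>
"""

# "html.parser" ships with Python, so this call works without html5lib installed.
image_urls = extract_image_urls(
    sample_html, "https://example.com/decks/", parser="html.parser"
)
print(image_urls)
```

The same function could plausibly be fed the page_source of a Selenium driver when a page renders its content with JavaScript, or the text of a requests response when the page is static. Pass parser="html5lib" to use the parser listed above.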
Since the project is built entirely in Python, installation is simple. Follow the steps below:
- Open your Terminal.
- Change the current working directory to the location where you want the cloned directory.
- Run the command below to clone the project:
git clone https://github.com/hiferli/Startup-Scrapping.git
- After cloning finishes, use the following command to enter the project directory
cd Startup-Scrapping
- You can execute the code after this step. However, to run it with isolated dependencies, install virtualenv using the command below:
pip install virtualenv
- Following this, create a virtual environment using the code below
virtualenv Environment
- Run the command below to activate the environment (on Windows, the activate script is under Environment/Scripts instead of Environment/bin)
source Environment/bin/activate
- Install the dependencies
pip install -r requirements.txt
- You're all set to go now.
To run the program:
- Open Command Prompt in the same directory as the project
- Run the command below:
python main.py
By default, running the script returns slide captures for just 5 startups. To increase or decrease that number, follow these steps:
- Open the directory of your cloned repository.
- Go to the DirectoryVariable.py file.
- You will find the numberOfStartups variable.
- Change the value of the variable to any integer of your preference.
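How the limit is applied inside the script is not shown in this README. A minimal sketch, assuming the scraper collects a list of deck URLs and truncates it to numberOfStartups, might look like the following. The deck_urls list and the select_decks helper are hypothetical.

```python
# Value as it might appear in DirectoryVariable.py (5 is the default stated above).
numberOfStartups = 5


def select_decks(deck_urls, limit):
    """Return at most `limit` deck URLs, rejecting negative or non-integer limits.

    Hypothetical helper, not part of this repository.
    """
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
        raise ValueError("numberOfStartups must be a non-negative integer")
    return deck_urls[:limit]


# Made-up URLs used only to demonstrate the slicing.
deck_urls = [f"https://example.com/decks/startup-{i}" for i in range(1, 9)]

selected = select_decks(deck_urls, numberOfStartups)
print(len(selected))
```

A limit larger than the number of available decks simply returns all of them, since slicing past the end of a list is safe in Python.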
Special mention to @lizzzshan for helping me out in the logo color changes :)
Contributions are always welcome!
Please fork the project to your account and clone it locally. Create pull requests for any contributions. I'll be more than happy to add authors to the repository.
This is the first time I have used Selenium and BeautifulSoup4. To get a glimpse of how things work and how to approach web scraping, I referred to a few online tutorials.
To get a better understanding of the major tech stacks involved in this project, I created some small projects. These mini-projects are mentioned below:
For support, feedback, or any queries regarding the project, email [email protected]. I'll be more than happy to hear from you and to reflect on where I can improve the project.