The system takes a list of product URLs as input. It then fetches each page and parses the specified product data. Once the crawling process completes, it exports the collected data in various file formats. Finally, it can notify users by email, with the gathered data either attached to the message or embedded in its body.
- Beautiful Soup: A library for parsing HTML and extracting data from web pages.
- Requests: A library for making HTTP requests.
- asyncio: The core library for asynchronous programming.
- pandas: A library for data analysis and manipulation.
- tabulate: A library for creating tables.
- smtplib: A built-in library for sending emails using the Simple Mail Transfer Protocol.
- re: A built-in library for regular expressions.
- logging: A built-in library for flexible event logging.
- openpyxl: A library for reading and writing Excel files.
The project includes the following main components:
- src/crawler.py: Contains the Crawler class for asynchronously fetching web pages.
- src/parser.py: Includes URLParser classes for extracting product details from web pages.
- src/storage.py: Contains the StorageExporter class for storing obtained product information.
- src/email_sender.py: Includes the EmailSender class for handling email sending operations.
- product.py: Contains the Product class representing a product with its details.
- logs/crawler.log: Log file containing information about the crawling process.
How to run this project:
Example Usage:
# Clone the repository
git clone https://github.com/your-username/your-web-scraper.git
# Navigate to the project directory
cd your-web-scraper
# Create a virtual environment
python -m venv venv
# Activate the virtual environment (Windows)
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the scraper
python main.py
The project implements a concurrent approach: the product URLs are collected into a list and the corresponding pages are fetched simultaneously rather than one by one.
crawler.py handles erroneous URLs and request failures robustly, so a single bad URL does not derail the crawl.
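The sketch below illustrates this pattern; it is not the project's actual crawler.py code, and the function names, timeout, and example URLs are placeholders. It runs each blocking Requests call in a worker thread and gathers the downloads concurrently with asyncio, logging failures instead of aborting the run.

```python
import asyncio
import logging
from typing import Optional

import requests

logger = logging.getLogger(__name__)


async def fetch_page(url: str) -> Optional[str]:
    """Fetch one page; return its HTML, or None if the request fails."""
    try:
        # requests is blocking, so run it in a worker thread to keep the event loop free.
        response = await asyncio.to_thread(requests.get, url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        logger.error("Failed to fetch %s: %s", url, exc)
        return None


async def fetch_all(urls: list) -> list:
    """Download all product pages concurrently; failed URLs yield None."""
    return await asyncio.gather(*(fetch_page(url) for url in urls))


if __name__ == "__main__":
    pages = asyncio.run(fetch_all([
        "https://www.trendyol.com/example-product-p-123",
        "https://www.n11.com/urun/example-product",
    ]))
```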
The project maintains a log file (crawler.log) to record regular and consistent log information throughout the code.
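As an illustration only, such a log file could be configured once at startup along these lines (the format string and level here are assumptions, not necessarily the project's exact settings):

```python
import logging

# Write all INFO-and-above events to the shared crawler log file.
logging.basicConfig(
    filename="logs/crawler.log",
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
```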
Errors are handled gracefully, displaying error messages while allowing the project to continue running.
Data can be exported in JSON, XLSX, and CSV formats to ensure flexibility in data sharing and analysis.
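A minimal sketch of such an export step using pandas is shown below; the real interface lives in the StorageExporter class in src/storage.py, so the standalone function and file names here are purely illustrative.

```python
import pandas as pd


def export_products(products: list, basename: str = "products") -> None:
    """Write the collected product records to JSON, CSV, and XLSX files."""
    frame = pd.DataFrame(products)
    frame.to_json(f"{basename}.json", orient="records", force_ascii=False)
    frame.to_csv(f"{basename}.csv", index=False)
    frame.to_excel(f"{basename}.xlsx", index=False)  # .xlsx output relies on openpyxl


export_products([{"name": "Example Phone", "price": 9999.0, "url": "https://www.n11.com/urun/example"}])
```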
Two distinct classes for N11 and Trendyol have been created, enabling data parsing according to the respective retailer's structure.
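Structurally, the per-retailer parsing might look roughly like the sketch below. The base-class name URLParser comes from the project description, but the subclass names, selectors, and returned fields are illustrative assumptions rather than the real src/parser.py code.

```python
from bs4 import BeautifulSoup


class URLParser:
    """Base class: each retailer subclass knows its own HTML structure."""

    def parse(self, html: str) -> dict:
        raise NotImplementedError


class N11Parser(URLParser):
    def parse(self, html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        # Placeholder selectors; the real ones depend on N11's current markup.
        name = soup.select_one("h1.proName")
        price = soup.select_one("ins")
        return {"name": name.get_text(strip=True) if name else None,
                "price": price.get_text(strip=True) if price else None}


class TrendyolParser(URLParser):
    def parse(self, html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        # Placeholder selectors for Trendyol product pages.
        name = soup.select_one("h1.pr-new-br")
        price = soup.select_one("span.prc-dsc")
        return {"name": name.get_text(strip=True) if name else None,
                "price": price.get_text(strip=True) if price else None}
```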
The StorageExporter class drives these export operations, keeping the choice of output format flexible.
The email_sender.py module sends a notification email to the email address configured in the code once the data retrieval process is complete.
Exported files are attached to the notification email using the email_sender.py module.
The send_email method in email_sender.py has been updated, and a new method _create_html_table has been added to present the collected data as a well-formatted HTML table in the email body.
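A condensed, hypothetical sketch of that flow is shown below; the real send_email and _create_html_table methods in email_sender.py will differ, and the SMTP host, addresses, and file name are placeholders (a real setup also needs authentication).

```python
import smtplib
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

from tabulate import tabulate


def send_report(products: list, recipient: str, attachment_path: str) -> None:
    """Email the collected data as an HTML table, with the exported file attached."""
    message = MIMEMultipart()
    message["Subject"] = "Scraping completed"
    message["From"] = "scraper@example.com"  # placeholder sender address
    message["To"] = recipient

    # Render the product records as an HTML table for the email body.
    html_table = tabulate(products, headers="keys", tablefmt="html")
    message.attach(MIMEText(f"<html><body>{html_table}</body></html>", "html"))

    # Attach the exported data file.
    with open(attachment_path, "rb") as handle:
        part = MIMEApplication(handle.read(), Name=attachment_path)
    part["Content-Disposition"] = f'attachment; filename="{attachment_path}"'
    message.attach(part)

    # Placeholder SMTP settings; real use needs the actual host, port, and login.
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.send_message(message)
```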
Support is not limited to N11 and Trendyol: additional retailers can be accommodated by creating a parser class for each retailer and parsing data according to its structure.
Data export operations support JSON, CSV, and XLSX, providing a wider range of choices for file formats.
Prior to the web crawling process, the program includes a validation step to confirm that input URLs adhere to the standard HTTP or HTTPS format, preventing potential errors during execution.
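For instance, such a check could be implemented with the re module; the pattern below is a simple illustrative version, not necessarily the one the project uses.

```python
import re

# Accept only well-formed http:// or https:// URLs.
URL_PATTERN = re.compile(r"^https?://[^\s/$.?#].[^\s]*$", re.IGNORECASE)


def is_valid_url(url: str) -> bool:
    return bool(URL_PATTERN.match(url))


assert is_valid_url("https://www.trendyol.com/example-product-p-123")
assert not is_valid_url("ftp://example.com/file")
```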
- Architectural Decisions: The project is organized into modules, including Crawler, URLParser, StorageExporter, and EmailSender. Each module has distinct responsibilities, promoting code readability and maintainability.
- Pythonic Usages: Leveraging Python's features and standard libraries, the code follows Pythonic conventions. Asynchronous programming is implemented using the asyncio module.
- Performance: Asynchronous programming enhances performance by allowing the scraper to download and process web pages concurrently. The asyncio module is employed for efficient asynchronous operations.
- Manageability and Configurability: The code is designed to be extendable and configurable. Different URLParser classes can be added to extract data from various e-commerce websites. The Crawler class handles page downloading and processing, and the StorageExporter classes manage data export operations.
- Asynchronous Programming Best Practices: The asyncio module is used for asynchronous programming, and await expressions are appropriately utilized within asynchronous functions.
- Alignment with Requirements: The project meets the specified requirements, extracting product information from different e-commerce websites asynchronously using asyncio.
- Technical Documentation: The code includes comments and docstrings providing technical documentation. Each module and class has explanations for methods and functionalities, enhancing code understandability.