This repository contains a collection of Python scripts that scrape, clean, and process food recipe data. The processed data is then used to generate a SQL seed file for initializing a database.
The project is structured as follows:
- 01-recipe_tags.py: extracts tags from the raw recipe data.
- 02-clean_tags.py: cleans the extracted tags.
- 03-recipe_cleaning.py: cleans the raw recipe data.
- 04-scrape_ingredients.py: scrapes ingredient data from the web.
- 05-match-ingredient-prices.py: matches ingredient prices to the scraped ingredient data.
- 99-generate_sql_seed.py: generates a SQL seed file from the processed data.
The food-com-recipes/ directory contains the raw recipe data.
The data/ directory contains various CSV and YAML files that the scripts use as inputs and outputs.
The experiments/ directory contains experimental scripts that were used during the initial dataset exploration.
- Run the scripts in the order of their numbering.
- The final output is a SQL seed file (data/seed.sql) that can be used to initialize a database.
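
Running the scripts in numbered order can also be automated. The following is a minimal sketch, not part of the repository: it assumes the numbered scripts sit in the repository root, take no command-line arguments, and can be run with the current Python interpreter. The names `ordered_scripts` and `run_pipeline` are hypothetical helpers introduced for illustration.

```python
import re
import subprocess
import sys
from pathlib import Path

# Matches file names with a numeric prefix, e.g. 01-recipe_tags.py
NUMBERED_SCRIPT = re.compile(r"^(\d+)-.+\.py$")


def ordered_scripts(names):
    """Return the numbered script names sorted by their numeric prefix."""
    matched = []
    for name in names:
        m = NUMBERED_SCRIPT.match(name)
        if m:
            matched.append((int(m.group(1)), name))
    return [name for _, name in sorted(matched)]


def run_pipeline(repo_dir="."):
    """Run each numbered script in order, stopping at the first failure.

    Assumption: the scripts need no arguments and are run from repo_dir.
    """
    names = [p.name for p in Path(repo_dir).glob("*.py")]
    for name in ordered_scripts(names):
        print(f"Running {name}")
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run([sys.executable, name], cwd=repo_dir, check=True)
```

Calling `run_pipeline()` from the repository root would then execute 01 through 05 followed by 99. Stopping at the first failure avoids generating a seed file from incomplete intermediate data.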
The scripts in this repository depend on several Python libraries, including Polars, BeautifulSoup, and Pint. The required libraries can be installed from the provided requirements.txt file.
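
After installing (typically with `pip install -r requirements.txt`), a quick check can confirm that the libraries named above are importable. This is a small sketch under assumptions: the import names `polars`, `bs4`, and `pint` are the usual ones for these libraries but are not confirmed by this repository, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names: Polars -> polars, BeautifulSoup -> bs4, Pint -> pint
ASSUMED_IMPORT_NAMES = ("polars", "bs4", "pint")


def missing_modules(import_names=ASSUMED_IMPORT_NAMES):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]
```

An empty list from `missing_modules()` suggests the environment is ready. Any names returned point to packages that still need to be installed.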