This project demonstrates how to ingest data from NewsAPI using the Data Load Tool (DLT) library. It fetches top headlines, article search results, and news sources, then loads them into filesystem-based storage by default or into a DuckDB database in test mode.
- Fetch top headlines from NewsAPI
- Search for articles on specific topics
- Retrieve news sources information
- Load data into DuckDB using DLT
- Jupyter notebook for interactive data exploration
- Python 3.11.8
- Pipenv for dependency management
- Clone this repository:
      git clone <repository-url>
      cd dlt-data-dumper
- Install dependencies using Pipenv:
      pipenv install
- Activate the virtual environment:
      pipenv shell
- Set up your NewsAPI key:
  - Sign up for a free API key at https://newsapi.org/
  - Set the API key as an environment variable:
        export NEWS_API_KEY=your_api_key_here
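
For illustration, here is a minimal sketch of how a script could read the `NEWS_API_KEY` variable and attach it to a NewsAPI request. It is not taken from `newsapi_pipeline.py`: the helper names `get_api_key` and `build_top_headlines_request` and the `country="us"` default are assumptions made for this example. The endpoint URL and the `X-Api-Key` header follow the public NewsAPI v2 documentation.

```python
import os
import urllib.parse
import urllib.request

# Top-headlines endpoint from the public NewsAPI v2 documentation.
TOP_HEADLINES_URL = "https://newsapi.org/v2/top-headlines"


def get_api_key(env=os.environ):
    """Return the key stored in NEWS_API_KEY, or fail with a clear message.

    Hypothetical helper: the real pipeline may read the key differently.
    """
    key = env.get("NEWS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "NEWS_API_KEY is not set; export it before running the pipeline."
        )
    return key


def build_top_headlines_request(api_key, country="us"):
    """Build (but do not send) an authenticated top-headlines request.

    The country default is an assumption chosen for this sketch.
    """
    query = urllib.parse.urlencode({"country": country})
    return urllib.request.Request(
        f"{TOP_HEADLINES_URL}?{query}",
        headers={"X-Api-Key": api_key},  # NewsAPI also accepts an apiKey query parameter
    )
```

Passing the key in a header rather than in the query string keeps it out of logged URLs. Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch runs without network access.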
To run the data ingestion pipeline, use the newsapi_pipeline.py script. This script supports several command-line options for different execution modes:
- Normal mode:
      python newsapi_pipeline.py
- Test mode (uses DuckDB instead of filesystem):
      python newsapi_pipeline.py --test
- Full refresh (replaces existing data instead of appending):
      python newsapi_pipeline.py --full-refresh
- Custom log level:
      python newsapi_pipeline.py --log-level DEBUG
You can also combine these options as needed. For example:
    python newsapi_pipeline.py --test --full-refresh --log-level DEBUG
The script fetches data from NewsAPI and loads it into either filesystem-based storage (the default) or a DuckDB database (in test mode).
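
The options above could map onto run settings roughly as follows. This is a sketch under stated assumptions, not the script's actual code: the function names `parse_args` and `resolve_run_settings`, the `INFO` default log level, and the list of accepted log levels are hypothetical. The strings `"duckdb"`, `"filesystem"`, `"replace"`, and `"append"` are the destination and write-disposition names that DLT uses, which a real pipeline would pass on to `dlt.pipeline(...)` and `pipeline.run(...)`.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse the three documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="NewsAPI ingestion pipeline (sketch)")
    parser.add_argument("--test", action="store_true",
                        help="load into DuckDB instead of the filesystem")
    parser.add_argument("--full-refresh", action="store_true",
                        help="replace existing data instead of appending")
    parser.add_argument("--log-level", default="INFO",  # default is an assumption
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
    return parser.parse_args(argv)


def resolve_run_settings(args):
    """Translate the parsed flags into destination, write mode, and log level."""
    return {
        # Test mode switches the destination from filesystem to DuckDB.
        "destination": "duckdb" if args.test else "filesystem",
        # Full refresh replaces existing data; otherwise new rows are appended.
        "write_disposition": "replace" if args.full_refresh else "append",
        # Convert the level name into the numeric constant used by logging.
        "log_level": getattr(logging, args.log_level),
    }
```

For example, `resolve_run_settings(parse_args(["--test", "--full-refresh"]))` selects the `"duckdb"` destination with the `"replace"` write disposition, while calling it with no flags selects `"filesystem"` and `"append"`.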
- Start Jupyter Notebook:
      jupyter notebook
- Open the eda-newsapi.ipynb notebook in your browser.
- Run the cells to fetch data and perform exploratory data analysis.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open-source and available under the MIT License.