This template was built after reading the Medium article by khuyetran1401. Forking the author's repo would have been much simpler, but I preferred to build it myself to understand each component. It is designed to be quick and easy to use.
For 'industrial' or more 'business' projects, I still prefer tools like Kedro.
✅ Automatically build the repository structure for personal DS projects
✅ Create and build an environment using conda
🔲 Run tests automatically
🔲 Manage configuration variables for data pipelines and projects
✅ Enforce type hints and code quality
🔲 Automatically document code
🔲 Automate Code
✅ DVC for Data Management and Experiment Management
- Automate setup of the DVC repo and .gitignore
- Conda: Package, dependency and environment management
- pre-commit: framework for managing and maintaining multi-language pre-commit hooks
.
├── config                  # Project configuration files
│   └── environment.yml     # Environment file for conda
├── data                    # Local project data (not committed to version control)
│   ├── 01_raw              # Raw immutable data
│   ├── 02_primary          # Domain model data
│   ├── 03_feature          # Model features
│   ├── 04_model_input      # Often called 'master tables'
│   ├── 05_model_output     # Data generated by model runs
│   └── 06_reporting        # Ad hoc descriptive cuts
├── docs                    # Project documentation
├── models                  # Trained models
├── notebooks               # Project-related Jupyter notebooks (for experimental code before it moves to src)
├── README.md               # Project README
└── src                     # Project source code
    └── main.py
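The data layers in the tree above could also be recreated by hand, for example when adding the same layout to an existing project. The following is a minimal sketch, not the template's own code: the function name `create_data_layers` is hypothetical, and only the folder names are taken from the structure above.

```python
from pathlib import Path

# Folder names copied from the project structure above.
DATA_LAYERS = [
    "01_raw",
    "02_primary",
    "03_feature",
    "04_model_input",
    "05_model_output",
    "06_reporting",
]


def create_data_layers(project_root):
    """Create data/<layer> folders under project_root and return their paths."""
    created = []
    for layer in DATA_LAYERS:
        layer_dir = Path(project_root) / "data" / layer
        # exist_ok=True makes the call safe to repeat on an existing project.
        layer_dir.mkdir(parents=True, exist_ok=True)
        created.append(layer_dir)
    return created
```

Calling `create_data_layers(".")` from a project root would create the six folders under `data` and leave any that already exist untouched.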
Install Cookiecutter:
pip install cookiecutter
Create a project based on the template:
cookiecutter https://github.com/radema/datascience-personal-templates
Activate the new environment:
conda activate {{cookiecutter.environment_name}}
Run the setup from a terminal:
cd {{cookiecutter.repository-name}}; make setup
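The project can probably also be generated from a Python script instead of the command line, using the `cookiecutter()` function from Cookiecutter's Python API. This is only a sketch under assumptions: the helper names `build_kwargs` and `generate_project` are hypothetical, and it assumes the template accepts `environment_name` as a context variable, as the activation command above suggests.

```python
TEMPLATE_URL = "https://github.com/radema/datascience-personal-templates"


def build_kwargs(environment_name):
    """Assemble keyword arguments for a non-interactive cookiecutter() call."""
    return {
        # Use the template defaults instead of prompting for each value.
        "no_input": True,
        # Assumption: the template exposes an `environment_name` variable.
        "extra_context": {"environment_name": environment_name},
    }


def generate_project(environment_name):
    """Render the template into the current working directory."""
    # Imported here so this module loads even without cookiecutter installed.
    from cookiecutter.main import cookiecutter

    return cookiecutter(TEMPLATE_URL, **build_kwargs(environment_name))
```

For example, `generate_project("my_env")` would render the template with a conda environment named `my_env` (a placeholder value); the activation and `make setup` steps above still have to be run afterwards.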