## Cookiecutter

We are in this course going to use the tool [cookiecutter](https://cookiecutter.readthedocs.io/en/latest/README.html),
which is a tool for creating projects from *project templates*. A project template is, in short, just an overall
structure of how you want your folders, files, etc. to be organized from the beginning. For this course we are going to
be using a custom [MLOps template](https://github.com/SkafteNicki/mlops_template). The template is essentially a fork
of the [cookiecutter data science template](https://github.com/drivendata/cookiecutter-data-science) that has been used
for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.

We are not going to argue that this template is better than every other template; we simply want to highlight that it
is a **standardized** way of creating project structures for machine learning projects. By standardized we mean that if
two people are both using `cookiecutter` with the same template, the layout of their code follows the same specific
rules, making it faster for one to understand the other person's code. Code organization is therefore not only about
making the code easier for you to maintain but also for others to read and understand.

Below is shown the default code structure of cookiecutter for data science projects.

<figure markdown>
![Image](../figures/cookie_cutter.png){ width="1000" }
<figcaption> <a href="https://github.com/drivendata/cookiecutter-data-science"> Image credit </a> </figcaption>
</figure>

What is important to keep in mind when using a template is that it is exactly that: a template. By definition a
template is a *guide* for making something. Therefore, not all parts of a template may be important for your project at
hand. Your job is to pick the parts from the template that are useful for organizing your machine learning project and
add the parts that are missing.

## Python projects

While the same template could in principle be used regardless of what language we are using for our machine learning or
data science application, there are certain considerations to take into account based on the language. Python is
currently the dominant language for machine learning and data science, which is why this section focuses on some of the
special files you will need for your Python projects.

The first file you may or may not know is the `__init__.py` file. In Python, the `__init__.py` file is used to mark a
directory as a Python package. Therefore, as a bare minimum, any Python package should look something like this:

```txt
├── src
│   ├── __init__.py
│   ├── file1.py
│   ├── file2.py
├── pyproject.toml
```
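
As a sketch of what the `__init__.py` marker actually does (the file and function names below are made up for
illustration), we can recreate the layout above on the fly and see that Python then treats `src` as an importable
package:

```python
import importlib
import pathlib
import sys
import tempfile

# recreate the layout from the tree above in a temporary directory
root = pathlib.Path(tempfile.mkdtemp())
pkg = root / "src"
pkg.mkdir()
(pkg / "__init__.py").write_text("")  # empty file, but it marks `src` as a package
(pkg / "file1.py").write_text("def hello():\n    return 'hello from file1'\n")

# with the marker in place, `src.file1` can be imported like any other package
sys.path.insert(0, str(root))
file1 = importlib.import_module("src.file1")
print(file1.hello())  # → hello from file1
```

The `__init__.py` file does not have to be empty; it is also a common place to re-export the parts of the package you
consider its public API.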

The second file to focus on is `pyproject.toml`. This file is important for actually converting your code into a Python
project. Essentially, whenever you run `pip install`, `pip` is in charge of both downloading the package you want and
*installing* it. For `pip` to be able to install a package it needs instructions on what parts of the code it should
install and how to install it. This is the job of the `pyproject.toml` file.

Below we describe the structure of the `pyproject.toml` file, but also of `setup.py + setup.cfg`, which is the "old"
way of providing build instructions for a Python project. You may still encounter a lot of projects using
`setup.py + setup.cfg`, so it is good to at least know about them.

=== "pyproject.toml"

    ```toml
    [build-system]
    requires = ["setuptools", "wheel"]
    build-backend = "setuptools.build_meta"

    [project]
    name = "my-package-name"
    version = "0.1.0"
    authors = [{name = "EM", email = "[email protected]"}]
    description = "Something cool here."
    requires-python = ">=3.8"
    dynamic = ["dependencies"]

    [tool.setuptools.dynamic]
    dependencies = {file = ["requirements.txt"]}
    ```

    The `[build-system]` section informs `pip`/`python` that to build this Python project it needs the two packages
    `setuptools` and `wheel`, and that it should call the
    [setuptools.build_meta](https://setuptools.pypa.io/en/latest/build_meta.html) backend to actually build the
    project. The `[project]` section essentially contains metadata about the package, what it is called, etc., in case
    we ever want to publish it to [PyPI](https://pypi.org/).

    For specifying the dependencies of your project you have two options. Either you specify them in a
    `requirements.txt` file and add it as a dynamic field in `pyproject.toml` as shown above, or you can add a
    `dependencies` field under the `[project]` header like this:

    ```toml
    [project]
    dependencies = [
        'torch==2.1.0',
        'matplotlib>=3.8.1'
    ]
    ```

    The improvement over `setup.py + setup.cfg` is that `pyproject.toml` also allows metadata from other tools to be
    specified in it, essentially making sure you only need a single configuration file for your project. For example,
    in the next [module M7 on good coding practices] you will learn about the tool `ruff` and how it can help format
    your code. If we want to configure `ruff` for our project we can do that directly in `pyproject.toml` by adding
    additional headers:

    ```toml
    [tool.ruff]
    ruff_option = ...
    ```

    To read more about how to specify `pyproject.toml`, this
    [page](https://packaging.python.org/en/latest/specifications/declaring-project-metadata/#declaring-project-metadata)
    is a good place to start.

=== "setup.py + setup.cfg"

    `setup.py` is the original way of describing how a Python package should be built. The most basic `setup.py` file
    will look like this:

    ```python
    from setuptools import setup

    # read the requirements file directly; pip's internal `pip.req.parse_requirements`
    # helper was removed from pip's API long ago and should not be used
    with open("requirements.txt") as f:
        requirements = [line.strip() for line in f if line.strip() and not line.startswith("#")]

    setup(
        name="my-package-name",
        version="0.1.0",
        author="EM",
        description="Something cool here.",
        # ...
        install_requires=requirements,
    )
    ```

    Essentially, it is the exact same meta information as in `pyproject.toml`, just written directly in Python syntax
    instead of `toml`. Because there was a wish to separate this meta information from the code, the `setup.cfg` file
    was created, which can contain the exact same information as `setup.py`, just as declarative config:

    ```ini
    [metadata]
    name = my-package-name
    version = 0.1.0
    author = EM
    description = Something cool here.
    # ...
    ```

    This non-standardized way of providing meta information about a package was essentially what led to the creation
    of `pyproject.toml`.

Regardless of which way a project is configured, after creating the above files the correct way to install the package
is the same:

```bash
pip install .
pip install -e . # (1)!
```

1. Installing the package with the `-e` (editable/developer) flag means that we do not have to run `pip install` every
    time we make a change. Essentially, in developer mode changes in the Python source code can immediately take place
    without requiring a new installation.

After running this, your code should be available to import as `from src import ...` like any other Python package you
use. This is the most essential knowledge you need about creating Python packages.

## ❔ Exercises

After having installed cookiecutter (exercises 1 and 2), the remaining exercises are focused on taking the simple CNN
MNIST classifier from yesterday's exercise and forcing it into this structure. You are not required to fill out

etc...

in this way paths (for saving and loading files) are always relative to the root.

1. Install the [cookiecutter](https://cookiecutter.readthedocs.io/en/stable/) framework:

    ```bash
    # install using the terminal
    pip install cookiecutter
    ```

2. Start a new project using [this template](https://github.com/SkafteNicki/mlops_template), which is specialized for
    this course (1).
    { .annotate }

    1. If you feel like the template can be improved in some way, feel free to either open an issue with the proposed
        improvement or directly send a pull request to the repository 😄.

3. After having created your new project, the first step is to create a corresponding virtual environment and install
    any needed requirements. If you have a virtual environment from yesterday, feel free to use that; otherwise,
    create a new one. Then install the project in that environment:

    ```bash
    pip install -r requirements.txt
    pip install -e .
    ```

4. Start by filling out the `src/data/make_dataset.py` file. When this file runs, it should take the raw data files in
    `data/raw` (the files that we have provided), process them into a single tensor, normalize the tensor and save
    this intermediate representation to the `data/processed` folder. By normalization here we refer to making sure the
    images have mean 0 and standard deviation 1.
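
    A minimal sketch of the normalization step could look like the following (the shapes and the random stand-in data
    are assumptions; in the exercise you would instead load the provided files from `data/raw` and save the result to
    `data/processed`):

    ```python
    import torch

    # stand-in for the loaded raw MNIST images: 100 grayscale images of 28x28 pixels
    images = torch.rand(100, 1, 28, 28)

    # normalize so the full tensor has mean 0 and standard deviation 1
    normalized = (images - images.mean()) / images.std()

    # in make_dataset.py you would then save the result, e.g.:
    # torch.save(normalized, "data/processed/train_images.pt")
    ```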

5. This template comes with a `Makefile` that can be used to easily define common operations in a project. You do not
    have to understand the complete file, but try taking a look at it. In particular, the following commands may come
    in handy:

    ```bash
    make data # runs the make_dataset.py file, try it!
    ```

    Note that the `make` command is not available by default on Windows. There are two ways of getting it running on
    Windows. The first is leveraging the
    [linux subsystem](https://docs.microsoft.com/en-us/windows/wsl/install-win10) for Windows, which you maybe have
    already installed. The second option is utilizing the [chocolatey](https://chocolatey.org/) package manager, which
    enables Windows users to install packages similar to a Linux system.

    In general we recommend that you add commands to the `Makefile` as you move along in the course. If you want to
    know more about how to write `Makefile`s then this
    [video](https://youtu.be/F6DZdvbRZQQ?si=9qg-XUva-l-9Tl21) is an excellent resource.

6. Put your model file (`model.py`) into the `src/models` folder and insert the relevant code from the `main.py` file
    into the `train_model.py` file. Make sure that whenever a model is trained, it gets saved to the `models` folder
    (preferably in sub-folders).

7. When you run `train_model.py`, make sure that some statistics/visualizations from the trained models get saved to
    the `reports/figures/` folder. This could be a simple `.png` of the training curve.

8. (Optional) Can you figure out a way to add a `train` command to the `Makefile` such that training can be started
    using:

    ```bash
    make train
    ```

9. Fill out the newly created `src/models/predict_model.py` file, such that it takes a pre-trained model file and
    creates predictions for some data. The recommended interface is that users can give this file either a folder with
    raw images that get loaded in, or a `numpy` or `pickle` file with already loaded images, e.g. something like this:

    ```bash
    python src/models/predict_model.py \
        models/my_trained_model.pt \
        data/example_images.npy # file containing just 10 images for prediction
    ```

10. Fill out the file `src/visualization/visualize.py` with the following (as a minimum, feel free to add more
    visualizations):

    * Loads a pre-trained network.
    * Extracts some intermediate representation of the data (your training set) from your CNN. This could be the
        features just before the final classification layer.
    * Visualizes the features in a 2D space using some form of dimensionality reduction.
    * Saves the visualization to a file in the `reports/figures/` folder.
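
    As a sketch of the feature-extraction step (the CNN below is a made-up stand-in for your own model), one option is
    to chop the final classification layer off the model:

    ```python
    import torch
    from torch import nn

    # hypothetical CNN standing in for your own MNIST classifier
    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(8 * 28 * 28, 64),
        nn.ReLU(),                 # output here = features just before the final layer
        nn.Linear(64, 10),         # final classification layer
    )

    # drop the last layer so the forward pass returns the 64-dim features instead of logits
    feature_extractor = nn.Sequential(*list(model.children())[:-1])

    images = torch.rand(16, 1, 28, 28)  # stand-in for your training images
    with torch.no_grad():
        features = feature_extractor(images)
    print(features.shape)  # → torch.Size([16, 64])
    ```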

11. (Optional) Feel free to create more files/visualizations (what about investigating/exploring the data
    distribution?).

12. Make sure to update the `README.md` file with a short description of how your scripts should be run.

13. Finally, make sure to update the `requirements.txt` file with any packages that are necessary for running your
    code (see [this set of exercises](../s1_development_environment/package_manager.md) for help).

## 🧠 Knowledge check

??? question "Knowledge question 1"

    If tensor `a` has shape `[N, d]` and tensor `b` has shape `[M, d]`, how can we calculate the pairwise distance
    between rows in `a` and `b` without using a for loop?

    ??? success "Solution"

        We can take advantage of [broadcasting](https://pytorch.org/docs/stable/notes/broadcasting.html) to do this.
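
        A sketch of the broadcasting trick (using small example shapes): inserting singleton dimensions makes the
        subtraction broadcast to shape `[N, M, d]`, after which we reduce over the last dimension:

        ```python
        import torch

        N, M, d = 5, 7, 3
        a = torch.randn(N, d)
        b = torch.randn(M, d)

        # [N, 1, d] - [1, M, d] broadcasts to [N, M, d]
        diff = a.unsqueeze(1) - b.unsqueeze(0)
        dist = diff.pow(2).sum(dim=-1).sqrt()  # [N, M] pairwise Euclidean distances
        ```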

That ends the module on code structure and `cookiecutter`. We again want to stress that the point of using
`cookiecutter` is not about following one specific template, but rather about using any template for organizing your
code. What often happens in a team is that multiple templates are needed in different stages of the development phase
or for different product types, because they share a common structure while still having some specifics. Keeping
templates up-to-date then becomes critical so that no team member is using an outdated template. If you ever end up in
this situation, we highly recommend checking out [cruft](https://github.com/cruft/cruft), which works alongside
`cookiecutter` to not only create projects but also update existing ones as the template evolves. Cruft additionally
has template-validation capabilities to ensure projects match the latest version of a template.