Updates and small improvements #6

Merged · 9 commits · Mar 12, 2024
3 changes: 2 additions & 1 deletion .amlignore
@@ -6,4 +6,5 @@ notebooks/
docs/
.pytest_cache/
.github/

logs/
*.log
3 changes: 2 additions & 1 deletion .gitignore
@@ -4,4 +4,5 @@ test/
LIDC-IDRI/
.vscode/
__pycache__/
.env
.env
logs/
47 changes: 39 additions & 8 deletions README.md
@@ -1,19 +1,50 @@
# Lung Cancer Detection

## Table of Contents
- [About](#about)
- [Usage](#usage)
- [License](#license)
## Table of Contents
1. [About](#about)
2. [Project Structure](#project-structure)
3. [Usage](#usage)
4. [License](#license)

## About
Lung Cancer Detection is a project made as part of Engineers Thesis *"Applications of artificial intellingence in oncology on computer tomography dataset"* by **Jakub Owczarek**, under the guidance of Thesis Advisor dr. hab. inz **Mariusz Mlynarczuk** prof. AGH.
Lung Cancer Detection is a project made as part of the Engineer's Thesis *"Applications of artificial intelligence in oncology on computer tomography dataset"* by **Jakub Owczarek**, under the guidance of Thesis Advisor dr hab. inż. **Mariusz Mlynarczuk**, prof. AGH.
<br>

The goal of this projet is to process the [LIDC-IDRI](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254) dataset and measure the performence of deep learning models pre-trained on Image Net by using transfer learning methods.
The goal of this project is to process the [LIDC-IDRI](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254) dataset and evaluate the performance of deep learning models pre-trained on ImageNet using transfer learning.
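
For context, the transfer-learning setup referred to here follows the usual Keras pattern: an ImageNet-trained backbone is reused as a frozen feature extractor and a small classification head is trained on top. A minimal sketch, assuming a MobileNetV2 backbone and a binary target (the input size, head layers, and exact label semantics are illustrative, not the project's fixed configuration):

```python
import tensorflow as tf

# Illustrative transfer-learning setup: frozen ImageNet backbone + new classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # keep the ImageNet features fixed, train only the head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output (assumed target)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC()],
)
```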

## Usage
## Project Structure
This repository contains the following directories:

TODO: Fill in how to use this project locally and on Azure ML
- *docs* - contains Markdown files with more detailed descriptions of the project components
- *notebooks* - contains Jupyter notebooks used for experiments, analysis, visualizations, etc.
- *scripts* - this directory is the actual workhorse and contains two notable subdirectories:

- *azure* - contains scripts for Azure Virtual Machine and Azure Machine Learning
- *local* - contains scripts that were used for local development

- *src* - contains the main components of the project (a short sketch of how these pieces fit together follows this list):

- *azure* - contains utilities specific to Azure services
- *dataset* - contains `DatasetLoader` component used to feed data during model training
- *model* - contains model builder and director classes
- *preprocessing* - contains classes used for LIDC-IDRI dataset preprocessing
- *config.py* - constants and configuration values used throughout the project

- *tests* - contains a small set of tests for the project components
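
A rough sketch of how the `src` components fit together, mirroring the way they are used in `scripts/azure/machine_learning/fine_tune.py` added in this PR (the dataset path and the metric below are illustrative placeholders):

```python
import tensorflow as tf

from src.model.director import ModelDirector
from src.dataset.dataset_loader import DatasetLoader
from src.config import BUILDERS, RANDOM_SEED

# A builder describes one architecture; the director assembles it into a Keras model.
builder = BUILDERS["mobilenet"]()
model = ModelDirector(builder).make()
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC()],
)

# The DatasetLoader feeds preprocessed data during training.
loader = DatasetLoader("LIDC-IDRI/processed/train")  # illustrative path
loader.set_seed(RANDOM_SEED)
model.fit(loader.get_dataset(), epochs=1)
```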

## Usage
This project was created with Azure in mind, so the main scripts are intended to be run on Azure.

![usage_img](docs/assets/usage.png)

### 1. Preprocessing
1. The first step is to download the LIDC-IDRI dataset onto an Azure Virtual Machine. The `azure/virtual_machine/download_dataset.sh` script is meant for this task.
2. Next, the dataset is preprocessed into a format suitable for supervised deep learning model training with the `azure/virtual_machine/process_dataset.py` script. The same directory also contains `train_test_split.py`, which should be used to split the processed data into training and test sets (a rough sketch of this split follows this list).
3. Finally, the preprocessed dataset can be uploaded to Azure Blob Storage with the `upload_dataset_2.sh` script. There is also an `upload_dataset.sh` script, but it does not use the `azcopy` utility and is too slow.
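
The exact interface of `train_test_split.py` is not shown in this diff; the following is only a minimal sketch of the patient-level split it is assumed to perform (directory layout, fraction, and function name are hypothetical):

```python
import random
import shutil
from pathlib import Path


def split_dataset(processed_dir: str, test_fraction: float = 0.2, seed: int = 42) -> None:
    """Move a random fraction of processed patient directories into a test/ subdirectory."""
    root = Path(processed_dir)
    patients = sorted(p for p in root.iterdir() if p.is_dir())

    random.seed(seed)
    random.shuffle(patients)

    n_test = int(len(patients) * test_fraction)
    subsets = {"test": patients[:n_test], "train": patients[n_test:]}

    for subset, items in subsets.items():
        subset_dir = root / subset
        subset_dir.mkdir(exist_ok=True)
        for patient in items:
            shutil.move(str(patient), str(subset_dir / patient.name))


split_dataset("LIDC-IDRI/processed", test_fraction=0.2)
```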

### 2. Model training
1. With the preprocessed dataset on Azure Blob Storage, the Virtual Machine is no longer necessary. An Azure Machine Learning data asset can be created from this dataset and used during model training.
2. The actual model training is started with the `run_training_job.py` script under `scripts/azure/machine_learning`. This script creates a job on AML that builds, compiles, and trains the desired model (a rough sketch of such a job definition is shown below).
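
For reference, a training job of this kind is typically submitted with the Azure ML Python SDK v2 roughly as follows. This is only an assumed sketch of what `run_training_job.py` sets up; the workspace identifiers, data asset names, environment, and compute target are placeholders, and the entry script shown is the `fine_tune.py` added in this PR:

```python
from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",     # placeholder
    resource_group_name="<resource-group>",  # placeholder
    workspace_name="<workspace>",            # placeholder
)

# Define a command job that runs the training script against the registered data assets.
job = command(
    code="scripts/azure/machine_learning",
    command=(
        "python fine_tune.py --model mobilenet "
        "--train ${{inputs.train}} --test ${{inputs.test}} "
        "--epochs 10 --batch_size 64 --job_name lung-cancer-training"
    ),
    inputs={
        "train": Input(type=AssetTypes.URI_FOLDER, path="azureml:lidc-idri-train:1"),  # assumed asset name
        "test": Input(type=AssetTypes.URI_FOLDER, path="azureml:lidc-idri-test:1"),    # assumed asset name
    },
    environment="azureml:tensorflow-env@latest",  # placeholder environment
    compute="gpu-cluster",                        # placeholder compute target
    display_name="lung-cancer-detection-training",
)

ml_client.jobs.create_or_update(job)
```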

## License
This project is licensed under the MIT License - see the LICENSE.md file for details
Binary file added docs/assets/usage.png
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
342 changes: 342 additions & 0 deletions notebooks/results.ipynb

Large diffs are not rendered by default.

notebooks/segmentation.ipynb
@@ -9,7 +9,7 @@
},
{
"cell_type": "code",
"execution_count": 148,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -28,9 +28,22 @@
},
{
"cell_type": "code",
"execution_count": 150,
"execution_count": 4,
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "PermissionError",
"evalue": "[Errno 13] Permission denied: '/home/student/Repositories/lung-cancer-detection/LIDC-IDRI/CT/test/LIDC-IDRI-0001/01-01-2000-NA-NA-30178/3000566.000000-NA-03192/1-040.dcm'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mPermissionError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m/home/jakub/Repositories/lung-cancer-detection/notebooks/segmentation.ipynb Cell 4\u001b[0m line \u001b[0;36m2\n\u001b[1;32m <a href='vscode-notebook-cell:/home/jakub/Repositories/lung-cancer-detection/notebooks/segmentation.ipynb#W3sZmlsZQ%3D%3D?line=0'>1</a>\u001b[0m dicom_path \u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39m/home/student/Repositories/lung-cancer-detection/LIDC-IDRI/CT/test/LIDC-IDRI-0001/01-01-2000-NA-NA-30178/3000566.000000-NA-03192/1-040.dcm\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m----> <a href='vscode-notebook-cell:/home/jakub/Repositories/lung-cancer-detection/notebooks/segmentation.ipynb#W3sZmlsZQ%3D%3D?line=1'>2</a>\u001b[0m dcm \u001b[39m=\u001b[39m pydicom\u001b[39m.\u001b[39mdcmread(dicom_path)\n",
"File \u001b[0;32m~/.conda/envs/cancer/lib/python3.11/site-packages/pydicom/filereader.py:1002\u001b[0m, in \u001b[0;36mdcmread\u001b[0;34m(fp, defer_size, stop_before_pixels, force, specific_tags)\u001b[0m\n\u001b[1;32m 1000\u001b[0m caller_owns_file \u001b[39m=\u001b[39m \u001b[39mFalse\u001b[39;00m\n\u001b[1;32m 1001\u001b[0m logger\u001b[39m.\u001b[39mdebug(\u001b[39m\"\u001b[39m\u001b[39mReading file \u001b[39m\u001b[39m'\u001b[39m\u001b[39m{0}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mformat(fp))\n\u001b[0;32m-> 1002\u001b[0m fp \u001b[39m=\u001b[39m \u001b[39mopen\u001b[39m(fp, \u001b[39m'\u001b[39m\u001b[39mrb\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[1;32m 1003\u001b[0m \u001b[39melif\u001b[39;00m fp \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m \u001b[39mor\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mhasattr\u001b[39m(fp, \u001b[39m\"\u001b[39m\u001b[39mread\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39mor\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mhasattr\u001b[39m(fp, \u001b[39m\"\u001b[39m\u001b[39mseek\u001b[39m\u001b[39m\"\u001b[39m):\n\u001b[1;32m 1004\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mTypeError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mdcmread: Expected a file path or a file-like, \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1005\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mbut got \u001b[39m\u001b[39m\"\u001b[39m \u001b[39m+\u001b[39m \u001b[39mtype\u001b[39m(fp)\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m)\n",
"\u001b[0;31mPermissionError\u001b[0m: [Errno 13] Permission denied: '/home/student/Repositories/lung-cancer-detection/LIDC-IDRI/CT/test/LIDC-IDRI-0001/01-01-2000-NA-NA-30178/3000566.000000-NA-03192/1-040.dcm'"
]
}
],
"source": [
"dicom_path = \"/home/student/Repositories/lung-cancer-detection/LIDC-IDRI/CT/test/LIDC-IDRI-0001/01-01-2000-NA-NA-30178/3000566.000000-NA-03192/1-040.dcm\"\n",
"dcm = pydicom.dcmread(dicom_path) "
@@ -45,7 +58,7 @@
},
{
"cell_type": "code",
"execution_count": 151,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -83,7 +96,7 @@
},
{
"cell_type": "code",
"execution_count": 200,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -112,7 +125,7 @@
},
{
"cell_type": "code",
"execution_count": 201,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -146,7 +159,7 @@
},
{
"cell_type": "code",
"execution_count": 202,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -162,7 +175,7 @@
},
{
"cell_type": "code",
"execution_count": 203,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -198,7 +211,7 @@
},
{
"cell_type": "code",
"execution_count": 204,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -235,7 +248,7 @@
},
{
"cell_type": "code",
"execution_count": 221,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -310,7 +323,7 @@
},
{
"cell_type": "code",
"execution_count": 222,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -348,7 +361,16 @@
},
{
"cell_type": "code",
"execution_count": 223,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"steps += [segmented_lungs = image * mask]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -363,7 +385,12 @@
}
],
"source": [
"fig, axes = plt.subplots(nrows=1, ncols=len(steps), figsize=(20, 15))\n",
"from itertools import chain\n",
"\n",
"\n",
"fig, axes = plt.subplots(nrows=2, ncols=len(steps) // 2, figsize=(20, 15))\n",
"\n",
"axes = list(chain.from_iterable(axes))\n",
"\n",
"for step, ax in zip(steps, axes):\n",
" ax.imshow(step, cmap=\"bone\")"
@@ -379,7 +406,7 @@
},
{
"cell_type": "code",
"execution_count": 224,
"execution_count": null,
"metadata": {},
"outputs": [
{
150 changes: 150 additions & 0 deletions scripts/azure/machine_learning/fine_tune.py
@@ -0,0 +1,150 @@
import os
import logging
from datetime import datetime

import click
import mlflow
import numpy as np
import tensorflow as tf
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

from src.model.director import ModelDirector
from src.dataset.dataset_loader import DatasetLoader
from src.config import (
    RANDOM_SEED,
    EARLY_STOPPING_CONFIG,
    REDUCE_LR_CONFIG,
    MODELS,
    BUILDERS,
    CALLBACKS,
    METRICS,
    config_logging,
)

config_logging()
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("azure")


def get_compiled_model(model, optimizer, loss):
    # Build the requested architecture through its builder and the ModelDirector.
    builder = BUILDERS[model]()

    director = ModelDirector(builder)
    model_nn = director.make()
    logger.info(f"Built model_nn with {str(builder)}")

    optimizer_cls = {
        "adam": tf.keras.optimizers.Adam,
        "sgd": tf.keras.optimizers.SGD,
    }[optimizer]()

    loss_cls = {
        "binary_crossentropy": tf.keras.losses.BinaryCrossentropy,
        "categorical_crossentropy": tf.keras.losses.CategoricalCrossentropy,
    }[loss]()

    metrics = [metric() for metric in METRICS]

    model_nn.compile(optimizer=optimizer_cls, loss=loss_cls, metrics=metrics, run_eagerly=False)
    logger.info("Compiled model")

    return model_nn


def get_compiled_distributed_model(model, optimizer, loss):
    # Variables must be created inside the strategy scope so they are mirrored across workers.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model_nn = get_compiled_model(model, optimizer, loss)

    return model_nn


@click.command()
@click.option(
    "--model", type=click.Choice(MODELS), default="mobilenet", help="Model to train"
)
@click.option(
    "--train", type=click.Path(exists=True), help="Path to the training dataset"
)
@click.option("--test", type=click.Path(exists=True), help="Path to the test dataset")
@click.option(
    "--optimizer",
    type=click.Choice(["adam", "sgd"]),
    default="adam",
    help="Optimizer to use",
)
@click.option(
    "--loss",
    type=click.Choice(["binary_crossentropy", "categorical_crossentropy"]),
    default="binary_crossentropy",
    help="Loss function to use",
)
@click.option("--epochs", type=click.INT, default=10, help="Number of epochs to train for")
@click.option("--batch_size", type=click.INT, default=64, help="Batch size for dataset loaders")
@click.option("--job_name", type=click.STRING, help="Azure Machine Learning job name")
@click.option("--distributed", is_flag=True, help="Use distributed strategy")
def run(model, train, test, optimizer, loss, epochs, batch_size, job_name, distributed):
    mlflow.set_experiment("lung-cancer-detection")
    mlflow_run = mlflow.start_run(run_name=f"train_{model}_{datetime.now().strftime('%Y%m%d%H%M%S')}")

    mlflow.log_param("optimizer", optimizer)
    mlflow.log_param("loss", loss)
    mlflow.log_param("epochs", epochs)
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("random_seed", RANDOM_SEED)

    logger.info(f"Started training run at {datetime.now()}")
    logger.info(
        f"Run parameters - optimizer: {optimizer}, loss: {loss}"
    )

    if not distributed:
        model_nn = get_compiled_model(model, optimizer, loss)
    else:
        model_nn = get_compiled_distributed_model(model, optimizer, loss)

    train_loader = DatasetLoader(train)
    test_loader = DatasetLoader(test)

    train_loader.set_seed(RANDOM_SEED)
    test_loader.set_seed(RANDOM_SEED)

    train_dataset = train_loader.get_dataset()
    test_dataset = test_loader.get_dataset()
    logger.info("Loaded train and test datasets")

    history = model_nn.fit(train_dataset, epochs=epochs, callbacks=CALLBACKS)
    logger.info("Trained model")

    for metric, values in history.history.items():
        for step, value in enumerate(values):
            mlflow.log_metric(f"{metric}", value, step=step)

    results = model_nn.evaluate(test_dataset, return_dict=True)
    logger.info("Evaluated model")

    for metric, value in results.items():
        mlflow.log_metric(f"Final {metric}", value)

    logger.info(f"Finished training at {datetime.now()}")

    try:
        mlflow.tensorflow.save_model(
            model=model_nn,
            path=os.path.join(job_name, model),
        )
    except TypeError as e:
        logger.error(f"Saving model raised an error:\n{e}")

    mlflow.tensorflow.log_model(
        model=model_nn,
        registered_model_name=model,
        artifact_path=model,
    )

    mlflow.end_run()


if __name__ == "__main__":
    run()  # pylint: disable=no-value-for-parameter