Updates and small improvements #6

Merged · 9 commits · Mar 12, 2024
3 changes: 2 additions & 1 deletion .amlignore
@@ -6,4 +6,5 @@ notebooks/
docs/
.pytest_cache/
.github/

logs/
*.log
3 changes: 2 additions & 1 deletion .gitignore
@@ -4,4 +4,5 @@ test/
LIDC-IDRI/
.vscode/
__pycache__/
.env
.env
logs/
47 changes: 39 additions & 8 deletions README.md
@@ -1,19 +1,50 @@
# Lung Cancer Detection

## Table of Contents
- [About](#about)
- [Usage](#usage)
- [License](#license)
## Table of Contents
1. [About](#about)
2. [Project Structure](#project-structure)
3. [Usage](#usage)
4. [License](#license)

## About
Lung Cancer Detection is a project made as part of Engineers Thesis *"Applications of artificial intellingence in oncology on computer tomography dataset"* by **Jakub Owczarek**, under the guidance of Thesis Advisor dr. hab. inz **Mariusz Mlynarczuk** prof. AGH.
Lung Cancer Detection is a project made as part of the Engineer's Thesis *"Applications of artificial intelligence in oncology on computer tomography dataset"* by **Jakub Owczarek**, under the guidance of Thesis Advisor dr hab. inż. **Mariusz Mlynarczuk**, prof. AGH.
<br>

The goal of this projet is to process the [LIDC-IDRI](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254) dataset and measure the performence of deep learning models pre-trained on Image Net by using transfer learning methods.
The goal of this project is to process the [LIDC-IDRI](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254) dataset and evaluate the performance of deep learning models pre-trained on ImageNet using transfer learning.
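
For context, the transfer-learning setup referred to here follows the usual Keras pattern: an ImageNet-trained backbone is reused as a frozen feature extractor and a small classification head is trained on top. A minimal sketch, assuming a MobileNetV2 backbone and a binary target (the input size, head layers, and exact label semantics are illustrative, not the project's fixed configuration):

```python
import tensorflow as tf

# Illustrative transfer-learning setup: frozen ImageNet backbone + new classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # keep the ImageNet features fixed, train only the head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output (assumed target)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC()],
)
```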

## Usage
## Project Structure
This repository contains the following directories:

TODO: Fill in how to use this project locally and on Azure ML
- *docs* - contains Markdown files with more detailed descriptions of the project components
- *notebooks* - contains Jupyter notebooks used for experiments, analysis, visualizations, etc.
- *scripts* - this directory is the actual workhorse and contains two notable subdirectories:

- *azure* - contains scripts for Azure Virtual Machine and Azure Machine Learning
- *local* - contains scripts that were used for local development

- *src* - contains the main components of the project (a short sketch of how these pieces fit together follows this list):

- *azure* - contains utilities specific to Azure services
- *dataset* - contains `DatasetLoader` component used to feed data during model training
- *model* - contains model builder and director classes
- *preprocessing* - contains classes used for LIDC-IDRI dataset preprocessing
- *config.py* - constants and configuration values used throughout the project

- *tests* - contains a small set of tests for the project components
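
A rough sketch of how the `src` components fit together, mirroring the way they are used in `scripts/azure/machine_learning/fine_tune.py` added in this PR (the dataset path and the metric below are illustrative placeholders):

```python
import tensorflow as tf

from src.model.director import ModelDirector
from src.dataset.dataset_loader import DatasetLoader
from src.config import BUILDERS, RANDOM_SEED

# A builder describes one architecture; the director assembles it into a Keras model.
builder = BUILDERS["mobilenet"]()
model = ModelDirector(builder).make()
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC()],
)

# The DatasetLoader feeds preprocessed data during training.
loader = DatasetLoader("LIDC-IDRI/processed/train")  # illustrative path
loader.set_seed(RANDOM_SEED)
model.fit(loader.get_dataset(), epochs=1)
```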

## Usage
This project was created with Azure in mind, so the main scripts are intended to be run on Azure.

![usage_img](docs/assets/usage.png)

### 1. Preprocessing
1. The first step is to download the LIDC-IDRI dataset onto an Azure Virtual Machine. The `azure/virtual_machine/download_dataset.sh` script is meant for this task.
2. Next, the dataset is preprocessed into a format suitable for supervised deep learning model training with the `azure/virtual_machine/process_dataset.py` script. The same directory also contains `train_test_split.py`, which should be used to split the processed data into training and test sets (a rough sketch of this split follows this list).
3. Finally, the preprocessed dataset can be uploaded to Azure Blob Storage with the `upload_dataset_2.sh` script. There is also an `upload_dataset.sh` script, but it does not use the `azcopy` utility and is too slow.
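
The exact interface of `train_test_split.py` is not shown in this diff; the following is only a minimal sketch of the patient-level split it is assumed to perform (directory layout, fraction, and function name are hypothetical):

```python
import random
import shutil
from pathlib import Path


def split_dataset(processed_dir: str, test_fraction: float = 0.2, seed: int = 42) -> None:
    """Move a random fraction of processed patient directories into a test/ subdirectory."""
    root = Path(processed_dir)
    patients = sorted(p for p in root.iterdir() if p.is_dir())

    random.seed(seed)
    random.shuffle(patients)

    n_test = int(len(patients) * test_fraction)
    subsets = {"test": patients[:n_test], "train": patients[n_test:]}

    for subset, items in subsets.items():
        subset_dir = root / subset
        subset_dir.mkdir(exist_ok=True)
        for patient in items:
            shutil.move(str(patient), str(subset_dir / patient.name))


split_dataset("LIDC-IDRI/processed", test_fraction=0.2)
```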

### 2. Model training
1. With the preprocessed dataset on Azure Blob Storage, the Virtual Machine is no longer necessary. An Azure Machine Learning data asset can be created from this dataset and used during model training.
2. The actual model training is started with the `run_training_job.py` script under `scripts/azure/machine_learning`. This script creates a job on AML that builds, compiles, and trains the desired model (a rough sketch of such a job definition is shown below).
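
For reference, a training job of this kind is typically submitted with the Azure ML Python SDK v2 roughly as follows. This is only an assumed sketch of what `run_training_job.py` sets up; the workspace identifiers, data asset names, environment, and compute target are placeholders, and the entry script shown is the `fine_tune.py` added in this PR:

```python
from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",     # placeholder
    resource_group_name="<resource-group>",  # placeholder
    workspace_name="<workspace>",            # placeholder
)

# Define a command job that runs the training script against the registered data assets.
job = command(
    code="scripts/azure/machine_learning",
    command=(
        "python fine_tune.py --model mobilenet "
        "--train ${{inputs.train}} --test ${{inputs.test}} "
        "--epochs 10 --batch_size 64 --job_name lung-cancer-training"
    ),
    inputs={
        "train": Input(type=AssetTypes.URI_FOLDER, path="azureml:lidc-idri-train:1"),  # assumed asset name
        "test": Input(type=AssetTypes.URI_FOLDER, path="azureml:lidc-idri-test:1"),    # assumed asset name
    },
    environment="azureml:tensorflow-env@latest",  # placeholder environment
    compute="gpu-cluster",                        # placeholder compute target
    display_name="lung-cancer-detection-training",
)

ml_client.jobs.create_or_update(job)
```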

## License
This project is licensed under the MIT License - see the LICENSE.md file for details
Binary file added docs/assets/usage.png
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
342 changes: 342 additions & 0 deletions notebooks/results.ipynb

Large diffs are not rendered by default.

notebooks/segmentation.ipynb
@@ -9,7 +9,7 @@
},
{
"cell_type": "code",
"execution_count": 148,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -28,9 +28,22 @@
},
{
"cell_type": "code",
"execution_count": 150,
"execution_count": 4,
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "PermissionError",
"evalue": "[Errno 13] Permission denied: '/home/student/Repositories/lung-cancer-detection/LIDC-IDRI/CT/test/LIDC-IDRI-0001/01-01-2000-NA-NA-30178/3000566.000000-NA-03192/1-040.dcm'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mPermissionError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m/home/jakub/Repositories/lung-cancer-detection/notebooks/segmentation.ipynb Cell 4\u001b[0m line \u001b[0;36m2\n\u001b[1;32m <a href='vscode-notebook-cell:/home/jakub/Repositories/lung-cancer-detection/notebooks/segmentation.ipynb#W3sZmlsZQ%3D%3D?line=0'>1</a>\u001b[0m dicom_path \u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39m/home/student/Repositories/lung-cancer-detection/LIDC-IDRI/CT/test/LIDC-IDRI-0001/01-01-2000-NA-NA-30178/3000566.000000-NA-03192/1-040.dcm\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m----> <a href='vscode-notebook-cell:/home/jakub/Repositories/lung-cancer-detection/notebooks/segmentation.ipynb#W3sZmlsZQ%3D%3D?line=1'>2</a>\u001b[0m dcm \u001b[39m=\u001b[39m pydicom\u001b[39m.\u001b[39mdcmread(dicom_path)\n",
"File \u001b[0;32m~/.conda/envs/cancer/lib/python3.11/site-packages/pydicom/filereader.py:1002\u001b[0m, in \u001b[0;36mdcmread\u001b[0;34m(fp, defer_size, stop_before_pixels, force, specific_tags)\u001b[0m\n\u001b[1;32m 1000\u001b[0m caller_owns_file \u001b[39m=\u001b[39m \u001b[39mFalse\u001b[39;00m\n\u001b[1;32m 1001\u001b[0m logger\u001b[39m.\u001b[39mdebug(\u001b[39m\"\u001b[39m\u001b[39mReading file \u001b[39m\u001b[39m'\u001b[39m\u001b[39m{0}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mformat(fp))\n\u001b[0;32m-> 1002\u001b[0m fp \u001b[39m=\u001b[39m \u001b[39mopen\u001b[39m(fp, \u001b[39m'\u001b[39m\u001b[39mrb\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[1;32m 1003\u001b[0m \u001b[39melif\u001b[39;00m fp \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m \u001b[39mor\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mhasattr\u001b[39m(fp, \u001b[39m\"\u001b[39m\u001b[39mread\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39mor\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mhasattr\u001b[39m(fp, \u001b[39m\"\u001b[39m\u001b[39mseek\u001b[39m\u001b[39m\"\u001b[39m):\n\u001b[1;32m 1004\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mTypeError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mdcmread: Expected a file path or a file-like, \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1005\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mbut got \u001b[39m\u001b[39m\"\u001b[39m \u001b[39m+\u001b[39m \u001b[39mtype\u001b[39m(fp)\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m)\n",
"\u001b[0;31mPermissionError\u001b[0m: [Errno 13] Permission denied: '/home/student/Repositories/lung-cancer-detection/LIDC-IDRI/CT/test/LIDC-IDRI-0001/01-01-2000-NA-NA-30178/3000566.000000-NA-03192/1-040.dcm'"
]
}
],
"source": [
"dicom_path = \"/home/student/Repositories/lung-cancer-detection/LIDC-IDRI/CT/test/LIDC-IDRI-0001/01-01-2000-NA-NA-30178/3000566.000000-NA-03192/1-040.dcm\"\n",
"dcm = pydicom.dcmread(dicom_path) "
@@ -45,7 +58,7 @@
},
{
"cell_type": "code",
"execution_count": 151,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -83,7 +96,7 @@
},
{
"cell_type": "code",
"execution_count": 200,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -112,7 +125,7 @@
},
{
"cell_type": "code",
"execution_count": 201,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -146,7 +159,7 @@
},
{
"cell_type": "code",
"execution_count": 202,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -162,7 +175,7 @@
},
{
"cell_type": "code",
"execution_count": 203,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -198,7 +211,7 @@
},
{
"cell_type": "code",
"execution_count": 204,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -235,7 +248,7 @@
},
{
"cell_type": "code",
"execution_count": 221,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -310,7 +323,7 @@
},
{
"cell_type": "code",
"execution_count": 222,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -348,7 +361,16 @@
},
{
"cell_type": "code",
"execution_count": 223,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"steps += [segmented_lungs = image * mask]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -363,7 +385,12 @@
}
],
"source": [
"fig, axes = plt.subplots(nrows=1, ncols=len(steps), figsize=(20, 15))\n",
"from itertools import chain\n",
"\n",
"\n",
"fig, axes = plt.subplots(nrows=2, ncols=len(steps) // 2, figsize=(20, 15))\n",
"\n",
"axes = list(chain.from_iterable(axes))\n",
"\n",
"for step, ax in zip(steps, axes):\n",
" ax.imshow(step, cmap=\"bone\")"
@@ -379,7 +406,7 @@
},
{
"cell_type": "code",
"execution_count": 224,
"execution_count": null,
"metadata": {},
"outputs": [
{
150 changes: 150 additions & 0 deletions scripts/azure/machine_learning/fine_tune.py
@@ -0,0 +1,150 @@
import os
import logging
from datetime import datetime

import click
import mlflow
import numpy as np
import tensorflow as tf
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

from src.model.director import ModelDirector
from src.dataset.dataset_loader import DatasetLoader
from src.config import (
    RANDOM_SEED,
    EARLY_STOPPING_CONFIG,
    REDUCE_LR_CONFIG,
    MODELS,
    BUILDERS,
    CALLBACKS,
    METRICS,
    config_logging,
)

config_logging()
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("azure")


def get_compiled_model(model, optimizer, loss):
    # Build the requested architecture through its builder and the ModelDirector.
    builder = BUILDERS[model]()

    director = ModelDirector(builder)
    model_nn = director.make()
    logger.info(f"Built model_nn with {str(builder)}")

    optimizer_cls = {
        "adam": tf.keras.optimizers.Adam,
        "sgd": tf.keras.optimizers.SGD,
    }[optimizer]()

    loss_cls = {
        "binary_crossentropy": tf.keras.losses.BinaryCrossentropy,
        "categorical_crossentropy": tf.keras.losses.CategoricalCrossentropy,
    }[loss]()

    metrics = [metric() for metric in METRICS]

    model_nn.compile(optimizer=optimizer_cls, loss=loss_cls, metrics=metrics, run_eagerly=False)
    logger.info("Compiled model")

    return model_nn


def get_compiled_distributed_model(model, optimizer, loss):
    # Variables must be created inside the strategy scope so they are mirrored across workers.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model_nn = get_compiled_model(model, optimizer, loss)

    return model_nn


@click.command()
@click.option(
    "--model", type=click.Choice(MODELS), default="mobilenet", help="Model to train"
)
@click.option(
    "--train", type=click.Path(exists=True), help="Path to the training dataset"
)
@click.option("--test", type=click.Path(exists=True), help="Path to the test dataset")
@click.option(
    "--optimizer",
    type=click.Choice(["adam", "sgd"]),
    default="adam",
    help="Optimizer to use",
)
@click.option(
    "--loss",
    type=click.Choice(["binary_crossentropy", "categorical_crossentropy"]),
    default="binary_crossentropy",
    help="Loss function to use",
)
@click.option("--epochs", type=click.INT, default=10, help="Number of epochs to train for")
@click.option("--batch_size", type=click.INT, default=64, help="Batch size for dataset loaders")
@click.option("--job_name", type=click.STRING, help="Azure Machine Learning job name")
@click.option("--distributed", is_flag=True, help="Use distributed strategy")
def run(model, train, test, optimizer, loss, epochs, batch_size, job_name, distributed):
    mlflow.set_experiment("lung-cancer-detection")
    mlflow_run = mlflow.start_run(run_name=f"train_{model}_{datetime.now().strftime('%Y%m%d%H%M%S')}")

    mlflow.log_param("optimizer", optimizer)
    mlflow.log_param("loss", loss)
    mlflow.log_param("epochs", epochs)
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("random_seed", RANDOM_SEED)

    logger.info(f"Started training run at {datetime.now()}")
    logger.info(
        f"Run parameters - optimizer: {optimizer}, loss: {loss}"
    )

    if not distributed:
        model_nn = get_compiled_model(model, optimizer, loss)
    else:
        model_nn = get_compiled_distributed_model(model, optimizer, loss)

    train_loader = DatasetLoader(train)
    test_loader = DatasetLoader(test)

    train_loader.set_seed(RANDOM_SEED)
    test_loader.set_seed(RANDOM_SEED)

    train_dataset = train_loader.get_dataset()
    test_dataset = test_loader.get_dataset()
    logger.info("Loaded train and test datasets")

    history = model_nn.fit(train_dataset, epochs=epochs, callbacks=CALLBACKS)
    logger.info("Trained model")

    for metric, values in history.history.items():
        for step, value in enumerate(values):
            mlflow.log_metric(f"{metric}", value, step=step)

    results = model_nn.evaluate(test_dataset, return_dict=True)
    logger.info("Evaluated model")

    for metric, value in results.items():
        mlflow.log_metric(f"Final {metric}", value)

    logger.info(f"Finished training at {datetime.now()}")

    try:
        mlflow.tensorflow.save_model(
            model=model_nn,
            path=os.path.join(job_name, model),
        )
    except TypeError as e:
        logger.error(f"Saving model raised an error:\n{e}")

    mlflow.tensorflow.log_model(
        model=model_nn,
        registered_model_name=model,
        artifact_path=model,
    )

    mlflow.end_run()


if __name__ == "__main__":
    run()  # pylint: disable=no-value-for-parameter