UBC-MDS/heart_disease_predictor_py

Heart Disease Predictor

Authors

  • Stephanie Wu
  • Albert Halim
  • Rongze Liu
  • Ziyuan Zhao

About

The Heart Disease Predictor project aims to build a reliable machine learning model that predicts the presence of heart disease based on a set of patient health measurements. This project employs data wrangling, exploratory data analysis (EDA), and classification techniques to explore the dataset and develop an accurate model.

The dataset used in this project is sourced from the UCI Machine Learning Repository. The dataset consists of 303 patient records, each including 13 attributes such as age, cholesterol levels, chest pain type, and maximum heart rate achieved. The target variable (num) indicates the presence or absence of heart disease. Our goal is to predict the target variable effectively, helping to assess patients' heart health in a clinical setting.


Project Objectives

  • Data Wrangling: Preprocess the raw dataset to prepare it for analysis.
  • Exploratory Data Analysis (EDA): Investigate relationships between patient features and heart disease presence.
  • Model Development: Train and evaluate a classification model to predict heart disease.
  • Evaluation: Assess the model's performance using metrics such as accuracy and confusion matrices.

Our final classifier achieved an overall accuracy of about 87%. This is promising, but the model needs further improvement before it could be relied on in real-world settings. False negatives (patients with heart disease whom the model misses) remain the primary concern, as they could lead to underdiagnosis.
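Because false negatives matter more here than raw accuracy, it can help to report recall alongside accuracy. The following is a minimal sketch in plain Python, assuming binary labels where 1 means heart disease is present. The function names and the sample labels are illustrative only; they are not taken from the project's code or results.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (1 = disease present)."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],  # missed heart disease cases
        "tp": counts[(1, 1)],
    }


def accuracy(c):
    """Fraction of all predictions that were correct."""
    return (c["tp"] + c["tn"]) / sum(c.values())


def recall(c):
    """Fraction of true disease cases the model caught."""
    positives = c["tp"] + c["fn"]
    return c["tp"] / positives if positives else 0.0


# Made-up labels for illustration; NOT the project's results.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

c = confusion_counts(y_true, y_pred)
print(c)
print("accuracy:", accuracy(c), "recall:", recall(c))
```

In this toy example the accuracy looks moderate while half of the positive cases are missed, which is the kind of gap that accuracy alone can hide.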


Dataset Details

The heart disease dataset was originally collected by researchers from four different institutions and compiled by researchers at the Cleveland Clinic Foundation. The attributes in the dataset include:

  • Age: Patient age in years.
  • Sex: Gender of the patient.
  • Chest Pain Type (cp): Type of chest pain experienced (four categories).
  • Resting Blood Pressure (trestbps): Blood pressure at rest, in mm Hg.
  • Cholesterol Level (chol): Serum cholesterol in mg/dl.
  • Max Heart Rate (thalach): Maximum heart rate achieved during exercise.

Additional features capture other physiological details, each potentially relevant to heart disease diagnosis.
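To make the record layout concrete, here is a minimal sketch of parsing the raw data with only the standard library. It assumes the input is the header-less, comma-separated Cleveland file as distributed by UCI, with "?" marking missing values. The column names beyond those listed above follow the UCI documentation, and treating any num greater than 0 as "disease present" is an assumption, not something this README specifies. The helper names and sample rows are hypothetical.

```python
import csv
import io

# Assumed column order of the UCI Cleveland file: 13 features plus the target.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]


def read_records(handle):
    """Parse header-less CSV rows into dicts; '?' (missing) becomes None."""
    records = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        values = [None if v.strip() == "?" else float(v) for v in row]
        records.append(dict(zip(COLUMNS, values)))
    return records


def has_disease(record):
    """Assumption: any num > 0 counts as heart disease present."""
    return record["num"] > 0


# Made-up rows in the assumed format, for illustration only.
sample = io.StringIO(
    "54,1,4,130,250,0,0,140,1,1.5,2,?,7,1\n"
    "47,0,2,120,210,0,0,165,0,0.0,1,0,3,0\n"
)
records = read_records(sample)
for r in records:
    print(r["age"], r["chol"], r["ca"], has_disease(r))
```

Replacing the in-memory sample with an open file handle would apply the same parsing to the full dataset; how missing values are then handled is left to the wrangling step.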


Report

The final report summarizing our findings and model development can be found here.


Dependencies

For a complete list of dependencies, refer to the environment.yml file.


Setup Instructions

Prerequisites

  • Install Conda to handle dependencies.

Using the Docker Container

To simplify the setup process, we have created a Docker container that includes all necessary dependencies for the Heart Disease Predictor project. Follow the steps below to use the container:

  1. Pull the Docker Image

    • Make sure Docker is installed on your machine. You can pull the latest version of the Docker image from DockerHub by running:
      docker pull <dockerhub-username>/heart_disease_predictor:latest
  2. Run the Docker Container

    • To start a container instance using the pulled image, run:
      docker run -p 8888:8888 -v $(pwd):/home/jovyan/work <dockerhub-username>/heart_disease_predictor:latest
      • This will start a Jupyter Notebook server that you can access in your browser at http://localhost:8888.
      • The -v $(pwd):/home/jovyan/work option mounts your current directory into the container so that you can access your project files.
  3. Using Jupyter Lab

    • Once the container is running, open Jupyter Lab in your browser at the address above. If you are prompted for a token, it should appear in the container's startup log. You can then run the analysis by opening src/heart_disease_predictor_report.ipynb and executing the cells as you would in a local setup.

Running the Analysis

  1. Navigate to the root of this project on your computer using the command line.
  2. Open the Jupyter notebook to start the analysis:
    jupyter lab src/heart_disease_predictor_report.ipynb
  3. Execute the notebook cells to run the data wrangling, EDA, and modeling steps.
    • Make sure the kernel is set to the appropriate environment (heart_disease_predictor).
    • You can select "Restart Kernel and Run All Cells" from the "Kernel" menu to execute all steps in the analysis sequentially.
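As a non-interactive alternative to "Restart Kernel and Run All Cells", the notebook can also be executed from a script. This is a minimal sketch that assumes nbconvert is installed in the active environment, which this README does not confirm. build_execute_command and run_notebook are hypothetical helper names, not part of the project.

```python
import subprocess

# Notebook path as given in the steps above.
NOTEBOOK = "src/heart_disease_predictor_report.ipynb"


def build_execute_command(notebook=NOTEBOOK):
    """Build an nbconvert command that runs all cells and saves outputs in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_notebook(notebook=NOTEBOOK):
    """Execute the notebook top to bottom; raises CalledProcessError on failure."""
    subprocess.run(build_execute_command(notebook), check=True)


print(" ".join(build_execute_command()))
# run_notebook()  # uncomment to execute (assumes nbconvert is installed)
```

Running from the project root keeps the relative notebook path valid, matching step 1 above.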

Updating the Docker Container

If there are changes in the codebase or dependencies, follow the steps below to update the container:

  1. Update the Dependencies

    • If any changes are made to the environment.yml file, you must regenerate the conda-lock file to pin the versions of the updated dependencies:
      conda-lock lock --file environment.yml
  2. Rebuild the Docker Image

    • Make sure the updated environment.yml and Dockerfile reflect the latest changes, then rebuild the Docker image using the command:
      docker build -t <dockerhub-username>/heart_disease_predictor:latest .
  3. Push the Updated Image

    • To make the updated image available to others, push it to DockerHub:
      docker push <dockerhub-username>/heart_disease_predictor:latest

Using Docker Compose

Docker Compose can simplify running multiple containers and configuring ports and volumes. To use it:

  1. Docker Compose File

    • Create a docker-compose.yml file in the root of your repository that defines the services required:
      version: '3'
      services:
        heart_disease_predictor:
          image: <dockerhub-username>/heart_disease_predictor:latest
          ports:
            - "8888:8888"
          volumes:
            - .:/home/jovyan/work
  2. Running with Docker Compose

    • Use the following command to launch the container with Docker Compose:
      docker-compose up
    • This will start the container, mapping the necessary ports and volumes as specified in the docker-compose.yml file.

Clean up

  • To deactivate the environment:
    conda deactivate

Adding a New Dependency

  1. Add the new dependency to the environment.yml file in a separate branch.
  2. Regenerate the conda-lock file:
    conda-lock lock --file environment.yml
  3. Test the updated environment and push your changes.

License

All code in the Heart Disease Predictor project is licensed under the MIT License. The project report is licensed under the CC0 1.0 Universal License. If you use or remix any part of this project, please provide appropriate attribution.
