This project focuses on comprehensive phenotyping of patients with diabetes using a dataset of patient records. The primary objective is to categorize these patients based on their clinical characteristics, such as vitals, eating behaviors, and treatment regimens. The analysis aims to provide detailed insights into different diabetes patient profiles.
The dataset is sourced from the UCI Machine Learning Repository and contains outpatient care records for 70 diabetes patients. It is intended for medical research and covers several kinds of patient data points: insulin doses, blood glucose measurements, meal ingestion, exercise activities, and special events.
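As a rough illustration of what loading these records might look like, the sketch below assumes each patient file is tab-separated with four fields (date as MM-DD-YYYY, time as HH:MM, a numeric event code, and a value), as described in the dataset's documentation. The function name `load_records`, the column names, and the inline sample rows are made up for this example; the notebook may do this differently.

```python
import io
import pandas as pd

# Hypothetical inline sample in the assumed layout: date, time, code, value.
SAMPLE = (
    "04-21-1991\t09:09\t58\t100\n"
    "04-21-1991\t09:09\t33\t9\n"
    "04-21-1991\t18:00\t62\t119\n"
    "06-31-1991\t07:50\t58\t149\n"   # invalid calendar date
    "04-22-1991\t08:00\t58\t0Hi\n"   # non-numeric value
)

def load_records(source, patient_id):
    """Read one patient's records and drop rows that fail to parse."""
    df = pd.read_csv(source, sep="\t", header=None,
                     names=["date", "time", "code", "value"], dtype=str)
    # Combine date and time into one timestamp; unparseable rows become NaT.
    df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"],
                                     format="%m-%d-%Y %H:%M", errors="coerce")
    # Non-numeric codes or values become NaN.
    df["code"] = pd.to_numeric(df["code"], errors="coerce")
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    # Keep only rows where every field parsed cleanly.
    df = df.dropna(subset=["timestamp", "code", "value"]).copy()
    df["code"] = df["code"].astype(int)
    df["patient_id"] = patient_id
    return df[["patient_id", "timestamp", "code", "value"]].reset_index(drop=True)

records = load_records(io.StringIO(SAMPLE), patient_id=1)
print(records)
```

Passing a file path instead of the `io.StringIO` object should work the same way, since `pd.read_csv` accepts both.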
- Data Preparation: The dataset is processed to handle missing values, convert date and time to a uniform format, and remove invalid entries.
- Feature Engineering: New features are derived from the dataset to support the analysis, such as labeling each record by its recording method (electronic or paper), inferred from its time of entry, and aggregating data by patient ID.
- Exploratory Data Analysis (EDA): Descriptive statistics are used to understand the data better, and outliers are identified and handled appropriately.
- Clustering for Phenotyping: KMeans clustering is used to group patients into clusters based on their clinical characteristics. The resulting groups can inform personalized suggestions for managing blood glucose levels.
- Data Visualization: Heatmaps and parallel coordinates plots are used to interpret the clusters and better understand the patient profiles.
- Python
- Pandas for data manipulation
- Matplotlib and Seaborn for data visualization
- Scikit-learn for KMeans clustering
- Clone or download the repository.
- Install the required Python libraries.
- Run the Jupyter Notebook provided, following the steps outlined for data preparation, EDA, and clustering.
- Explore the visualizations and interpret the results to understand different patient phenotypes.
- The project identifies distinct clusters representing varying diabetes management styles, from conservative to intensive.
- Each cluster offers unique insights into patient behavior, insulin usage, and blood glucose levels, enabling tailored patient care strategies.
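The aggregation and clustering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the records are synthetic (two made-up groups of patients, not real data), the feature names and the choice of two clusters are assumptions, and standardizing with `StandardScaler` before KMeans is a common practice rather than something the project is confirmed to do.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic long-format records (NOT real patient data): patients 1-10 and
# 11-20 are given different made-up glucose and insulin levels.
rng = np.random.default_rng(0)
rows = []
for pid in range(1, 21):
    intensive = pid > 10
    for _ in range(30):
        rows.append({"patient_id": pid, "kind": "glucose",
                     "value": rng.normal(140 if intensive else 180, 10)})
        rows.append({"patient_id": pid, "kind": "insulin",
                     "value": rng.normal(12 if intensive else 5, 1)})
records = pd.DataFrame(rows)

# Aggregate by patient ID: mean and standard deviation per measurement kind.
features = (records.groupby(["patient_id", "kind"])["value"]
            .agg(["mean", "std"])
            .unstack("kind"))
features.columns = [f"{kind}_{stat}" for stat, kind in features.columns]

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# Group patients into clusters; n_clusters=2 is an arbitrary choice here.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
features["cluster"] = model.fit_predict(scaled)

# Per-cluster feature averages give a first look at each patient profile.
print(features.groupby("cluster").mean().round(1))
```

On real data the number of clusters would normally be chosen by inspecting something like an elbow plot or silhouette scores rather than fixed in advance.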
Feel free to fork this repository, make changes, and submit pull requests. Any contributions to improve the analysis or extend the dataset are welcome.