Reproducible analysis and data for "Identifying molecular interactions in omics data for clinical biomarker discovery" paper

MiquelTriana/QLattice-clinical-omics

QLattice Clinical Omics paper

Reproduce results

Create environment

We recommend using virtualenv to create an isolated environment. The analysis was run with Python 3.8.

From a terminal in the root folder:

virtualenv venv -p 3.8
source venv/bin/activate
pip install -r requirements.txt

Run notebooks

The notebooks folder contains one subfolder for each of the four cases discussed in the paper. All models, performance metrics, and figures can be reproduced by running the notebooks.

QLattice tutorial

Below is a basic tutorial on how to use the QLattice to find models that relate the input variables of a dataset to its output variable. The corresponding Jupyter notebook is in the notebooks folder of this repository. Other tutorials can be found on the Feyn+QLattice documentation page.


Finding AD biomarkers in proteomics data

Feyn version: 2.1+

Can the QLattice deal with omics data that is noisy and contains thousands of features? It certainly can!

Omics data typically contains hundreds to thousands of features (proteins, transcripts, methylated DNA, etc.) measured in samples derived from sources such as blood, tissue, or cell culture. Such data is often used for exploratory analysis, e.g. in biomarker discovery or in understanding the mechanism of action of a certain drug, and the work often resembles a bit of a "fishing exercise".

Thus, there is a need to quickly and reliably identify the most important features and their interactions that contribute to a certain signal (e.g. disease state, cell-type identity, cancer detection).

In this tutorial we present a brief workflow for building simple and interpretable models for proteomics data. This specific example is taken from a study by Bader & Geyer et al. 2020 (Mann group) and contains samples taken from the cerebrospinal fluid of Alzheimer's disease (AD) patients and non-AD patients. We will show you how to build a QLattice model that classifies people as AD or non-AD according to their proteomic profiles.

The dataset contains over a thousand features. In this example, each feature is the intensity of a protein measured by mass spectrometry.

import numpy as np
import pandas as pd
import feyn

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Load the data

Note that the data has been preprocessed and missing values have been imputed. It contains 1166 proteins measured in 88 non-AD and 49 AD subjects.

data = pd.read_csv("../data/ad_omics.csv")

# Record the categorical columns in the dataset
# (features are treated as numerical by default).
stypes = {}
for f in data.columns:
    if data[f].dtype == 'object':
        stypes[f] = 'c'
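
The loop above relies on pandas storing text columns with the `object` dtype. A slightly more defensive variant, sketched here under the assumption that every non-numeric column should be treated as categorical, uses `pandas.api.types.is_numeric_dtype` instead. The helper name `infer_stypes` and the toy column names are hypothetical and are not part of Feyn or of the AD dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_stypes(df: pd.DataFrame) -> dict:
    """Map every non-numeric column to the categorical semantic type 'c'."""
    return {col: "c" for col in df.columns if not is_numeric_dtype(df[col])}


# Toy frame with hypothetical columns, used only to illustrate the behaviour.
toy = pd.DataFrame({
    "protein_a": [0.1, 0.4, 0.3],
    "sex": ["f", "m", "f"],
})
toy_stypes = infer_stypes(toy)
print(toy_stypes)
```

The resulting dictionary has the same shape as `stypes` above, so it could be passed to the `stypes` argument in the same way.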

Split dataset into train and test set

# Set random seed for reproducibility
random_seed = 42

# Define the target variable
target = "_clinical AD diagnosis"

# Split
train, test = train_test_split(data, test_size=0.33, stratify=data[target], random_state=random_seed)
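
Because the classes are imbalanced, it can be worth confirming that the stratified split keeps the AD fraction roughly equal in both parts. The sketch below does not load the real data. It uses synthetic labels with the class sizes quoted above (88 non-AD, 49 AD), and the 0/1 coding of the labels is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class sizes quoted above: 88 non-AD (0), 49 AD (1).
labels = np.array([0] * 88 + [1] * 49)
indices = np.arange(len(labels))

# Same split settings as in the tutorial, applied to row indices.
train_idx, test_idx = train_test_split(
    indices, test_size=0.33, stratify=labels, random_state=42
)

overall_rate = labels.mean()
train_rate = labels[train_idx].mean()
test_rate = labels[test_idx].mean()
print(f"AD fraction: overall={overall_rate:.3f} "
      f"train={train_rate:.3f} test={test_rate:.3f}")
```

On the real data, the same check could be done with `train[target].value_counts(normalize=True)` and the equivalent call on `test`.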

Train the QLattice

Sample and fit models

This occurs in the following steps:

  1. Sample models from the QLattice.
  2. Fit the models by minimizing BIC (Bayesian Information Criterion).
  3. Update the QLattice with the structures of the best models.
  4. Repeat the process.

All of these steps are wrapped in the auto_run function:

# Connecting
ql = feyn.connect_qlattice()

# Reset and set random seed
ql.reset(random_seed=random_seed)

# Sample and fit models
models = ql.auto_run(
    data=train,
    output_name=target,
    kind='classification',
    stypes=stypes,
    n_epochs=20
    )
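
To make the BIC criterion from step 2 concrete, here is a minimal sketch of the textbook definition, BIC = k ln(n) - 2 ln(L), applied to a binary classifier. How Feyn counts parameters or computes the likelihood internally is not stated in this tutorial, so the functions, labels, predictions, and parameter counts below are hypothetical illustrations rather than the library's implementation.

```python
import math


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Textbook BIC: k * ln(n) - 2 * ln(L). Lower values are better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def bernoulli_log_likelihood(y_true, p_pred, eps=1e-12):
    """Log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical labels and predictions from two candidate models.
y = [0, 0, 1, 1, 0, 1]
simple_model = [0.2, 0.3, 0.7, 0.8, 0.3, 0.7]    # assume 3 fitted parameters
complex_model = [0.2, 0.3, 0.7, 0.8, 0.3, 0.75]  # assume 8 fitted parameters

bic_simple = bic(bernoulli_log_likelihood(y, simple_model), 3, len(y))
bic_complex = bic(bernoulli_log_likelihood(y, complex_model), 8, len(y))
print(bic_simple < bic_complex)
```

The complex model fits marginally better, but the k ln(n) penalty outweighs that gain, which is why a BIC-based search tends to favour small, interpretable models.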

Inspect the top model

best = models[0]
best.plot(train, test)

[Figure: model plot]

With the plot below, we inspect the Pearson correlation between the values at each node and the true output:

best.plot_signal(train)

[Figure: signal plot]

As expected, MAPT (i.e. Tau) seems to be driving most of the signal here. Let's investigate further.
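
For reference, the Pearson correlation reported in the signal plot is the standard coefficient between two sequences. The sketch below computes it by hand for made-up numbers. The node activations and labels are hypothetical and do not come from the fitted model.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical node activations and the matching binary diagnoses.
node_values = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
true_output = [0, 0, 0, 1, 1, 1]
r = pearson(node_values, true_output)
print(round(r, 3))
```

A node whose values track the diagnosis closely gets a coefficient near 1, which is the pattern described for MAPT above.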

Explore features

Let's look at how the different features play together.

# Hold NID2 fixed at its 25%, 50% and 75% quantiles in the training set
show_quantiles = 'NID2'
fixed = {}
fixed[show_quantiles] = [
    train[show_quantiles].quantile(q=0.25),
    train[show_quantiles].quantile(q=0.5),
    train[show_quantiles].quantile(q=0.75)
]

best.plot_response_1d(train, by="MAPT", input_constraints=fixed)

[Figure: response plot of MAPT at three NID2 quantiles]

This response plot shows how higher NID2 levels shift the MAPT curve to the left. In other words, the higher a subject's NID2 level, the lower their MAPT level has to be for a positive AD prediction.
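
The leftward shift can be illustrated with a toy logistic model in which both inputs push the prediction towards AD. This is only a sketch: the additive form, the weights, the bias, and the NID2 values are hypothetical and are not the model found by the QLattice.

```python
import math


def ad_probability(mapt, nid2, w_mapt=2.0, w_nid2=1.0, bias=-3.0):
    """Logistic model with hypothetical weights; NOT the fitted QLattice model."""
    return 1.0 / (1.0 + math.exp(-(w_mapt * mapt + w_nid2 * nid2 + bias)))


def mapt_threshold(nid2, w_mapt=2.0, w_nid2=1.0, bias=-3.0):
    """MAPT level at which the predicted probability crosses 0.5."""
    return -(w_nid2 * nid2 + bias) / w_mapt


# Hypothetical NID2 levels standing in for the 25%, 50% and 75% quantiles.
thresholds = [mapt_threshold(nid2) for nid2 in (0.5, 1.0, 1.5)]
print(thresholds)
```

In this toy setting, each increase in NID2 lowers the MAPT value needed to reach a probability of 0.5, which mirrors the shift seen in the response plot.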
