
genomics training example #122

Open
saralinker opened this issue Aug 4, 2021 · 1 comment

@saralinker

Do you have an example script for training the genomics model? I am attempting to apply this approach more broadly, and am starting by trying to replicate your example, but my weights are not correct. Any help would be much appreciated!

I'm including the code here that I've written (repurposed from your code) to try to train the model in case that is helpful:

#####################################
# Training Genomics Model
#####################################

from __future__ import print_function
import tensorflow
print("Tensorflow version:", tensorflow.__version__)
import keras
print("Keras version:", keras.__version__)
import numpy as np
print("Numpy version:", np.__version__)

from tensorflow.keras.models import model_from_json
import simdna.synthetic as synthetic

#####################################
# Import Model Architecture from the original DeepLift code
#####################################
keras_model_json = "keras2_conv1d_record_5_model_PQzyq_modelJson.json"
keras_model = model_from_json(open(keras_model_json).read())
keras_model_config = keras_model.get_config()

model_empty = tensorflow.keras.Sequential().from_config(keras_model_config)

#####################################
# Convert Training Set to One-Hot Encoding
#####################################
def one_hot_encode_along_channel_axis(sequence):
    to_return = np.zeros((len(sequence), 4), dtype=np.int8)
    seq_to_one_hot_fill_in_array(zeros_array=to_return,
                                 sequence=sequence, one_hot_axis=1)
    return to_return

def seq_to_one_hot_fill_in_array(zeros_array, sequence, one_hot_axis):
    assert one_hot_axis == 0 or one_hot_axis == 1
    if (one_hot_axis == 0):
        assert zeros_array.shape[1] == len(sequence)
    elif (one_hot_axis == 1):
        assert zeros_array.shape[0] == len(sequence)
    #will mutate zeros_array
    for (i, char) in enumerate(sequence):
        if (char == "A" or char == "a"):
            char_idx = 0
        elif (char == "C" or char == "c"):
            char_idx = 1
        elif (char == "G" or char == "g"):
            char_idx = 2
        elif (char == "T" or char == "t"):
            char_idx = 3
        elif (char == "N" or char == "n"):
            continue  #leave that pos as all 0's
        else:
            raise RuntimeError("Unsupported character: " + str(char))
        if (one_hot_axis == 0):
            zeros_array[char_idx, i] = 1
        elif (one_hot_axis == 1):
            zeros_array[i, char_idx] = 1

#read in the data in the training set
data_filename = "sequences.simdata"
train_ids_fh = open("test.txt","r")
ids_to_load = [x.rstrip("\n") for x in train_ids_fh]
#read_simdata_file returns four lists: ids, sequences, embeddings, and labels
data = synthetic.read_simdata_file(data_filename, ids_to_load=ids_to_load)

onehot_data = np.array([one_hot_encode_along_channel_axis(seq) for seq in data.sequences])

#####################################
# Train Model
#####################################

model_empty.compile(loss="mse", optimizer="sgd")
model_empty.fit(onehot_data, data.labels)
model_empty.save_weights("new_model.h5", save_format='h5')

@AvantiShri
Collaborator

Hi @saralinker, sorry for the slow response - I was on medical leave last quarter.

In terms of a tutorial for training genomics models, I think this notebook by Ziga Avsec is a good place to start; it trains a very simple model with 1 convolutional layer, but hopefully it's enough to give you a grounding: https://colab.research.google.com/github/Avsecz/DL-genomics-exercise/blob/master/Simulated.ipynb. Note that colab notebooks currently default to tensorflow version 2, and if you want to force an earlier version of tensorflow you need to execute the command %tensorflow_version 1.x at the beginning of the notebook.
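For reference, a minimal sketch of how the top of such a Colab notebook might look (the %tensorflow_version magic is Colab-specific and has to run before TensorFlow is imported):

%tensorflow_version 1.x          # Colab-only magic; must run before importing tensorflow
import tensorflow as tf
print(tf.__version__)            # should now report a 1.x release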

When you say your "weights are not correct", can you be more specific? In case you were running into an hdf5 error with reading the model weights, this was because the model weights were saved with an earlier version of the h5py library; you have to use h5py < 3.0.0 for reading the weights to work. I have updated the example colab notebook in the deeplift repo to reflect this: https://colab.research.google.com/github/kundajelab/deeplift/blob/master/examples/genomics/genomics_simulation.ipynb
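As a quick sanity check, something along these lines should work once h5py is pinned (a sketch only; the weights filename below is a placeholder for whichever file you downloaded from the example):

import h5py
print("h5py version:", h5py.__version__)   # must be < 3.0.0, e.g. pip install "h5py<3.0.0"

from tensorflow.keras.models import model_from_json
keras_model = model_from_json(open("keras2_conv1d_record_5_model_PQzyq_modelJson.json").read())
keras_model.load_weights("pretrained_weights.h5")   # placeholder name for the downloaded weights file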

In terms of interpretation, if you have trouble using this particular deeplift repository, then you might have more luck using the DeepSHAP implementation (DeepSHAP is an extension of deeplift, and the implementation is done in a more flexible way such that it works with a wider array of models). I have an example notebook using DeepSHAP here: https://colab.research.google.com/github/AvantiShri/shap/blob/5fdad0651176cdbf1acd6c697604a71529695211/notebooks/deep_explainer/Tensorflow%20DeepExplainer%20Genomics%20Example%20With%20Hypothetical%20Importance%20Scores.ipynb. I also have detailed slides from a lab meeting I gave on using DeepSHAP, in case those are helpful: https://docs.google.com/presentation/d/1JCLMTW7ppA3Oaz9YA2ldDgx8ItW9XHASXM1B3regxPw/edit?usp=sharing
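As a rough sketch of the DeepSHAP API (assuming a trained Keras model named model, a background array of reference sequences named background, and one-hot encoded sequences to explain named onehot_data; see the linked notebook for the full genomics-specific workflow, including hypothetical importance scores):

import shap

# background holds a small set of reference sequences, e.g. shuffled versions of the inputs
explainer = shap.DeepExplainer(model, background)

# importance scores with the same shape as the one-hot encoded input sequences
shap_values = explainer.shap_values(onehot_data)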
