
genomics training example #122

Open
saralinker opened this issue Aug 4, 2021 · 1 comment

@saralinker

Do you have an example script for training the genomics model? I am attempting to apply this approach more broadly, and am starting by trying to replicate your example, but my weights are not correct. Any help would be much appreciated!

I'm including the code here that I've written (repurposed from your code) to try to train the model in case that is helpful:

#####################################
# Training Genomics Model
#####################################

from __future__ import print_function
import tensorflow
print("Tensorflow version:", tensorflow.__version__)
import keras
print("Keras version:", keras.__version__)
import numpy as np
print("Numpy version:", np.__version__)

from tensorflow.keras.models import model_from_json
import simdna.synthetic as synthetic

#####################################
# Import Model Architecture from the original DeepLift code
#####################################
keras_model_json = "keras2_conv1d_record_5_model_PQzyq_modelJson.json"
keras_model = model_from_json(open(keras_model_json).read())
keras_model_config = keras_model.get_config()

model_empty = tensorflow.keras.Sequential().from_config(keras_model_config)

#####################################
# Convert Training Set to One-Hot Encoding
#####################################
def one_hot_encode_along_channel_axis(sequence):
    to_return = np.zeros((len(sequence), 4), dtype=np.int8)
    seq_to_one_hot_fill_in_array(zeros_array=to_return,
                                 sequence=sequence, one_hot_axis=1)
    return to_return

def seq_to_one_hot_fill_in_array(zeros_array, sequence, one_hot_axis):
    assert one_hot_axis == 0 or one_hot_axis == 1
    if (one_hot_axis == 0):
        assert zeros_array.shape[1] == len(sequence)
    elif (one_hot_axis == 1):
        assert zeros_array.shape[0] == len(sequence)
    #will mutate zeros_array
    for (i, char) in enumerate(sequence):
        if (char == "A" or char == "a"):
            char_idx = 0
        elif (char == "C" or char == "c"):
            char_idx = 1
        elif (char == "G" or char == "g"):
            char_idx = 2
        elif (char == "T" or char == "t"):
            char_idx = 3
        elif (char == "N" or char == "n"):
            continue  #leave that pos as all 0's
        else:
            raise RuntimeError("Unsupported character: " + str(char))
        if (one_hot_axis == 0):
            zeros_array[char_idx, i] = 1
        elif (one_hot_axis == 1):
            zeros_array[i, char_idx] = 1

#read in the data in the training set
data_filename = "sequences.simdata"
train_ids_fh = open("test.txt","r")
ids_to_load = [x.rstrip("\n") for x in train_ids_fh]
#read_simdata_file returns four lists: ids, sequences, embeddings, and labels
data = synthetic.read_simdata_file(data_filename, ids_to_load=ids_to_load)

onehot_data = np.array([one_hot_encode_along_channel_axis(seq) for seq in data.sequences])

#####################################
# Train Model
#####################################

model_empty.compile(loss="mse", optimizer="sgd")
model_empty.fit(onehot_data, data.labels)
model_empty.save_weights("new_model.h5", save_format='h5')

@AvantiShri
Collaborator

Hi @saralinker, sorry for the slow response - I was on medical leave last quarter.

In terms of a tutorial for training genomics models, I think this notebook by Ziga Avsec is a good place to start; it trains a very simple model with 1 convolutional layer, but hopefully it's enough to give you a grounding: https://colab.research.google.com/github/Avsecz/DL-genomics-exercise/blob/master/Simulated.ipynb. Note that colab notebooks currently default to tensorflow version 2, and if you want to force an earlier version of tensorflow you need to execute the command %tensorflow_version 1.x at the beginning of the notebook.
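For reference, a minimal sketch of how the top of such a Colab notebook might look (the %tensorflow_version magic is Colab-specific and has to run before TensorFlow is imported):

%tensorflow_version 1.x          # Colab-only magic; must run before importing tensorflow
import tensorflow as tf
print(tf.__version__)            # should now report a 1.x release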

When you say your "weights are not correct", can you be more specific? In case you were running into an hdf5 error with reading the model weights, this was because the model weights were saved with an earlier version of the h5py library; you have to use h5py < 3.0.0 for reading the weights to work. I have updated the example colab notebook in the deeplift repo to reflect this: https://colab.research.google.com/github/kundajelab/deeplift/blob/master/examples/genomics/genomics_simulation.ipynb
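As a quick sanity check, something along these lines should work once h5py is pinned (a sketch only; the weights filename below is a placeholder for whichever file you downloaded from the example):

import h5py
print("h5py version:", h5py.__version__)   # must be < 3.0.0, e.g. pip install "h5py<3.0.0"

from tensorflow.keras.models import model_from_json
keras_model = model_from_json(open("keras2_conv1d_record_5_model_PQzyq_modelJson.json").read())
keras_model.load_weights("pretrained_weights.h5")   # placeholder name for the downloaded weights file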

In terms of interpretation, if you have trouble using this particular deeplift repository, then you might have more luck using the DeepSHAP implementation (DeepSHAP is an extension of deeplift, and the implementation is done in a more flexible way such that it works with a wider array of models). I have an example notebook using DeepSHAP here: https://colab.research.google.com/github/AvantiShri/shap/blob/5fdad0651176cdbf1acd6c697604a71529695211/notebooks/deep_explainer/Tensorflow%20DeepExplainer%20Genomics%20Example%20With%20Hypothetical%20Importance%20Scores.ipynb. I also have detailed slides from a lab meeting I gave on using DeepSHAP, in case those are helpful: https://docs.google.com/presentation/d/1JCLMTW7ppA3Oaz9YA2ldDgx8ItW9XHASXM1B3regxPw/edit?usp=sharing
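As a rough sketch of the DeepSHAP API (assuming a trained Keras model named model, a background array of reference sequences named background, and one-hot encoded sequences to explain named onehot_data; see the linked notebook for the full genomics-specific workflow, including hypothetical importance scores):

import shap

# background holds a small set of reference sequences, e.g. shuffled versions of the inputs
explainer = shap.DeepExplainer(model, background)

# importance scores with the same shape as the one-hot encoded input sequences
shap_values = explainer.shap_values(onehot_data)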
