getting started notebook #9

gagneur · 2018-10-11T20:54:54Z

Package description: “Standard set of data-loaders for training and making predictions for DNA sequence-based models.” -> “Set of data-loaders for Kipoi sequence-based models”. We don’t enforce a standard, do we? And it is not just DNA.
Package name kipoiseq: shall we have “data loader” in the name, e.g. kipoidlseq / kipoiseqdl / kipoi_dlseq? Not sure.

Notebook

Have a quick intro: “This notebook gives a short introduction to kipoiseq, a package providing data-loaders for Kipoi sequence-based models. It provides examples for accessing sequences on disk, training sequence-based models by batch, and applying a trained sequence-based model to data by batch.”
The genomic intervals are 0-based, which is maybe expected from a Python point of view but not necessarily obvious to outside users (the GFF format uses 1-based coordinates). Write “You can see that the first interval matches the one provided in the intervals file (kipoiseq assumes the 0-based convention).” You may want to stress this elsewhere in the doc too. A typical pitfall.
Have “We will now apply to these sequences a model that assumes all sequences are of length 10. Since the intervals are of variable length, we have to resize them. Calling SeqDataset() with the auto_resize_len parameter returns intervals of the desired length, centered on the center of the input intervals.”
How does auto_resize_len deal with odd / even interval lengths? Why not truncate from the left? What happens when you go outside the chromosome range?
Write “You can load the whole dataset as a dictionary into memory using load_all()”
After the call to load_all(), I found it more useful to show the full data object.
Have “seqname” instead of “chr”. We used seq_name in GenomeIntervals and GenomicRanges used seqnames for a good reason: sequences do not have to be chromosomes, e.g. vector, construct, MPRA, contigs, transcripts, protein, …
Have “We will now train a convolutional neural network on these data. To this end, we define a Keras model consisting of a first convolutional layer with 2 filters of size 3, ReLU activation, max pooling, and a final dense layer.”
Have “This model will be trained using the Adam optimizer, with binary cross-entropy as the loss function and accuracy as the performance metric.”
Have “Training is done in batches of size 2.”
[ ] Have “We will now create an iterator from the data loader that returns data in batches. The number of CPU workers is set to 4.” before defining the iterator:
iterator = dl.batch_train_iter(batch_size=batch_size, num_workers=4)
[ ] Explain what the iterator returns (“What does the iterator return?”) right after creating it. Afterwards, make the model-fitting call that uses the iterator.
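
To make the two interval points above concrete (the 0-based convention and the odd / even and out-of-range questions about resizing), here is a minimal sketch. It is not kipoiseq's implementation: `gff_to_zero_based` and `resize_centered` are hypothetical helpers, and rounding the start down and shifting the window back into range are just one possible set of choices that the notebook could document.

```python
def gff_to_zero_based(start, end):
    """Convert a 1-based, end-inclusive interval (GFF style)
    to a 0-based, half-open interval (Python-slice style)."""
    return start - 1, end


def resize_centered(start, end, width, seq_len=None):
    """Resize a 0-based, half-open interval to `width`, keeping its center.

    Hypothetical helper, not the kipoiseq implementation. When the old and
    new lengths differ by an odd number, the center cannot be kept exactly;
    here the new start is rounded down, so the window sits half a base to
    the left of the true center.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    new_start = (start + end - width) // 2  # floor division rounds down
    new_end = new_start + width
    if seq_len is not None:
        if width > seq_len:
            raise ValueError("width exceeds the sequence length")
        # One possible policy at the sequence ends: shift the window back in.
        if new_start < 0:
            new_start, new_end = 0, width
        elif new_end > seq_len:
            new_start, new_end = seq_len - width, seq_len
    return new_start, new_end


# GFF "1..10" covers the same bases as the 0-based, half-open interval [0, 10).
print(gff_to_zero_based(1, 10))
# Even length difference: the center (15) is kept exactly.
print(resize_centered(10, 20, 4))
# Odd length difference: the window is off-center by half a base.
print(resize_centered(10, 20, 5))
# Without a sequence length the window may run past the start of the sequence.
print(resize_centered(0, 4, 10))
# With a sequence length the window is shifted back into range.
print(resize_centered(0, 4, 10, seq_len=100))
```

Whatever policy auto_resize_len actually follows, stating it in the notebook in these terms (rounding direction, behaviour at sequence ends) would answer the questions above.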
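
For the “What does the iterator return?” item, a toy stand-in may help show the kind of explanation meant. The sketch below assumes that a training iterator yields (inputs, targets) batches and cycles over the data indefinitely; `toy_batch_train_iter` is a hypothetical function written for illustration, and the sequences and labels are made up. It does not use workers and is not the kipoiseq code.

```python
from itertools import islice


def toy_batch_train_iter(inputs, targets, batch_size, cycle=True):
    """Yield (inputs, targets) batches; hypothetical stand-in for a
    training iterator. With cycle=True it restarts after each pass."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    if len(inputs) == 0:
        raise ValueError("no data to iterate over")
    n = len(inputs)
    while True:
        for i in range(0, n, batch_size):
            # The last batch of a pass may be smaller than batch_size.
            yield inputs[i:i + batch_size], targets[i:i + batch_size]
        if not cycle:
            break


# Made-up example data: five sequences of length 10 with binary labels.
seqs = ["ACGTACGTAC", "TTTTTTTTTT", "GGGGGCCCCC", "ACACACACAC", "GTGTGTGTGT"]
labels = [1, 0, 0, 1, 1]

iterator = toy_batch_train_iter(seqs, labels, batch_size=2)
first_batches = list(islice(iterator, 4))  # take four batches and inspect them
for batch_inputs, batch_targets in first_batches:
    print(len(batch_inputs), batch_targets)
```

Showing one batch like this in the notebook, right after creating the real iterator, makes it clear what the model-fitting call will consume.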

gagneur assigned Avsecz Oct 11, 2018
