getting started notebook #9

Open
11 tasks
gagneur opened this issue Oct 11, 2018 · 0 comments

gagneur commented Oct 11, 2018

  • Package description: “Standard set of data-loaders for training and making predictions for DNA sequence-based models.” -> “Set of data-loaders for Kipoi sequence-based models”. We don’t enforce a standard, do we? And it is not just DNA.

  • kipoiseq: shall we have “data loader” in the package name, e.g. kipoidlseq / kipoiseqdl / kipoi_dlseq? Not sure.

Notebook

  • have a quick intro: “This notebook gives a short introduction to kipoiseq, a package providing data-loaders for Kipoi sequence-based models. It provides examples for accessing sequences on disk, for batch training of sequence-based models, and for applying a trained sequence-based model to data in batches.”

  • the genomic intervals are 0-based, which may be expected from a Python point of view but is not necessarily obvious to users coming from elsewhere (the GFF format uses 1-based coordinates). Write “You can see that the first interval matches the one provided in the intervals file (kipoiseq assumes the 0-based convention).” You may want to stress this elsewhere in the doc too. It is a typical pitfall.

  • have “We will now apply a model to these sequences that assumes all sequences to be of length 10. Since the intervals are of variable length, we have to resize them. Calling SeqDataset() with the auto_resize_len parameter returns intervals of the desired length, centered on the center of the input intervals.”

  • How does auto_resize_len deal with odd / even interval lengths? Why not truncate from the left? What happens when you go outside the chromosome range?

  • Write “You can load the whole dataset into memory as a dictionary using load_all()”

  • after the call to load_all(), I found it more useful to show the full data object.

  • Have “seqname” instead of “chr”. We used seq_name in GenomeIntervals, and GenomicRanges uses seqnames, for a good reason: sequences do not have to be chromosomes, e.g. vector, construct, MPRA, contigs, transcripts, protein, …

  • have “We will now train a convolutional neural network on these data. To this end, we define a Keras model consisting of a first convolutional layer with 2 filters of size 3, ReLU activation, max pooling, and a final dense layer.”

  • have “This model will be trained using the Adam optimizer, with binary cross-entropy as the loss function and accuracy as the performance metric.”

  • have “Training is done in batches of size 2.”

  • [ ] have “We will now create an iterator from the data loader that returns data by batches. The number of CPU workers is set to 4.” before defining the iterator:
    iterator = dl.batch_train_iter(batch_size=batch_size, num_workers=4)

  • [ ] explain “What does the iterator return?” after instantiating it. Afterward, do the call for model fitting that uses the iterator.
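
To make the coordinate and resizing points above concrete, here is a minimal sketch in plain Python. It is not kipoiseq's implementation: `Interval`, `from_gff`, and `resize_centered` are hypothetical names, and both the handling of odd length differences and the behaviour at sequence boundaries are just one possible policy. Whatever auto_resize_len actually does would need to be stated in the notebook.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical 0-based, half-open interval [start, end) on a named sequence."""
    seqname: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


def from_gff(seqname: str, start: int, end: int) -> Interval:
    """Convert 1-based, inclusive GFF coordinates to 0-based, half-open ones."""
    return Interval(seqname, start - 1, end)


def resize_centered(iv: Interval, length: int, seq_len: int) -> Interval:
    """Return an interval of `length` roughly centered on `iv`.

    Assumed policy (not taken from kipoiseq):
    - when the length difference is odd, floor division of the offset decides
      which side receives the extra base;
    - a window that would leave [0, seq_len) is shifted back inside it.
    """
    if length > seq_len:
        raise ValueError("requested length exceeds the sequence length")
    # Offset of the new start relative to the old one (negative when growing).
    start = iv.start + (iv.length - length) // 2
    # Keep the window inside the sequence.
    start = max(0, min(start, seq_len - length))
    return Interval(iv.seqname, start, start + length)


if __name__ == "__main__":
    gff_iv = from_gff("chr1", 11, 15)  # GFF bases 11..15, five bases long
    print(gff_iv, gff_iv.length)
    print(resize_centered(gff_iv, length=10, seq_len=100))
```

The odd / even question shows up in the floor division: growing a 5-base interval to 10 bases cannot add the same number of bases on both sides, so the doc should say which side gets the extra one. Shifting the window at the boundary is only one option; padding with N or dropping the interval are others.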
