Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major rewrite to enable use of newer pysam, fix several bugs, and add filtering options #104

Merged
merged 105 commits into from
Jun 12, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
105 commits
Select commit Hold shift + click to select a range
f8c4537
adding min VAF, RNA fragments, and ratio of alt to other nonref
iskandr Oct 22, 2018
b1a7fb0
resuming work with new filering options
iskandr Feb 13, 2019
4680552
fixed test_variant_helpers
iskandr Feb 13, 2019
534bee3
trying to get ProteinSequenceGenerator working
iskandr Feb 16, 2019
70f2711
added helper for genomes in tests
iskandr Feb 16, 2019
2ac825e
updating tests to use proteinsequencecreator
iskandr Feb 16, 2019
ac33abd
fixing imports in tests
iskandr Feb 18, 2019
1f5dba5
renaming variant_sequence_in_reading_frame to VariantORF
iskandr Feb 18, 2019
83cc32d
finished basic refactoring, now have to figure out why read counts di…
iskandr Feb 21, 2019
780b92e
figuring out how to get isovar to work with newer pysam
iskandr Feb 21, 2019
e5de086
making a unified ReadCreator
iskandr Feb 22, 2019
d32df91
combining LocusReadCreator and AlleleReadCreator into single ReadCreator
iskandr Feb 22, 2019
1b1360a
updating CLI to work with ReadCreator
iskandr Feb 23, 2019
782a031
working on making LocusRead tests correctly match new base0 coordinat…
iskandr Mar 4, 2019
3c173b2
added default converter for None to str '-'
iskandr Mar 4, 2019
b543ba0
working on getting base-0 vs base-1 right in the pysam interacting pa…
iskandr Mar 4, 2019
7819028
working on subtleties of base0 vs. base1 and reference vs. read loci
iskandr Mar 5, 2019
dff9dbc
working on subtleties of base0 vs. base1 and reference vs. read loci
iskandr Mar 5, 2019
016ece3
don't allow partial end of read alleles for insertions
iskandr Mar 5, 2019
dfbc506
got locus tests working again
iskandr Mar 5, 2019
fe43bab
fixing AlleleRead unit tests
iskandr Mar 7, 2019
9dcc59c
adding VariantSequenceCreator and TranslationCreator
iskandr Mar 13, 2019
2b8bbac
split out VariantSequenceCreator and TranslationCreator
iskandr Mar 13, 2019
cf940fd
wrote filtering helper functions on CoverageStats object
iskandr Mar 21, 2019
2fc3874
added more detail to algorithm description
iskandr Mar 25, 2019
b2f1b05
working on pulled results fields out into own object
iskandr Mar 26, 2019
f502941
added IsovarResult
iskandr Mar 27, 2019
88d4b54
getting rid of ProteinSequenceCreator
iskandr Mar 27, 2019
d4c2dab
renamed IsovarResult to VariantResult
iskandr Apr 5, 2019
65ff15f
adding more docstrings
iskandr Apr 5, 2019
2bccfc6
removed nucleotide counts
iskandr Apr 5, 2019
e54c273
moved overlapping genes/transcripts to VariantResult
iskandr Apr 16, 2019
a520b73
added some docstrings to isovar_result
iskandr Apr 29, 2019
1be5420
working on main Isovar entrypoint
iskandr May 7, 2019
2b552bc
updated sort key on ProteinSequence
iskandr May 7, 2019
1e81eaa
adding protein sequence properties to IsovarResult.to_record()
iskandr May 9, 2019
6de76e2
including varcode effect predictions in the IsovarResult
iskandr May 9, 2019
a0f9810
moved transcript_id_whitelist onto IsovarResult
iskandr May 9, 2019
6fecdeb
merged TranslationCreator with ProteinSequenceCreator
iskandr May 10, 2019
12f9da5
renamed GroupedAlleleReads to ReadEvidence
iskandr May 10, 2019
9ba7fb5
reorganizing filter settings
iskandr May 10, 2019
feafa53
adding filtering options to Isovar
iskandr May 10, 2019
ec33a11
added filtering module
iskandr May 10, 2019
9a043ef
moved extra method onto ProteinSequenceCreator
iskandr May 10, 2019
1cfce50
added helper which does filter splits
iskandr May 10, 2019
3ba17c7
merging new and old variant effect code
iskandr May 11, 2019
8195e76
added pandas serialization for IsovarResult objects
iskandr May 11, 2019
8b58c47
added test for main entrypoint
iskandr May 11, 2019
701c759
making entrypoint into a function
iskandr May 12, 2019
e5e110c
fixed enough typos to get unit test working
iskandr May 13, 2019
c3d83ec
expanded import list in init
iskandr May 14, 2019
fd3a5d3
adding methods to match unit tests
iskandr May 14, 2019
74f0826
working on fixing up unit tests
iskandr May 14, 2019
2f619a8
more unit test fixes
iskandr May 14, 2019
f4570c3
made some properties cached so I can avoid precomputing things in __i…
iskandr May 14, 2019
621bee0
added filter args and splitting out reference context helpers
iskandr May 14, 2019
296ee33
fixing unit tests and deleting unused code
iskandr May 14, 2019
1af6707
fixed more unit tests
iskandr May 14, 2019
8eb339d
working on getting dataframe conversions to work for CLI
iskandr May 14, 2019
bc76079
fixing more unit tests
iskandr May 14, 2019
939703a
got protein sequence tests working
iskandr May 14, 2019
86c4d3b
finally got rest of unit tests passing
iskandr May 14, 2019
145dbd4
added main CLI entrypoint
iskandr May 14, 2019
87acddf
adding filters to main command
iskandr May 14, 2019
6b1af52
actually create DataFrame for results
iskandr May 14, 2019
506ad2a
added filter flags, expanded first end-to-end test
iskandr May 15, 2019
fe8c2f3
added main CLI test
iskandr May 15, 2019
9a57c5f
added helpers to check whether Isovar's protein sequence matched varcode
iskandr May 15, 2019
a1fa404
small fixes to isovar result
iskandr May 15, 2019
59cffa0
added more properties to result
iskandr May 15, 2019
0a3f9de
added unit test for types of IsovarResult properties
iskandr May 15, 2019
f0f49fc
moved filtering code to own module
iskandr May 16, 2019
812b468
got rid of extra prints
iskandr May 16, 2019
a958728
added more information to IsovarResult about cDNA sequences
iskandr May 16, 2019
50d5e33
added alignment score tests
iskandr May 16, 2019
e8a307a
fixed alignment_score
iskandr May 16, 2019
a07fba7
got rid of keywords for pysam fetch to make other pysam versions happy
iskandr May 16, 2019
8ca03e9
fixed unit test data path on travis
iskandr May 16, 2019
2ce4636
fixed staticmethod
iskandr May 16, 2019
0e33bf5
improving README
iskandr May 16, 2019
886a8fd
removed integration tests while we're incompatible with Vaxrank
iskandr May 16, 2019
8c5072e
expanded README
iskandr May 16, 2019
d17cc9b
added filter tests
iskandr May 16, 2019
4b44700
expanded README
iskandr May 17, 2019
d5401d0
got rid of unused filter functions
iskandr May 19, 2019
3fed1b1
removed unused helper functions
iskandr May 19, 2019
c42abbb
added more comments
iskandr Jun 10, 2019
5cc88fe
added more comments to README
iskandr Jun 10, 2019
d9a3776
experimenting with definition list for extra commands
iskandr Jun 10, 2019
c9a2d8a
trying html in REAMDE
iskandr Jun 10, 2019
f8977ae
dl for command list in docs
iskandr Jun 10, 2019
495f1ff
removed stars for DL entries at bottom of docs
iskandr Jun 10, 2019
d50175d
moved sequencing recommendations to the bottom
iskandr Jun 10, 2019
0703443
added more information to README
iskandr Jun 10, 2019
9afc30f
alias 'pass' and 'passes_all_filters'
iskandr Jun 10, 2019
97348d2
added TOC
iskandr Jun 10, 2019
eabde39
updated names in README
iskandr Jun 10, 2019
db4f0c6
rename to Internal Design
iskandr Jun 10, 2019
2e96455
added diagram
iskandr Jun 10, 2019
638dbfe
moved diagram to the top
iskandr Jun 10, 2019
517bcf4
diagram
iskandr Jun 10, 2019
e980f61
slight refactoring of CLI to make it easier for reuse from Vaxrank
iskandr Jun 11, 2019
5b24e68
moved output option out of shared code for CLI args
iskandr Jun 11, 2019
e07ae4d
import new CLI helpers in __init__
iskandr Jun 11, 2019
fbc01de
added kwargs to helper which creates ArgumentParser
iskandr Jun 11, 2019
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -60,3 +60,7 @@ target/

#Ipython Notebook
.ipynb_checkpoints

# PyCharm
.idea/
vcs.xml
6 changes: 4 additions & 2 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,10 @@ before_script:
--custom-mirror https://github.com/openvax/ensembl-data/releases/download/GRCm38.87/
script:
- >
nosetests test --with-coverage --cover-package=isovar &&
nosetests openvax-integration-tests/test
nosetests test --with-coverage --cover-package=isovar
# removing integration tests until Vaxrank is updated to use new API
# &&
# nosetests openvax-integration-tests/test
after_success:
coveralls
deploy:
Expand Down
395 changes: 362 additions & 33 deletions README.md

Large diffs are not rendered by default.

34 changes: 32 additions & 2 deletions isovar/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (c) 2016-2018. Mount Sinai School of Medicine
# Copyright (c) 2016-2019. Mount Sinai School of Medicine
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand All @@ -14,4 +14,34 @@

from __future__ import print_function, division, absolute_import

__version__ = "0.9.0"
__version__ = "1.0.0"


from .allele_read import AlleleRead
from .dataframe_helpers import isovar_results_to_dataframe
from .isovar_result import IsovarResult
from .locus_read import LocusRead
from .main import run_isovar
from .protein_sequence import ProteinSequence
from .protein_sequence_creator import ProteinSequenceCreator
from .read_collector import ReadCollector
from .read_evidence import ReadEvidence
from .variant_orf import VariantORF
from .variant_sequence import VariantSequence
from .variant_sequence_creator import VariantSequenceCreator


__all__ = [
"run_isovar",
"isovar_results_to_dataframe",
"AlleleRead",
"IsovarResult",
"LocusRead",
"ProteinSequence",
"ProteinSequenceCreator",
"ReadCollector",
"ReadEvidence",
"VariantORF",
"VariantSequence",
"VariantSequenceCreator",
]
86 changes: 86 additions & 0 deletions isovar/alignment_score.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Copyright (c) 2019. Mount Sinai School of Medicine
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
Sequence alignment helpers
"""

from __future__ import print_function, division, absolute_import


def alignment_score(a, b, min_subsequence_length=1):
"""
Number of mismatches between all two input sequences, allows
for trimming of ends of sequences but not insertions or deletions
within the sequences. Number of trimmed amino acids from each sequence
count toward the mismatch total.

Parameters
----------
a : str

b : str

min_subsequence_length : int
Only consider subsequences which are at least this long.
Returns int
"""

n_a = len(a)
n_b = len(b)

# swap a and b if a is longer since the loops below expect the first
# sequence to be shorter. This happens because any subsequence of `a`
# is expected to be of a length that can be found within `b`
if n_a > n_b:
a, b = b, a
n_a, n_b = n_b, n_a

# compare all subsequences of a and b and count the number of mismatches
# between
best_score = n_a + n_b

# if we need to make sequence of at least length `min_subsequence_length`
# then we can't start the sequence more than that number of characters
# from the end of the string.
# For example, if n_a = 7 and min_subsequence_length = 6 then
# this loop should only consider start_a = {0, 1} but not any value
# higher than that.
for start_a in range(n_a - min_subsequence_length + 1):
# Similarly, only consider end indices for the subsequence of `a`
# which span the minimum number of characters required.
# For example, if n_a = 7, min_subsequence_length = 6, start_a = 1
# then this loop should only consider end_a = 7.
for end_a in range(start_a + min_subsequence_length, n_a + 1):
subseq_a = a[start_a:end_a]
n_subseq_a = len(subseq_a)
n_trimmed_a = n_a - n_subseq_a
# consider all subsequences of the second string of the same length
# as the subsequence extracted from the first string
for start_b in range(n_b - n_subseq_a + 1):
subseq_b = b[start_b:start_b + n_subseq_a]
n_subseq_b = len(subseq_b)
assert n_subseq_a == n_subseq_b
n_trimmed_b = n_b - n_subseq_b
# now that we have two subsequences of the same length,
# count up the number of mismatching characters between them
n_mismatches = sum([
a_char != b_char
for (a_char, b_char)
in zip(subseq_a, subseq_b)])
score = n_trimmed_a + n_trimmed_b + n_mismatches
if score < best_score:
best_score = score
return best_score
59 changes: 0 additions & 59 deletions isovar/allele_counts.py

This file was deleted.

108 changes: 108 additions & 0 deletions isovar/allele_read.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
# Copyright (c) 2016-2019. Mount Sinai School of Medicine
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Reads overlapping a locus of interest split into prefix,
allele (ref, alt, or otherwise), and suffix portions
"""

from __future__ import print_function, division, absolute_import
import logging

from .string_helpers import convert_from_bytes_if_necessary, trim_N_nucleotides
from .value_object import ValueObject

logger = logging.getLogger(__name__)


class AlleleRead(ValueObject):
"""
Extremely simplified representation of a read at a locus: just the allele
at the locus and sequence before/after. We're ignoring the base qualities
and any additional information about splicing, clipping or alignment.
"""
__slots__ = ["prefix", "allele", "suffix", "name", "sequence"]

def __init__(self, prefix, allele, suffix, name):
self.prefix = prefix
self.allele = allele
self.suffix = suffix
self.name = name
self.sequence = prefix + allele + suffix

def __len__(self):
return len(self.sequence)

@classmethod
def from_locus_read(cls, locus_read):
"""
Given a single LocusRead object, return either an AlleleRead or None

Parameters
----------
locus_read : LocusRead
Read which overlaps a variant locus but doesn't necessarily contain the
alternate nucleotides
"""
sequence = locus_read.sequence
read_name = locus_read.name

reference_base0_start_inclusive = locus_read.reference_base0_start_inclusive
reference_base0_end_exclusive = locus_read.reference_base0_end_exclusive

read_base0_start_inclusive = locus_read.read_base0_start_inclusive
read_base0_end_exclusive = locus_read.read_base0_end_exclusive

if read_base0_start_inclusive is None or read_base0_end_exclusive is None:
logger.debug(
"Skipping read '%s' because required bases in reference interval %s:%s aren't mapped",
read_name,
reference_base0_start_inclusive,
reference_base0_end_exclusive)
return None

reference_positions = locus_read.reference_positions

n_ref_bases = reference_base0_end_exclusive - reference_base0_start_inclusive

insertion = (n_ref_bases == 0)

if insertion:
# insertions require a sequence of non-aligned bases
# followed by the subsequence reference position
for read_index in range(read_base0_start_inclusive, read_base0_end_exclusive):
# all the inserted nucleotides should *not* align to the reference
if reference_positions[read_index] is not None:
logger.debug(
"Skipping read '%s', inserted nucleotides shouldn't map to reference",
read_name)
return None

nucleotides_at_variant_locus = convert_from_bytes_if_necessary(
sequence[read_base0_start_inclusive:read_base0_end_exclusive])

if "N" in nucleotides_at_variant_locus:
logger.debug(
"Skipping read '%s', found N nucleotides at variant locus",
read_name)
prefix = convert_from_bytes_if_necessary(sequence[:read_base0_start_inclusive])
suffix = convert_from_bytes_if_necessary(sequence[read_base0_end_exclusive:])

prefix, suffix = trim_N_nucleotides(prefix, suffix)

return AlleleRead(
prefix,
nucleotides_at_variant_locus,
suffix,
name=read_name)
Loading