Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpectedly short protein sequences #90

Open
iskandr opened this issue Nov 27, 2017 · 9 comments
Open

Unexpectedly short protein sequences #90

iskandr opened this issue Nov 27, 2017 · 9 comments

Comments

@iskandr
Copy link
Contributor

iskandr commented Nov 27, 2017

Issue reported by email from Scott Brown:

I have a question regarding the protein sequence output from Isovar. I have seen multiple cases where the protein that is output is shorter than the requested ‘--protein-sequence-length’, but when checking the locus in IGV there appears to be sufficient mutant read support covering both sides of the mutation site. Is there a parameter that is limiting the length of the protein, or is there some other reason that the returned protein is not matching the requested length?

...

It appears to be finding two possible proteins based on two transcripts, and despite having equal read support, choosing the shorter of the two (it outputs MYPTAALIVNLRPNTF, whereas manual inspection suggests to me that it should output AAGAVEWMYPTAALIVNLRPNTF).

Separately, I found another case where the somatic mutation was the very last nucleotide of an exon, and Isovar output the peptide based on two reads which went directly into the intron rather than nearly 20 mutant reads which jumped to the next exon correctly. I suspect this is because only the intronic reads were recruited since Isovar takes reads that map to the region +/-1 nucleotide of the mutation? In this case, right at the edge of the exon, it seems like this may be undesired behaviour, at least when there are more reads containing the mutation which skip the intron and read into the next exon?

Aside from those two cases, I think I have been able to explain all shorter-than-expected proteins due to insufficient numbers of high quality mapped reads.

STDOUT from isovar invocation:

$ PYTHONPATH="/projects/sbrown_prj/bin/isovar/" python2 /projects/sbrown_prj/bin/isovar/script/isovar-protein-sequences.py --vcf POG616_17_81042903_snv.pvcf --bam /projects/POG/POG_data/POG616/rna/biop1_t_P01608/reposition/hg19a_jg-e69_bwa-mem-0.7.6a/C9J57ANXX_8_CCGCAA_withJunctionsOnGenome_dupsFlagged.bam --min-alt-rna-reads 2 --protein-sequence-length 23 --output POG616_17_81042903_isovar.csv

2017-10-25 12:24:58,904 - isovar.allele_reads:200 - INFO - Gathering reads for Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37')
2017-10-25 12:24:58,918 - isovar.allele_reads:208 - INFO - Gathering variant reads for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37') (chromosome = 17, gene names = [u'METRNL'])
2017-10-25 12:24:58,947 - isovar.locus_reads:314 - INFO - Found 486 reads overlapping locus 17: 81042902-81042904
2017-10-25 12:24:58,947 - isovar.variant_sequences:422 - INFO - Initial pool of 28 variant sequences (min length=39, max length=72)
2017-10-25 12:24:58,948 - isovar.assembly:140 - INFO - Collapsed 28 -> 19 sequences
2017-10-25 12:24:58,954 - isovar.variant_sequences:440 - INFO - After overlap assembly: 3 variant sequences (min length=72, max length=72)
2017-10-25 12:24:58,955 - isovar.variant_sequences:207 - INFO - Coverage: [14 14 14 14 14 14 14 14 14 14 15 15 15 16 17 19 19 19 20 20 21 21 22 23 23
 24 25 25 25 25 26 27 27 28 28 28 28 28 28 27 27 27 27 26 25 25 25 25 25 24
 23 23 23 23 22 20 19 19 18 18 17 16 16 16 16 16 16 16 16 16 15 15] (len=72)
2017-10-25 12:24:58,955 - isovar.variant_sequences:207 - INFO - Coverage: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 4 5 5 5 5 6 7 7 8 8 8 8
 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7] (len=72)
2017-10-25 12:24:58,956 - isovar.variant_sequences:207 - INFO - Coverage: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] (len=72)
2017-10-25 12:24:58,956 - isovar.variant_sequences:330 - INFO - Kept 1/3 variant sequences after read coverage trimming to >=2x
2017-10-25 12:24:58,956 - isovar.variant_sequences:456 - INFO - After coverage & length filtering: 1 variant sequences (min length=72, max length=72)
2017-10-25 12:24:59,500 - pyensembl.sequence_data:95 - INFO - Loaded sequence dictionary from /home/sbrown/.cache/pyensembl/GRCh37/ensembl75/Homo_sapiens.GRCh37.75.cdna.all.fa.gz.pickle
2017-10-25 12:24:59,795 - pyensembl.sequence_data:95 - INFO - Loaded sequence dictionary from /home/sbrown/.cache/pyensembl/GRCh37/ensembl75/Homo_sapiens.GRCh37.75.pep.all.fa.gz.pickle
2017-10-25 12:24:59,859 - isovar.effect_prediction:65 - INFO - Predicted total 3 effects for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37')
2017-10-25 12:24:59,859 - isovar.effect_prediction:73 - INFO - Keeping 3/3 effects which affect protein coding sequence for Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37'): <EffectCollection with 3 elements>
  -- Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-001, transcript_id=ENST00000320095, effect_description=p.G87A)
  -- Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-005, transcript_id=ENST00000570778, effect_description=p.G5A)
  -- Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-003, transcript_id=ENST00000571814, effect_description=p.G5A)
2017-10-25 12:24:59,860 - isovar.effect_prediction:84 - INFO - Keeping 3 effects with predictable AA sequences for Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37'): [Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-001, transcript_id=ENST00000320095, effect_description=p.G87A), Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-005, transcript_id=ENST00000570778, effect_description=p.G5A), Substitution(variant=chr17 g.81042903G>C, transcript_name=METRNL-003, transcript_id=ENST00000571814, effect_description=p.G5A)]
2017-10-25 12:24:59,860 - isovar.reference_sequence_key:113 - INFO - Interbase offset range on METRNL-001 for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37') = 384:385
2017-10-25 12:24:59,860 - isovar.reference_sequence_key:113 - INFO - Interbase offset range on METRNL-005 for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37') = 120:121
2017-10-25 12:24:59,860 - isovar.reference_sequence_key:113 - INFO - Interbase offset range on METRNL-003 for variant Variant(contig='17', start=81042903, ref='G', alt='C', reference_name='GRCh37') = 954:955
2017-10-25 12:24:59,860 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED]', cdna_alt='C', cdna_suffix='[REDACTED]', reference_prefix='[REDACTED]', reference_suffix='[REDACTED]', n_trimmed=0
2017-10-25 12:24:59,861 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED][REDACTED]', offset_to_first_complete_codon=2, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED]', reference_cdna_sequence_after_variant='[REDACTED]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)
2017-10-25 12:24:59,861 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED]', cdna_alt='C', cdna_suffix='[REDACTED]', reference_prefix='[REDACTED]', reference_suffix='[REDACTED]', n_trimmed=0
2017-10-25 12:24:59,861 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED][REDACTED]', offset_to_first_complete_codon=23, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED]', reference_cdna_sequence_after_variant='[REDACTED]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)
2017-10-25 12:24:59,861 - isovar.protein_sequences:293 - INFO - TranslationKey(amino_acids='MYPTAALIVNLRPNTF', variant_aa_interval_start=4, variant_aa_interval_end=5, ends_with_stop_codon=False, frameshift=False): 29 alt reads supporting protein sequence (gene names = set([u'METRNL']))
2017-10-25 12:24:59,861 - isovar.protein_sequences:305 - INFO - TranslationKey(amino_acids='MYPTAALIVNLRPNTF', variant_aa_interval_start=4, variant_aa_interval_end=5, ends_with_stop_codon=False, frameshift=False): protein sequence = MYPTAALIVNLRPNTF
2017-10-25 12:24:59,861 - isovar.protein_sequences:293 - INFO - TranslationKey(amino_acids='AAGAVEWMYPTAALIVNLRPNTF', variant_aa_interval_start=11, variant_aa_interval_end=12, ends_with_stop_codon=False, frameshift=False): 29 alt reads supporting protein sequence (gene names = set([u'METRNL']))
2017-10-25 12:24:59,861 - isovar.protein_sequences:305 - INFO - TranslationKey(amino_acids='AAGAVEWMYPTAALIVNLRPNTF', variant_aa_interval_start=11, variant_aa_interval_end=12, ends_with_stop_codon=False, frameshift=False): protein sequence = AAGAVEWMYPTAALIVNLRPNTF

It does seem like both AAGAVEWMYPTAALIVNLRPNTF and MYPTAALIVNLRPNTF have the same number of reads, maybe there's no logic for when there's a tie in coverage?

@iskandr
Copy link
Contributor Author

iskandr commented Nov 27, 2017

The second problematic case discovered is due to #55 -- which should be fixed when we switch to interbase coordinates for gathering locus reads.

@iskandr
Copy link
Contributor Author

iskandr commented Nov 27, 2017

Possible fix: include protein sequence length as the 3rd sorting criterion in https://github.com/hammerlab/isovar/blob/master/isovar/protein_sequences.py#L164

@iskandr
Copy link
Contributor Author

iskandr commented Nov 27, 2017

@scottdbrown -- do you think including protein sequence length as part of the sorting criteria would fix your issue? Or, is it altogether unexpected for one of the returned sequences to be a subsequence of another?

@iskandr
Copy link
Contributor Author

iskandr commented Nov 27, 2017

@scottdbrown -- you redacted the cDNA sequence lengths for the two translation keys but do you mind posting those here?

@scottdbrown
Copy link

Here are the lengths of the redacted sections:

2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED 36 nts]', cdna_alt='C', cdna_suffix='[REDACTED 35 nts]', reference_prefix='[REDACTED 36 nts]', reference_suffix='[REDACTED 36 nts]', n_trimmed=0
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED 72 nts]', offset_to_first_complete_codon=2, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED 36 nts]', reference_cdna_sequence_after_variant='[REDACTED 36 nts]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:105 - INFO - cdna_predix='[REDACTED 36 nts]', cdna_alt='C', cdna_suffix='[REDACTED 35 nts]', reference_prefix='[REDACTED 36 nts]', reference_suffix='[REDACTED 36 nts]', n_trimmed=0
2017-11-27 11:25:43,425 - isovar.variant_sequence_in_reading_frame:354 - INFO - Iter #1/3: VariantSequenceInReadingFrame(cdna_sequence='[REDACTED 72 nts]', offset_to_first_complete_codon=23, variant_cdna_interval_start=36, variant_cdna_interval_end=37, reference_cdna_sequence_before_variant='[REDACTED 36 nts]', reference_cdna_sequence_after_variant='[REDACTED 36 nts]', number_mismatches_before_variant=0, number_mismatches_after_variant=0)

@iskandr -- Yes, I think sorting by length would fix the issue - the protein sequence that was output was the start of the shorter of the two transcripts, which was entirely contained within the longer transcript. The reads covered the region upstream of the shorter transcript (which is part of the longer transcript).

@iskandr
Copy link
Contributor Author

iskandr commented Nov 27, 2017 via email

@scottdbrown
Copy link

Sorry, due to privacy concerns with this sample, I'm not able to share any germline sequence.
Line 1 and 2 appear to be identical to line 3 and 4 (including the sequence), aside from the "offset_to_first_complete_codon" value.

@scottdbrown
Copy link

Is there any other information I can provide that would help? I apologize for the inconvenience of not being able to provide the actual sequence.

@scottdbrown
Copy link

To get around this issue, I manually created the mutation of interest in a non-protected sequence file, and have attached all relevant files for you to hopefully be able to recreate the issue.

PYTHONPATH="/projects/sbrown_prj/bin/isovar/" python2 /projects/sbrown_prj/bin/isovar/script/isovar-protein-sequences.py --vcf COLO829_17_81042903_snv.pvcf --bam COLO829_17_81042903_reads_1mut.bam --min-alt-rna-reads 2 --protein-sequence-length 23 --output COLO829_17_81042903_isovar_180111.csv > COLO829_17_81042903_isovar_stdout_180111.txt

Input and output files can be found here: https://github.com/scottdbrown/isovar_COLO829_test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants