Merge pull request #8 from jhawkey/fix

Fix
jhawkey · Oct 7, 2015 · 1d95a0e · 1d95a0e
2 parents 75228d9 + 66da87c
commit 1d95a0e
Show file tree

Hide file tree

Showing 6 changed files with 160 additions and 88 deletions.
diff --git a/CHANGES.txt b/CHANGES.txt
@@ -1,3 +1,12 @@
+v0.1.4, Wed 7 Oct 2015
+    * can now supply output directory using --directory command
+    * can now ask BWA to use multiple threads using --t command
+    * fixed some cases where an unpaired hit would pair incorrectly with itself
+    * correct distances returned if the flanking gene should be the first or last
+      gene in the reference
+    * fixed an error where no hits reported but bed files are non-empty
+    * possible related IS call must now meet 50% ID and 50% coverage threshold
+      otherwise the hit is discarded
 v0.1.3, Mon 10 Aug 2015
     * fixed soft clipped read selection (Special thanks to Katie Cox for helping with this one)
     * bam files now deleted by default, can be kept using the --bam flag

diff --git a/README.md b/README.md
@@ -159,7 +159,7 @@ Once ISMapper has finished running, multiple output files will be generated, the
 
 ## Common Issues
 
-When running ISMapper on typing mode with a reference genome in Genbank format, check to see if all CDS/tRNA/rRNA features have the locus_tag qualifier. This is the default setting for ISMapper when choosing what to print in the final table. If the reference genome doesn't have locus_tag's, ISMapper will throw an error in the final table create step. This can be avoided by changing the setting of `--cds` (or `--trna` or `--rrna`) flags to db_xref gene product (or whatever other identifiers are present in your genbank file). 
+When running ISMapper on typing mode with a reference genome in Genbank format, check to see if all CDS/tRNA/rRNA features have the locus_tag qualifier. This is the default setting for ISMapper when choosing what to print in the final table. If the reference genome doesn't have locus_tag's, ISMapper will throw an error in the final table create step. This can be avoided by changing the setting of `--cds` (or `--trna` or `--rrna`) flags to db_xref gene product (or whatever other identifiers are present in your genbank file).
 
 ## Adavnced Options for ismap
 
@@ -173,7 +173,7 @@ When running ISMapper on typing mode with a reference genome in Genbank format,
 
 `--min_clip` and `--max_clip` are used to determine the size of soft clipped sections of reads that are including in the final mapping step. The default size is currently 10 bp and 30 bp, which is ideal for reads between 100 - 150bp. To improve the detection of the exact target site duplication size, it is sometimes helpful to increase the size of `--max_clip`. However large values of `--max_clip` can increase the amount of noise in the final mapping step and cause nonsensical results.
 
-`--a` and `--T` are flags that are passed to BWA. --a will turn on all alignment reporting in BWA, and --T is used to give an integer mapping score to BWA to determine what alignments are kept. These options may be useful in finding IS query positions that are next to repeated elements as BWA will report all hits for the read not just the best random hit. However, using these options may cause noise and confusion in the final output files.
+`--a`, `--T` and `--t` are flags that are passed to BWA. --a will turn on all alignment reporting in BWA, and --T is used to give an integer mapping score to BWA to determine what alignments are kept. These options may be useful in finding IS query positions that are next to repeated elements as BWA will report all hits for the read not just the best random hit. However, using these options may cause noise and confusion in the final output files. `--t` is used to supply more threads to BWA if required.
 
 `--cds`, `--trna` and `--rrna` are used to specify what qualifiers will be looked for in the reference genbank when determining genes flanking the IS query location. Defaults are locus_tag gene product.
 
@@ -183,9 +183,11 @@ When running ISMapper on typing mode with a reference genome in Genbank format,
 
 `--log` turns on the log file.
 
+`--directory` sets an output directory for the output files (defualt is the directory where ISMapper is being run).
+
 `--temp` turns on keeping the temporary files instead of deleting them once the run has completed.
 
-`--bam` turns on keeping the final sorted and indexed BAM files of flanking reads for comparison against the reference genome (by default these files are deleted to save disk space). 
+`--bam` turns on keeping the final sorted and indexed BAM files of flanking reads for comparison against the reference genome (by default these files are deleted to save disk space).
 
 
 ## Other options for compiled_table
@@ -198,7 +200,7 @@ When running ISMapper on typing mode with a reference genome in Genbank format,
 
 ## Running ISMapper without installing  
 
-ISMapper can be run directly from its directory without installing it via pip. To do so, ismap.py needs the path to the folder that contains all the scripts supplied to the argument `--path`. 
+ISMapper can be run directly from its directory without installing it via pip. To do so, ismap.py needs the path to the folder that contains all the scripts supplied to the argument `--path`.
 
 eg:  
 `ismap.py --reads x_1.fastq.gz x_2.fastq.gz --queries is_query.fasta --path /path/to/IS_mapper/scripts/`
@@ -211,7 +213,9 @@ eg:
 ## Running multiple jobs on a SLURM queing system
 
 There is no need to set the --output flag in other_args for this script, it sets it itself using the name of the readset.  
-Also, make sure the query sequence is already index with bwa (using `bwa mem query.fasta`) to prevent corruption of the index once multiple jobs start running.
+Also, make sure the query sequence is already index with bwa (using `bwa mem query.fasta`) to prevent corruption of the index once multiple jobs start running.  
+
+When running in improvement mode, ensure that --assemblies and --assembliesid is passed to the slurm script, but --extension and --type are passed to the ismap script through --other_args.
 
 ```
 python slurm_ismap.py -h
@@ -252,4 +256,4 @@ optional arguments:
   --other_args OTHER_ARGS
                         String containing all other arguments to pass to
                         ISMapper
-```
+```
diff --git a/scripts/compiled_table.py b/scripts/compiled_table.py
@@ -295,6 +295,17 @@ def findFeatureAfterPosition(features, isPosition, m):
     # If we are looking for the feature to the right of the
     # IS position, then either m or m+1 is our answer
 
+    # an index error will occur if m is the final feature, so just check that the first part is true
+    # and return m
+    try:
+        features[m+1]
+    except IndexError:
+        if features[m][0] > isPosition:
+            return features[m][2]
+        # otherwise we must be after the final position, so need to
+        # return the start position of the very first feature
+        else:
+            return features[0][2]
     # If the end of the m feature is before the IS position,
     # then m is before the IS and m+1 is the correct feature
     if features[m][1] < isPosition:
@@ -307,15 +318,22 @@ def findFeatureAfterPosition(features, isPosition, m):
     # then m will be closer to the IS and is the correct feature
     elif features[m][0] > isPosition and features[m+1][0] > isPosition:
         return features[m][2]
-
     else:
         return "3 - THIS SHOULDN'T HAPPEN!"
 
-def get_flanking_genes(features, feature_list, left, right, cds_features, trna_features, rrna_features):
+def get_flanking_genes(features, feature_list, left, right, cds_features, trna_features, rrna_features, genome_size):
 
     # Find the correct indexes
     left_feature_index = binary_search(feature_list, left, 'L')
     right_feature_index = binary_search(feature_list, right, 'R')
+    # Print out information if returning one of the error codes
+    if type(left_feature_index) != int or type(right_feature_index) != int:
+        print 'left index'
+        print left_feature_index
+        print 'right index'
+        print right_feature_index
+        print 'left position: ' + str(left)
+        print 'right position: ' + str(right)
     # Extract the SeqFeature object that corresponds to that index
     left_feature = features[left_feature_index]
     right_feature = features[right_feature_index]
@@ -331,6 +349,32 @@ def get_flanking_genes(features, feature_list, left, right, cds_features, trna_f
     left_dist = abs(max(left_feature.location.start, left_feature.location.end) - left)
     # The distance to the right gene is the startmost position of the feature - the right IS coord
     right_dist = abs(min(right_feature.location.start, right_feature.location.end) - right)
+
+    # Here is probably a good place to check if we've got a position that wraps around from start
+    # to end of the reference
+    # If we've got a distance that is close to the size of the reference, then we know we need to 
+    # alter it
+    if left_dist in range(int(round(genome_size * 0.9)), int(round(genome_size * 1.1))):
+        # The the left hand feature is at the end of the genome
+        # Distance from IS position to start of the genome is the
+        # position itself
+        dist_to_start = left
+        # Distance from the end of the final gene to the end of the
+        # genome
+        dist_to_end = abs(left_feature.location.end - genome_size)
+        # So the total distnace is those two added together
+        left_dist = dist_to_start + dist_to_end
+
+    elif right_dist in range(int(round(genome_size * 0.9)), int(round(genome_size * 1.1))):
+        # Then the right hand feature is at the start of the genome
+        # Distance from the IS position to the end of the genome
+        dist_to_end = abs(genome_size - right)
+        # Distance from the start of the genome to the start of the first feature
+        # is the start position of the first feature
+        dist_to_feature = right_feature.location.start
+        # So the total distance is those two added together
+        right_dist = dist_to_end + dist_to_feature
+
     # The first string in this values list is the main gene id (eg locus_tag)
     left_gene = [left_values[0], str(left_dist), left_values[1:]]
     right_gene = [right_values[0], str(right_dist), right_values[1:]]
@@ -528,10 +572,11 @@ def main():
             feature_count += 1
         else:
             feature_count += 1
-
+    # Sort the list just in case it's out of order (has caused issues in the past!!)
+    feature_list = sorted(feature_list, key=itemgetter(0))
     # Get flanking genes
     for pos in list_of_positions:
-        genes_before, genes_after =  get_flanking_genes(gb.features, feature_list, pos.x, pos.y, args.cds, args.trna, args.rrna)
+        genes_before, genes_after =  get_flanking_genes(gb.features, feature_list, pos.x, pos.y, args.cds, args.trna, args.rrna, len(gb.seq))
         pos.left_feature = genes_before
         pos.right_feature = genes_after