Skip to content

Commit

Permalink
Merge pull request #8 from jhawkey/fix
Browse files Browse the repository at this point in the history
Fix
  • Loading branch information
Jane Hawkey committed Oct 7, 2015
2 parents 75228d9 + 66da87c commit 1d95a0e
Show file tree
Hide file tree
Showing 6 changed files with 160 additions and 88 deletions.
9 changes: 9 additions & 0 deletions CHANGES.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,12 @@
v0.1.4, Wed 7 Oct 2015
* can now supply output directory using --directory command
* can now ask BWA to use multiple threads using --t command
* fixed some cases where an unpaired hit would pair incorrectly with itself
* correct distances returned if the flanking gene should be the first or last
gene in the reference
* fixed an error where no hits reported but bed files are non-empty
* possible related IS call must now meet 50% ID and 50% coverage threshold
otherwise the hit is discarded
v0.1.3, Mon 10 Aug 2015
* fixed soft clipped read selection (Special thanks to Katie Cox for helping with this one)
* bam files now deleted by default, can be kept using the --bam flag
Expand Down
16 changes: 10 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,7 @@ Once ISMapper has finished running, multiple output files will be generated, the

## Common Issues

When running ISMapper on typing mode with a reference genome in Genbank format, check to see if all CDS/tRNA/rRNA features have the locus_tag qualifier. This is the default setting for ISMapper when choosing what to print in the final table. If the reference genome doesn't have locus_tag's, ISMapper will throw an error in the final table create step. This can be avoided by changing the setting of `--cds` (or `--trna` or `--rrna`) flags to db_xref gene product (or whatever other identifiers are present in your genbank file).
When running ISMapper on typing mode with a reference genome in Genbank format, check to see if all CDS/tRNA/rRNA features have the locus_tag qualifier. This is the default setting for ISMapper when choosing what to print in the final table. If the reference genome doesn't have locus_tag's, ISMapper will throw an error in the final table create step. This can be avoided by changing the setting of `--cds` (or `--trna` or `--rrna`) flags to db_xref gene product (or whatever other identifiers are present in your genbank file).

## Adavnced Options for ismap

Expand All @@ -173,7 +173,7 @@ When running ISMapper on typing mode with a reference genome in Genbank format,

`--min_clip` and `--max_clip` are used to determine the size of soft clipped sections of reads that are including in the final mapping step. The default size is currently 10 bp and 30 bp, which is ideal for reads between 100 - 150bp. To improve the detection of the exact target site duplication size, it is sometimes helpful to increase the size of `--max_clip`. However large values of `--max_clip` can increase the amount of noise in the final mapping step and cause nonsensical results.

`--a` and `--T` are flags that are passed to BWA. --a will turn on all alignment reporting in BWA, and --T is used to give an integer mapping score to BWA to determine what alignments are kept. These options may be useful in finding IS query positions that are next to repeated elements as BWA will report all hits for the read not just the best random hit. However, using these options may cause noise and confusion in the final output files.
`--a`, `--T` and `--t` are flags that are passed to BWA. --a will turn on all alignment reporting in BWA, and --T is used to give an integer mapping score to BWA to determine what alignments are kept. These options may be useful in finding IS query positions that are next to repeated elements as BWA will report all hits for the read not just the best random hit. However, using these options may cause noise and confusion in the final output files. `--t` is used to supply more threads to BWA if required.

`--cds`, `--trna` and `--rrna` are used to specify what qualifiers will be looked for in the reference genbank when determining genes flanking the IS query location. Defaults are locus_tag gene product.

Expand All @@ -183,9 +183,11 @@ When running ISMapper on typing mode with a reference genome in Genbank format,

`--log` turns on the log file.

`--directory` sets an output directory for the output files (defualt is the directory where ISMapper is being run).

`--temp` turns on keeping the temporary files instead of deleting them once the run has completed.

`--bam` turns on keeping the final sorted and indexed BAM files of flanking reads for comparison against the reference genome (by default these files are deleted to save disk space).
`--bam` turns on keeping the final sorted and indexed BAM files of flanking reads for comparison against the reference genome (by default these files are deleted to save disk space).


## Other options for compiled_table
Expand All @@ -198,7 +200,7 @@ When running ISMapper on typing mode with a reference genome in Genbank format,

## Running ISMapper without installing

ISMapper can be run directly from its directory without installing it via pip. To do so, ismap.py needs the path to the folder that contains all the scripts supplied to the argument `--path`.
ISMapper can be run directly from its directory without installing it via pip. To do so, ismap.py needs the path to the folder that contains all the scripts supplied to the argument `--path`.

eg:
`ismap.py --reads x_1.fastq.gz x_2.fastq.gz --queries is_query.fasta --path /path/to/IS_mapper/scripts/`
Expand All @@ -211,7 +213,9 @@ eg:
## Running multiple jobs on a SLURM queing system

There is no need to set the --output flag in other_args for this script, it sets it itself using the name of the readset.
Also, make sure the query sequence is already index with bwa (using `bwa mem query.fasta`) to prevent corruption of the index once multiple jobs start running.
Also, make sure the query sequence is already index with bwa (using `bwa mem query.fasta`) to prevent corruption of the index once multiple jobs start running.

When running in improvement mode, ensure that --assemblies and --assembliesid is passed to the slurm script, but --extension and --type are passed to the ismap script through --other_args.

```
python slurm_ismap.py -h
Expand Down Expand Up @@ -252,4 +256,4 @@ optional arguments:
--other_args OTHER_ARGS
String containing all other arguments to pass to
ISMapper
```
```
53 changes: 49 additions & 4 deletions scripts/compiled_table.py
Original file line number Diff line number Diff line change
Expand Up @@ -295,6 +295,17 @@ def findFeatureAfterPosition(features, isPosition, m):
# If we are looking for the feature to the right of the
# IS position, then either m or m+1 is our answer

# an index error will occur if m is the final feature, so just check that the first part is true
# and return m
try:
features[m+1]
except IndexError:
if features[m][0] > isPosition:
return features[m][2]
# otherwise we must be after the final position, so need to
# return the start position of the very first feature
else:
return features[0][2]
# If the end of the m feature is before the IS position,
# then m is before the IS and m+1 is the correct feature
if features[m][1] < isPosition:
Expand All @@ -307,15 +318,22 @@ def findFeatureAfterPosition(features, isPosition, m):
# then m will be closer to the IS and is the correct feature
elif features[m][0] > isPosition and features[m+1][0] > isPosition:
return features[m][2]

else:
return "3 - THIS SHOULDN'T HAPPEN!"

def get_flanking_genes(features, feature_list, left, right, cds_features, trna_features, rrna_features):
def get_flanking_genes(features, feature_list, left, right, cds_features, trna_features, rrna_features, genome_size):

# Find the correct indexes
left_feature_index = binary_search(feature_list, left, 'L')
right_feature_index = binary_search(feature_list, right, 'R')
# Print out information if returning one of the error codes
if type(left_feature_index) != int or type(right_feature_index) != int:
print 'left index'
print left_feature_index
print 'right index'
print right_feature_index
print 'left position: ' + str(left)
print 'right position: ' + str(right)
# Extract the SeqFeature object that corresponds to that index
left_feature = features[left_feature_index]
right_feature = features[right_feature_index]
Expand All @@ -331,6 +349,32 @@ def get_flanking_genes(features, feature_list, left, right, cds_features, trna_f
left_dist = abs(max(left_feature.location.start, left_feature.location.end) - left)
# The distance to the right gene is the startmost position of the feature - the right IS coord
right_dist = abs(min(right_feature.location.start, right_feature.location.end) - right)

# Here is probably a good place to check if we've got a position that wraps around from start
# to end of the reference
# If we've got a distance that is close to the size of the reference, then we know we need to
# alter it
if left_dist in range(int(round(genome_size * 0.9)), int(round(genome_size * 1.1))):
# The the left hand feature is at the end of the genome
# Distance from IS position to start of the genome is the
# position itself
dist_to_start = left
# Distance from the end of the final gene to the end of the
# genome
dist_to_end = abs(left_feature.location.end - genome_size)
# So the total distnace is those two added together
left_dist = dist_to_start + dist_to_end

elif right_dist in range(int(round(genome_size * 0.9)), int(round(genome_size * 1.1))):
# Then the right hand feature is at the start of the genome
# Distance from the IS position to the end of the genome
dist_to_end = abs(genome_size - right)
# Distance from the start of the genome to the start of the first feature
# is the start position of the first feature
dist_to_feature = right_feature.location.start
# So the total distance is those two added together
right_dist = dist_to_end + dist_to_feature

# The first string in this values list is the main gene id (eg locus_tag)
left_gene = [left_values[0], str(left_dist), left_values[1:]]
right_gene = [right_values[0], str(right_dist), right_values[1:]]
Expand Down Expand Up @@ -528,10 +572,11 @@ def main():
feature_count += 1
else:
feature_count += 1

# Sort the list just in case it's out of order (has caused issues in the past!!)
feature_list = sorted(feature_list, key=itemgetter(0))
# Get flanking genes
for pos in list_of_positions:
genes_before, genes_after = get_flanking_genes(gb.features, feature_list, pos.x, pos.y, args.cds, args.trna, args.rrna)
genes_before, genes_after = get_flanking_genes(gb.features, feature_list, pos.x, pos.y, args.cds, args.trna, args.rrna, len(gb.seq))
pos.left_feature = genes_before
pos.right_feature = genes_after

Expand Down
Loading

0 comments on commit 1d95a0e

Please sign in to comment.