higher than expected coverage in output bam #6

Open
rhalperin opened this issue May 31, 2022 · 1 comment

@rhalperin

I ran bamsifter with -c 20, but coverage in my output BAM generally peaks at around 50-60 reads. Here is an IGV screenshot; the top track is the bamsifter output and the bottom track is the original BAM.
[image: IGV screenshot]

Is this what you would expect the output to look like?

@GeorgescuC
Collaborator

Hi @rhalperin ,

Bamsifter tries to keep coverage at the target number n of reads wherever enough reads are available, without going over, but there are several reasons why the coverage can often exceed the target.
Generally, the first n reads of a covered region are selected automatically; then, when one of those reads ends, a new one is selected to keep the coverage high enough. However, when some of the selected reads do not align to part of the reference, for example because of splicing or deletions (as identified in the CIGAR string), the coverage at a given base pair can drop below the target n. In that case we select more reads until we reach the threshold again, and those extra reads add coverage on top of the target wherever the earlier reads do align.
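As a rough illustration of the spliced-read case, here is a toy Python model. It is not bamsifter's code: the selection rule (keep a read when coverage at its first aligned base is below the target), the function name `select_reads`, and the read coordinates are all assumptions made up for the example.

```python
from collections import Counter


def select_reads(reads, target):
    """Toy one-pass selection (assumed rule, not bamsifter's actual logic).

    reads: list of (name, blocks), sorted by start position, where blocks
    is a list of half-open (start, end) intervals the read aligns to.
    A read is kept if coverage at its first aligned base is below target.
    """
    coverage = Counter()  # per-base coverage from the kept reads only
    selected = []
    for name, blocks in reads:
        first_base = blocks[0][0]
        if coverage[first_base] < target:
            selected.append(name)
            for block_start, block_end in blocks:
                for pos in range(block_start, block_end):
                    coverage[pos] += 1
    return selected, coverage


# Hypothetical reads: three spliced reads skip 10-50 (an intron),
# three unspliced reads start inside that gap and run into the second exon.
reads = [
    ("spliced_1", [(0, 10), (50, 60)]),
    ("spliced_2", [(0, 10), (50, 60)]),
    ("spliced_3", [(0, 10), (50, 60)]),
    ("unspliced_1", [(20, 60)]),
    ("unspliced_2", [(20, 60)]),
    ("unspliced_3", [(20, 60)]),
]

selected, coverage = select_reads(reads, target=2)
print(selected)        # the two spliced and two unspliced reads kept
print(coverage[30])    # coverage inside the gap
print(coverage[55])    # coverage where spliced and unspliced reads overlap
```

With a target of 2, the gap is topped up to 2 by the unspliced reads, but positions 50-59 end up at 4 because the reads kept to fill the gap also overlap the second aligned block of the spliced reads.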
With paired-end reads, each selected read also automatically selects its mate, so that we keep as much relevant information as possible. The flip side is that if we have already reached the target coverage n and then encounter the mate of an already selected read, we keep it and go over the target.
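The mate effect can be sketched the same way. Again this is only a toy model under assumptions: the function `select_with_mates`, the rule that a record is kept if its name was already kept, and the coordinates are invented for illustration and are not taken from bamsifter's source.

```python
from collections import Counter


def select_with_mates(records, target):
    """Toy one-pass selection that always keeps the mate of a kept read.

    records: list of (name, start, end) in coordinate order; both mates
    of a pair share the same name. Intervals are half-open.
    """
    coverage = Counter()  # per-base coverage from the kept records only
    kept_names = set()
    kept = []
    for name, start, end in records:
        mate_already_kept = name in kept_names
        if mate_already_kept or coverage[start] < target:
            kept_names.add(name)
            kept.append((name, start))
            for pos in range(start, end):
                coverage[pos] += 1
    return kept, coverage


# Hypothetical pairs: A and B have their second mates landing at 120,
# where pairs C and D have already filled the target coverage of 2.
records = [
    ("pairA", 0, 50),
    ("pairB", 0, 50),
    ("pairC", 100, 150),
    ("pairD", 100, 150),
    ("pairA", 120, 170),
    ("pairB", 120, 170),
    ("pairC", 300, 350),
    ("pairD", 300, 350),
]

kept, coverage = select_with_mates(records, target=2)
print(len(kept))       # every record is kept
print(coverage[130])   # coverage where the forced mates pile up
```

Here positions 120-149 already sit at the target of 2 when the mates of pairA and pairB arrive, and keeping them anyway raises the coverage there to 4.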
There is also an option (disabled by default) to keep all chimeric reads, which is useful if you are looking for fusion transcripts and can also contribute to this.
For efficiency, the input BAM is read through only once when selecting reads, then once more when copying them, so we don't go back and try to correct oversampling.

It is, however, possible that the higher coverage you see is not explained by any of these reasons. In that case I can take a look at the specific issue if you can share an example BAM (for example, the specific region from your screenshot).

Regards,
Christophe.
