Skip subject kmers that exceed the max variants threshold #4

Open
zwets opened this issue Oct 10, 2018 · 0 comments

In khc's current implementation, the subject sequences can have degenerate (a.k.a. ambiguous) bases and the -j option specifies the maximum number of variants these are allowed to generate (per k-mer).

For instance, the degenerate 5-mer RAGTY is matched by four 5-mers: AAGTC, AAGTT, GAGTC, GAGTT. In khc's implementation, all four k-mers are 'generated' and put in the k-mer index, each pointing at the location of RAGTY in its subject sequence.
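To make the expansion concrete, here is a minimal Python sketch of generating the variants of a degenerate k-mer. It is not khc's code: the `IUPAC` table and the `expand_kmer` name are made up for illustration, and it uses the standard IUPAC nucleotide ambiguity codes.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_kmer(kmer):
    """Return every non-degenerate k-mer matched by a degenerate k-mer.

    Hypothetical helper for illustration; khc's real implementation may differ.
    """
    choices = (IUPAC[base] for base in kmer.upper())
    return ["".join(variant) for variant in product(*choices)]

print(expand_kmer("RAGTY"))  # ['AAGTC', 'AAGTT', 'GAGTC', 'GAGTT']
```

In an index, each of these four strings would become a key pointing at the same subject location.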

The -j option is needed because the number of variants explodes quickly. For instance, five Ns (occurring anywhere in any k-mer) generate 4^5 = 1024 variants, and an all-N 15-mer would generate 4^15, about a billion, variants, which would fill every entry in the index if our k-size were 15. Such a k-mer would also be totally useless, because every k-mer in the query would hit it.
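The variant count can be computed without generating anything: it is the product of the number of bases each position stands for. A small sketch (the `DEGENERACY` table and `variant_count` name are hypothetical, not taken from khc):

```python
from math import prod

# Number of concrete bases behind each IUPAC code.
DEGENERACY = {
    **dict.fromkeys("ACGT", 1),
    **dict.fromkeys("RYSWKM", 2),
    **dict.fromkeys("BDHV", 3),
    "N": 4,
}

def variant_count(kmer):
    """Number of non-degenerate k-mers a degenerate k-mer expands to."""
    return prod(DEGENERACY[base] for base in kmer.upper())

print(variant_count("RAGTY"))            # 4
print(variant_count("ACNGTNNAANCGTNA"))  # 1024 (five Ns, anywhere in the k-mer)
print(variant_count("N" * 15))           # 1073741824
```

A check like this is cheap enough to run on every k-mer before deciding whether to expand it.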

Currently, the -j option is set to 1024 by default, and the program will abort if it encounters a k-mer with more than 1024 variants. We could add an option to make the program continue and skip that k-mer (do not add its variants to the index). We currently have that option (-s) for k-mers in the query sequence, but not for the subject sequences.
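As a rough sketch of what the two behaviours could look like when building the index: everything below is illustrative Python, not khc's code. The names `build_index`, `max_variants`, `skip` and `TooManyVariants` are assumptions, with `max_variants` playing the role of -j and `skip` the role of the proposed option.

```python
from itertools import product
from math import prod

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

class TooManyVariants(Exception):
    """Raised when a subject k-mer exceeds the limit and skipping is off."""

def build_index(subjects, k, max_variants=1024, skip=False):
    """Index all variants of all subject k-mers (hypothetical sketch).

    Returns (index, skipped): index maps k-mer -> [(subject_id, pos)],
    skipped lists the (subject_id, pos) of k-mers that were left out.
    """
    index, skipped = {}, []
    for sid, seq in subjects.items():
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k].upper()
            n = prod(len(IUPAC[b]) for b in kmer)
            if n > max_variants:
                if not skip:
                    # Current behaviour: abort.
                    raise TooManyVariants(f"{sid}:{pos} {kmer} has {n} variants")
                # Proposed behaviour: remember the location and move on.
                skipped.append((sid, pos))
                continue
            for variant in product(*(IUPAC[b] for b in kmer)):
                index.setdefault("".join(variant), []).append((sid, pos))
    return index, skipped

index, skipped = build_index({"s1": "ACGTNNACGT"}, k=4, max_variants=4, skip=True)
print(skipped)        # [('s1', 2), ('s1', 3), ('s1', 4)]
print(index["ACGT"])  # [('s1', 0), ('s1', 6)]
```

Returning the skipped locations alongside the index keeps them available for any later bookkeeping.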

The argument for ignoring k-mers above a certain number of variants (apart from the inefficiency of clogging the index) is that they are aspecific anyway: the chance of them being (spuriously) hit goes up quickly. In the current version we must abort, because in MLST we cannot ignore any k-mers (however degenerate): MLST requires finding 100% identity matches, so all locations from the subject must be represented in the index.

However, when not doing MLST (but e.g. resistance finding), we could add the option to skip any k-mers that generate too many variants. More generally, we would be dropping the less specific k-mers. If we then also 'deduct' their locations from the subject sequences, so that they don't count in the denominator of the coverage percentage, this could bring a net benefit.
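One possible reading of the 'deduct' idea, sketched in Python. This assumes coverage is counted in k-mer positions per subject, which may not be how khc actually computes it; the `coverage` function and its parameters are hypothetical.

```python
def coverage(num_kmers, hit_positions, skipped_positions=()):
    """Fraction of countable subject k-mer positions that were hit.

    Assumption: skipped positions are removed from the denominator,
    and any hits reported on them are ignored as well.
    """
    skipped = set(skipped_positions)
    countable = num_kmers - len(skipped)
    if countable <= 0:
        return 0.0  # nothing left to cover
    hits = len(set(hit_positions) - skipped)
    return hits / countable

# Subject with 7 k-mer positions, 3 of them skipped as too degenerate.
print(coverage(7, hit_positions=[0, 1, 5, 6]))                               # 4/7
print(coverage(7, hit_positions=[0, 1, 5, 6], skipped_positions=[2, 3, 4]))  # 1.0
```

Without the deduction the subject can never reach full coverage once k-mers are skipped; with it, a query matching all remaining k-mers still scores 100%.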
