In khc's current implementation, subject sequences may contain degenerate (a.k.a. ambiguous) bases, and the -j option sets the maximum number of variants that any single k-mer is allowed to generate.
For instance, the degenerate 5-mer RAGTY is matched by four 5-mers: AAGTC, AAGTT, GAGTC, GAGTT. In khc's implementation, all four k-mers are 'generated' and put in the k-mer index, each pointing at the location of RAGTY in its subject sequence.
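To make the expansion concrete, here is a minimal Python sketch of how a degenerate k-mer maps to its variants. It is an illustration only, not khc's actual code: the `IUPAC` table holds the standard nucleotide ambiguity codes, and `expand_kmer` is a hypothetical helper name.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_kmer(kmer):
    """Return every non-degenerate k-mer matched by a (possibly degenerate) k-mer.

    Hypothetical helper for illustration; not taken from khc.
    """
    choices = (IUPAC[base] for base in kmer.upper())
    return ["".join(variant) for variant in product(*choices)]

print(expand_kmer("RAGTY"))  # ['AAGTC', 'AAGTT', 'GAGTC', 'GAGTT']
```

In an index built this way, each of the four variants would be stored with the same subject location, namely that of RAGTY.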
The -j option is needed because the number of variants explodes quickly. For instance, five Ns (occurring anywhere in a k-mer) generate 4^5 = 1024 variants, and an all-N 15-mer would generate 4^15 variants, roughly a billion, which would fill every entry in the index if our k-size were 15. Such a k-mer would also be useless, because every k-mer in the query would hit it.
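The variant count can be computed without generating anything: it is the product of the number of bases each position can stand for. The sketch below illustrates that arithmetic; `DEGENERACY` and `count_variants` are names made up for this example and are not part of khc.

```python
from math import prod

# Number of concrete bases behind each IUPAC nucleotide code.
DEGENERACY = {
    **dict.fromkeys("ACGT", 1),
    **dict.fromkeys("RYSWKM", 2),
    **dict.fromkeys("BDHV", 3),
    "N": 4,
}

def count_variants(kmer):
    """Count the non-degenerate k-mers a k-mer expands to (hypothetical helper)."""
    return prod(DEGENERACY[base] for base in kmer.upper())

print(count_variants("RAGTY"))      # 4
print(count_variants("ACNGNTNNAN")) # 1024, five Ns anywhere in the k-mer
print(count_variants("N" * 15))     # 1073741824, i.e. 4**15
```

Checking the count first means a limit such as the one set by -j can be enforced before any memory is spent on the variants themselves.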
Currently, the -j option is set to 1024 by default, and the program aborts if it encounters a k-mer with more than 1024 variants. We could add an option to make the program continue and skip that k-mer instead (i.e. not add its variants to the index). We already have such an option (-s) for k-mers in the query sequence, but not for the subject sequences.
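The difference between the current abort behaviour and the proposed skip behaviour could look roughly like the sketch below. Everything here is an assumption made for illustration: `build_index`, its `max_variants` and `skip` parameters, and the `TooManyVariants` exception are hypothetical and do not reflect khc's internals or its command-line handling.

```python
from collections import defaultdict
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

class TooManyVariants(Exception):
    """Raised when a k-mer exceeds the variant limit and skipping is off."""

def build_index(subject, k, max_variants=1024, skip=False):
    """Index all k-mers of one subject sequence (hypothetical sketch).

    Returns (index, skipped): index maps each variant to the subject
    positions it came from; skipped lists positions left out of the index.
    """
    index = defaultdict(list)
    skipped = []
    for pos in range(len(subject) - k + 1):
        kmer = subject[pos:pos + k].upper()
        n_variants = 1
        for base in kmer:
            n_variants *= len(IUPAC[base])
        if n_variants > max_variants:
            if not skip:
                # Current behaviour as described above: give up entirely.
                raise TooManyVariants(f"k-mer at {pos} has {n_variants} variants")
            # Proposed behaviour: leave this k-mer out and carry on.
            skipped.append(pos)
            continue
        for variant in product(*(IUPAC[base] for base in kmer)):
            index["".join(variant)].append(pos)
    return index, skipped

index, skipped = build_index("ACGTNNNACGT", k=5, max_variants=16, skip=True)
print(skipped)  # [2, 3, 4]
```

Returning the skipped positions alongside the index keeps the information needed to account for them later, for example when computing coverage.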
The argument for ignoring k-mers above a certain number of variants (apart from the inefficiency of clogging the index) is that they are nonspecific anyway: the chance of them being (spuriously) hit goes up quickly with the number of variants. In the current version we must abort because MLST does not allow us to ignore any k-mer, however degenerate: it requires finding 100% identity matches, so every location in the subject must be represented in the index.
However, when not doing MLST (but e.g. resistance finding), we could add the option to skip any k-mers that generate too many variants. More generally, we would be dropping the less specific k-mers. If we then also 'deduct' their locations from the subject sequences, so that they do not count in the denominator of the coverage percentage, this could bring a net benefit.
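As a rough illustration of the 'deduct from the denominator' idea, the sketch below compares coverage with and without the deduction. It assumes, purely for the example, that coverage is the fraction of subject k-mer locations that were hit; how khc actually computes its coverage percentage is not shown here, and `coverage` and its parameters are hypothetical.

```python
def coverage(hit_locations, total_locations, skipped_locations=0):
    """Fraction of countable subject locations that were hit.

    Hypothetical sketch: skipped (too degenerate) locations are removed
    from the denominator, since they could never have been hit.
    """
    countable = total_locations - skipped_locations
    if countable <= 0:
        return 0.0
    return hit_locations / countable

# A subject with 100 k-mer locations, 10 of them skipped, 81 of the rest hit.
print(coverage(81, 100))      # 0.81 if skipped locations still count
print(coverage(81, 100, 10))  # 0.9 once they are deducted
```

Without the deduction, every skipped k-mer would count as a miss and lower the reported coverage of an otherwise well-matched subject.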