finding similar sequences #50

sbdk82 · 2017-03-15T14:43:24Z

I am trying to find sequences similar to a given query sequence. All sequences are assumed to be the same length, and I need to search a set of a million sequences. How can I use Mash to achieve this?

ondovb · 2017-03-15T17:05:15Z

Assuming you have the subject set in a multi-fasta:

mash sketch -i subjects.fna
mash sketch query.fna
mash dist subjects.fna.msh query.fna.msh | sort -gk3 > out

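Other sketching parameters (for example, k-mer size -k and sketch size -s) may be appropriate for your data, but the -i is key: it sketches each sequence in the multi-fasta individually instead of sketching the whole file as one.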

sbdk82 · 2017-03-15T17:19:58Z

Thanks! I want to use the code inside my own C++ program, i.e. generate the output programmatically instead of using the command line. Could you please guide me? The inputs are a subjects.txt file containing all sequences (or non-genomic strings) and a query string.

ondovb · 2017-03-15T17:37:48Z

Mash isn't officially encapsulated as a library, so that won't be completely straightforward, but it can be done with some copying and pasting. A good place to start would be copying and modifying CommandDistance.cpp, which handles the I/O in run() and has some global functions for the comparisons. A real API is currently a wish-list item, and is related to #49.
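Until a real API exists, one possible stopgap is to run the mash executable from C++ and parse what it prints. The block below is only a sketch of that workaround, not Mash's own code: `MashHit`, `parseDistLine`, and `runMashDist` are hypothetical names, and it assumes a POSIX system (for `popen`), `mash` on the PATH, sketches already built with `mash sketch`, and the documented tab-delimited `mash dist` output (reference ID, query ID, distance, p-value, matching hashes).

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One row of `mash dist` output. All names here are illustrative, not part of Mash.
struct MashHit {
    std::string reference;     // ID of the subject sequence
    std::string query;         // ID of the query sequence
    double distance;           // Mash distance (0 = identical sketches)
    double pValue;             // p-value reported by mash
    std::string sharedHashes;  // matching hashes, e.g. "850/1000"
};

// Parse one tab-delimited line (without trailing newline); throws if malformed.
MashHit parseDistLine(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string field;
    while (std::getline(in, field, '\t')) fields.push_back(field);
    if (fields.size() != 5)
        throw std::runtime_error("expected 5 tab-separated fields: " + line);
    // strtod (unlike std::stod) does not throw on very small p-values.
    return MashHit{fields[0], fields[1],
                   std::strtod(fields[2].c_str(), nullptr),
                   std::strtod(fields[3].c_str(), nullptr),
                   fields[4]};
}

// Run `mash dist` on two sketch files and return hits sorted by distance.
// Assumption: the paths contain no characters that need shell quoting.
std::vector<MashHit> runMashDist(const std::string& subjectsSketch,
                                 const std::string& querySketch) {
    const std::string cmd = "mash dist " + subjectsSketch + " " + querySketch;
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("could not start mash");

    std::vector<MashHit> hits;
    std::string line;
    char buf[4096];
    while (std::fgets(buf, sizeof buf, pipe)) {
        line += buf;  // a long line may arrive in several chunks
        if (!line.empty() && line.back() == '\n') {
            line.pop_back();
            hits.push_back(parseDistLine(line));
            line.clear();
        }
    }
    if (!line.empty()) hits.push_back(parseDistLine(line));  // last line without newline
    pclose(pipe);

    std::sort(hits.begin(), hits.end(),
              [](const MashHit& a, const MashHit& b) { return a.distance < b.distance; });
    return hits;
}
```

Calling `runMashDist("subjects.fna.msh", "query.fna.msh")` would then give roughly what the `sort -gk3` pipeline above writes to `out`. This still launches the command-line tool under the hood, so it does not replace adapting CommandDistance.cpp if in-process comparison is required.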
