finding similar sequences #50

Open
sbdk82 opened this issue Mar 15, 2017 · 3 comments

sbdk82 commented Mar 15, 2017

I am trying to find sequences similar to a given query sequence. All sequences are assumed to be the same length, and I need to find the similar ones in a set of about a million sequences. How can I use Mash to achieve this?

ondovb (Member) commented Mar 15, 2017

Assuming you have the subject set in a multi-fasta:

mash sketch -i subjects.fna
mash sketch query.fna
mash dist subjects.fna.msh query.fna.msh | sort -gk3 > out

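Other sketching parameters (for example the k-mer size, -k, and the sketch size, -s) may be appropriate for your data, but -i is key: it sketches each sequence in the file individually rather than the file as a whole.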
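If the results are to be consumed from code rather than read by eye, the `out` file can be parsed directly. `mash dist` writes tab-delimited lines with five columns: reference ID, query ID, Mash distance, p-value, and matching hashes (e.g. `400/1000`). Below is a minimal C++ sketch under that assumption; `MashHit`, `parse_mash_dist_line`, and `closest_hits` are hypothetical names made up for illustration, not part of Mash.

```cpp
#include <algorithm>
#include <cstddef>
#include <exception>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One row of `mash dist` output (assumed: five tab-delimited columns).
// Hypothetical helper type, not part of Mash.
struct MashHit {
    std::string reference_id;
    std::string query_id;
    double distance = 1.0;
    double p_value = 1.0;
    std::string shared_hashes;  // kept as text, e.g. "400/1000"
};

// Parse one output line. Returns false if the line does not have five
// fields or if the distance / p-value fields are not numeric.
bool parse_mash_dist_line(const std::string& line, MashHit& hit) {
    std::istringstream in(line);
    std::string dist, pval;
    if (!std::getline(in, hit.reference_id, '\t')) return false;
    if (!std::getline(in, hit.query_id, '\t')) return false;
    if (!std::getline(in, dist, '\t')) return false;
    if (!std::getline(in, pval, '\t')) return false;
    if (!std::getline(in, hit.shared_hashes)) return false;
    try {
        hit.distance = std::stod(dist);
        hit.p_value = std::stod(pval);
    } catch (const std::exception&) {
        return false;
    }
    return true;
}

// Read all parseable rows from `in` and return the n rows with the
// smallest Mash distance, closest first. Malformed lines are skipped.
std::vector<MashHit> closest_hits(std::istream& in, std::size_t n) {
    std::vector<MashHit> hits;
    std::string line;
    while (std::getline(in, line)) {
        MashHit hit;
        if (parse_mash_dist_line(line, hit)) hits.push_back(hit);
    }
    std::stable_sort(hits.begin(), hits.end(),
                     [](const MashHit& a, const MashHit& b) {
                         return a.distance < b.distance;
                     });
    if (hits.size() > n) hits.resize(n);
    return hits;
}
```

Opening `out` with a `std::ifstream` and calling `closest_hits(file, 10)` would give the ten subjects nearest to the query. Since the function sorts by distance itself, the `sort -gk3` step becomes optional in that case.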

sbdk82 (Author) commented Mar 15, 2017

Thanks! I want to use the code from within my own program, i.e. generate the output in my C++ program instead of on the command line. Could you please guide me? The inputs are a subjects.txt file containing all the sequences (or non-genomic strings) and a query string.

ondovb (Member) commented Mar 15, 2017

Mash isn't officially encapsulated as a library, so this won't be completely straightforward, but it can be done with some copying and pasting. A good place to start is to copy and modify CommandDistance.cpp, which handles the I/O in run() and has some global functions for the comparisons. A real API is currently a wish-list item and is related to #49.
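For orientation before reading CommandDistance.cpp, here is roughly what the comparison computes: a bottom-s MinHash sketch per sequence, a Jaccard estimate j from two sketches, and the Mash distance D = -(1/k) ln(2j / (1 + j)). This is a from-scratch sketch, not code copied from Mash. All function names are hypothetical, `std::hash` stands in for the MurmurHash3 that Mash uses, and k-mers are hashed as-is with no reverse-complement canonicalization, an assumption that suits non-genomic strings.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Bottom-s MinHash sketch: the s smallest distinct k-mer hashes, ascending.
// Assumption: std::hash is used for brevity; it is NOT the hash Mash uses,
// so these sketches are not compatible with .msh files.
std::vector<std::uint64_t> make_sketch(const std::string& seq,
                                       std::size_t k, std::size_t s) {
    assert(k > 0 && s > 0);
    std::set<std::uint64_t> hashes;  // ordered and de-duplicated
    std::hash<std::string> hasher;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        hashes.insert(static_cast<std::uint64_t>(hasher(seq.substr(i, k))));
    }
    std::vector<std::uint64_t> sketch(hashes.begin(), hashes.end());
    if (sketch.size() > s) sketch.resize(s);
    return sketch;
}

// Estimate the Jaccard index by merging two sorted sketches and counting
// how many of the s smallest hashes in their union occur in both.
double jaccard_estimate(const std::vector<std::uint64_t>& a,
                        const std::vector<std::uint64_t>& b,
                        std::size_t s) {
    std::size_t i = 0, j = 0, seen = 0, shared = 0;
    while (seen < s && i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            ++shared;
            ++i;
            ++j;
        }
        ++seen;
    }
    // If one sketch ran out first, the leftovers of the other still count
    // toward the union, capped at the sketch size s.
    seen = std::min(s, seen + (a.size() - i) + (b.size() - j));
    return seen == 0 ? 0.0 : static_cast<double>(shared) / seen;
}

// Mash distance from a Jaccard estimate j and k-mer size k.
double mash_distance(double j, std::size_t k) {
    assert(k > 0);
    if (j <= 0.0) return 1.0;  // nothing shared: maximum distance
    if (j >= 1.0) return 0.0;  // identical sketches
    return -std::log(2.0 * j / (1.0 + j)) / static_cast<double>(k);
}
```

With the million subjects each sketched once up front, a query then costs one `make_sketch` call plus one `jaccard_estimate` per subject, which mirrors what `mash sketch -i` followed by `mash dist` does on the command line.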
