Mash dist is binary!!! #123

Open
aperrin opened this issue Oct 8, 2019 · 8 comments

@aperrin
aperrin commented Oct 8, 2019

Hi!

I have a big problem with the latest Mash version (2.2).
When I calculate the distance matrix between 2 genomes, it returns a binary matrix (the p-value is either 0 or 1).
I tried with a fake genome (10 kbp), and it works (I get float values). But with real bacterial genomes, it returns a binary matrix.

I have attached the files I used for these tests. Here are the results I get with mash dist genome1.txt genome2.txt and so on:

genome1 and genome2: I get a distance of 0.0379382
genome3 and genome4: I get a distance of 1, whereas I should have 0.295981

genome1.txt
genome2.txt
genome3.txt
genome4.txt

@tseemann
Contributor

tseemann commented Oct 15, 2019

Why do you say "whereas I should have 0.295981"?
Perhaps try fastANI? https://github.com/ParBLiSS/FastANI

% seqkit stat genome?.txt
file         format  type  num_seqs    sum_len    min_len    avg_len    max_len
genome1.txt  FASTA   DNA          1      5,747      5,747      5,747      5,747
genome2.txt  FASTA   DNA          4     10,722          6    2,680.5      5,608
genome3.txt  FASTA   DNA          1  1,587,120  1,587,120  1,587,120  1,587,120
genome4.txt  FASTA   DNA         78  2,997,537        536     38,430    666,660

% mash triangle genome?.txt
	4
genome1.txt
genome2.txt	0.0379382
genome3.txt	1	1
genome4.txt	1	1	1
Max p-value: 1
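
For readers interpreting the numbers above: Mash derives its distance from the Jaccard estimate j (shared hashes divided by sketch size) as D = -(1/k) ln(2j / (1 + j)), and reports 1 when no hashes are shared. The sketch below is a minimal illustration, assuming the Mash defaults of k = 21 and a sketch size of 1000 were used (the thread does not state the parameters); the function names are hypothetical. Under those assumptions, a single shared hash out of 1000 gives 0.295981, the value quoted in the original report, and 0.0379382 corresponds to about 291 shared hashes.

```python
import math


def mash_distance(shared, sketch_size, k=21):
    """Mash distance from a shared-hash count (assumes default k = 21)."""
    if shared == 0:
        return 1.0  # no shared hashes: Mash reports a distance of 1
    j = shared / sketch_size  # Jaccard estimate
    if j == 1:
        return 0.0
    return -math.log(2 * j / (1 + j)) / k


def shared_hashes_for(distance, sketch_size=1000, k=21):
    """Invert the formula: shared-hash count implied by a finite distance."""
    x = math.exp(-k * distance)  # x equals 2j / (1 + j)
    j = x / (2 - x)
    return j * sketch_size


# Assumed parameters: k = 21, sketch size 1000 (Mash defaults).
print(f"{mash_distance(1, 1000):.6f}")       # one shared hash out of 1000
print(round(shared_hashes_for(0.0379382)))   # genome1 vs genome2
print(mash_distance(0, 1000))                # zero shared hashes
```

If that reading is right, the difference between 0.295981 and 1 comes down to one shared hash versus none, which is at the limit of what a 1000-hash sketch can resolve.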

@aperrin
Author

aperrin commented Oct 16, 2019

Indeed, I forgot this information. 0.295981 is the distance I obtained before updating Mash to this latest version. So maybe the "real distance" is not exactly that, but it should be close.

So, when you try with mash triangle (or mash sketch for the whole matrix), you obtain the same result as me: a distance between 2 small genomes, but 1 if at least 1 of the genomes is "big". With my previous version, I had distances for all pairs. Do you know why it doesn't work anymore?

Thanks!

@tseemann
Contributor

What version did you get the result you wanted on?
Are you using 2.2 or 2.2.1 for the "bad" result?

@aperrin
Author

aperrin commented Oct 28, 2019

I was using version 1.1.1 to get the results with "float numbers".

I updated to version 2.2 (the latest release), with which I get those binary distances.

@tseemann
Contributor

FYI 2.2.2 is the latest tag (https://github.com/marbl/Mash/tags), but not a 'release' per se. Nothing related to your issue has changed though. But it is faster at getting the possibly wrong answer :)

@aperrin
Author

aperrin commented Oct 29, 2019

OK. So there is no plan to fix this bug?

@tseemann
Contributor

@aperrin I am not the author of this package. I'm just a global GitHub citizen. Ask @ondovb. I'm not even convinced it's a bug. The v1 behaviour may be the bug. Your genomes are totally different in size. They should not have a good Mash distance in my opinion!

@aperrin
Author

aperrin commented Nov 4, 2019

OK! But I tried with 2 "big" genomes of the same size, and it does not work either, whereas with 2 "small" genomes of the same size it works.
Anyway, thanks a lot for your help!
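
One thing that may be worth checking for the "big" genome pairs is the sketch size (the `-s` option of `mash sketch`, 1000 by default). Because a distance of 1 is reported whenever zero hashes are shared, the largest finite distance a sketch can express is the one for a single shared hash, and that ceiling rises with sketch size. The snippet below is a rough sketch of that limit, assuming k = 21; the helper name is hypothetical, and whether a larger sketch actually changes the result for these genomes is not tested here.

```python
import math


def largest_finite_distance(sketch_size, k=21):
    """Distance for exactly one shared hash: the ceiling before Mash reports 1."""
    j = 1 / sketch_size
    return -math.log(2 * j / (1 + j)) / k


# Assumed k = 21 (Mash default); the sketch sizes are illustrative values for -s.
for s in (1000, 10000, 100000):
    print(s, round(largest_finite_distance(s), 4))
```

Pairs whose true distance sits near or above the ceiling for the chosen sketch size will tend to flip between that ceiling value and 1, regardless of whether the two genomes are similar in length.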
