Two questions related to the Fig1 in the paper #48

Open
caojy-sys opened this issue Sep 8, 2024 · 4 comments


@caojy-sys

Hi! I have two questions about Fig 1 in the paper.
The first one is about the indexing part. In Block 1, the number of columns is n-1. What exactly does this n-1 refer to? Does the n-1 (number of columns) change depending on different blocks?
The other question is about the filtering step in the Profiling part. Why does KMCP need three filtering steps? Why not use only the second filtering step (the most rigorous round), so that KMCP has a single round of filtering?
These are the questions that I am concerned about. Hope you can reply as soon as possible. Thank you!

@shenwei356
Owner

In Block 1, the number of columns is n-1. What exactly does this n-1 refer to?

  • n is the number of rows, not columns.
  • Each column is a Bloom filter with a size of n bits.
  • The figure marks each Bloom filter element with a 0-based index: 0, 1, 2, 3, ..., n-1. That is because when you compute hash % n (mod), the result always falls in the range [0, n-1].
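To make the 0-based indexing concrete, here is a minimal sketch of how a k-mer could be mapped to a position in a Bloom filter of n bits. The hash function (`blake2b`), the helper name `bit_index`, and the toy k-mers are assumptions for illustration only; this is not KMCP's actual hashing code.

```python
import hashlib

def bit_index(kmer: str, n: int) -> int:
    """Map a k-mer to a 0-based position in a Bloom filter of n bits."""
    # blake2b is an arbitrary stand-in hash chosen for this sketch.
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    # hash % n is always in the range [0, n-1].
    return int.from_bytes(digest, "big") % n

n = 16                      # hypothetical filter size in bits
column = [0] * n            # one column of the figure = one Bloom filter
for kmer in ["ACGTA", "CGTAC", "GTACG"]:
    column[bit_index(kmer, n)] = 1
```

Whatever the hash value is, the modulo keeps the index inside the n positions of the column, which is why the last label in the figure is n-1.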

Does the n-1 (number of columns) change depending on different blocks?

Yes. The size of bloom filters is determined by the expected false-positive rate and the length of the largest sequence in a block.
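As a rough illustration of how a target false-positive rate and an element count translate into a filter size, the sketch below uses the standard Bloom filter approximation p ≈ (1 - e^(-kN/m))^k solved for m. Whether KMCP uses exactly this formula is an assumption; the function name and parameters are hypothetical.

```python
import math

def bloom_filter_bits(num_elements: int, fpr: float, num_hashes: int = 1) -> int:
    """Smallest bit count m with (1 - exp(-k*N/m))**k <= fpr (textbook approximation)."""
    if num_elements <= 0 or num_hashes <= 0 or not 0 < fpr < 1:
        raise ValueError("need num_elements > 0, num_hashes > 0 and 0 < fpr < 1")
    # Solve p = (1 - exp(-k*N/m))**k for m.
    return math.ceil(-num_hashes * num_elements / math.log(1 - fpr ** (1 / num_hashes)))

small_block = bloom_filter_bits(1_000, fpr=0.3)
large_block = bloom_filter_bits(5_000, fpr=0.3)
```

Under this approximation, a block whose largest member holds more elements needs a longer filter at the same false-positive rate, so n can differ from block to block.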

Why not use only the second filtering step (the most rigorous round), so that KMCP has a single round of filtering?

  • The second filtering step is too strict, which would miss some species.
  • After a round of filtering, some multi-matched reads are assigned to fewer species (even just one), which turns them into uniquely matched reads. An extra step is therefore needed to make use of this new information.

@caojy-sys
Author

Sorry for the misunderstanding regarding "rows" and "columns," thank you so much!

@caojy-sys
Author

Sorry, I'm still confused about a few things. In the three-round filtering step, why does KMCP need the first and third filtering rounds? Since the second round is the strictest one, it seems the first and third rounds are not actually needed.
Next, what's the difference between Block 1 and Block B? Why are the lengths of R1-c1 and R4-c1 different? Are Block 1 and Block B independent of each other? Can we see Block 1 as a matrix?

@shenwei356
Owner

In the three-round filtering step, why does KMCP need the first and third filtering rounds? Since the second round is the strictest one, it seems the first and third rounds are not actually needed.

After a round of filtering, some multi-matched reads are assigned to fewer species (even just one, which makes them uniquely matched reads), so another round is needed to recompute the detected reference genomes from the updated statistics. However, there is no need for many rounds; two or three are enough. As with the EM algorithm, each round of filtering improves the result, but the improvement soon plateaus.
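The effect can be shown with a toy model. In the sketch below, a reference is kept only if enough reads match it uniquely, and every read is then restricted to the surviving references. The criterion, the thresholds, the read and reference names, and the helper functions are all hypothetical; KMCP's real filtering criteria are more involved and are not reproduced here.

```python
from collections import Counter

def filter_round(read_hits, min_unique_reads):
    """One toy filtering round over {read id: set of candidate references}."""
    # Count, per reference, the reads that match it and nothing else.
    unique_counts = Counter(
        next(iter(refs)) for refs in read_hits.values() if len(refs) == 1
    )
    kept = {ref for ref, count in unique_counts.items() if count >= min_unique_reads}
    # Restrict each read to surviving references; drop reads left with none.
    return {read: refs & kept for read, refs in read_hits.items() if refs & kept}

def references(read_hits):
    """All references that still have at least one read."""
    return set().union(*read_hits.values())

hits = {
    "r1": {"A"}, "r2": {"A"}, "r3": {"A", "B"},
    "r4": {"B"}, "r5": {"B", "D"}, "r6": {"B", "D"},
}

strict_only = filter_round(hits, min_unique_reads=2)         # only "A" survives
loose_then_strict = filter_round(filter_round(hits, 1), 2)   # "A" and "B" survive
third_round = filter_round(loose_then_strict, 2)             # nothing changes any more
```

Applying the strict threshold directly loses reference B, because it has only one uniquely matched read at that point. A looser first round removes D, which turns r5 and r6 into uniquely matched reads for B, so B then passes the strict round. A further round changes nothing in this toy data, which mirrors the plateau described above.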

In R3, the grey diamond criterion is not used because it is slow and, according to my observation, not needed.

Next, what's the difference between Block 1 and Block B?

Each block contains data from different genome chunks. In the figure, chunks of a genome are next to each other for simplicity. Actually, k-mers of all chunks are sorted, and divided into multiple blocks. You can also read the COBS paper.

Why should the lengths of R1-c1 and R4-c1 be different?

They might be different or the same.
The size of the Bloom filters in a block is determined by the FPR and the largest number of k-mers among the chunks in that block.
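A small sketch of why two chunks can end up with filters of different lengths. It assumes that "sorted" means chunks are ordered by their number of k-mers before being cut into blocks (as in COBS), and it uses the textbook single-hash approximation for the filter size. The k-mer counts, the block size, and the function name are made up for illustration.

```python
import math

def split_into_blocks(kmer_counts, block_size, fpr):
    """Group chunks into blocks and size each block's Bloom filters."""
    # Assumption: chunks are ordered by their number of k-mers.
    ordered = sorted(kmer_counts, key=kmer_counts.get)
    blocks = []
    for start in range(0, len(ordered), block_size):
        names = ordered[start:start + block_size]
        largest = max(kmer_counts[name] for name in names)
        # Single-hash approximation: fpr = 1 - exp(-N / n)  =>  n = -N / ln(1 - fpr)
        n_bits = math.ceil(-largest / math.log(1 - fpr))
        blocks.append((names, n_bits))
    return blocks

# Hypothetical k-mer counts per genome chunk.
counts = {"R1-c1": 1_000, "R1-c2": 1_100, "R4-c1": 5_000, "R4-c2": 5_200}
blocks = split_into_blocks(counts, block_size=2, fpr=0.3)
```

With these made-up numbers, R1-c1 and R4-c1 land in different blocks and get different filter lengths. Had they fallen into the same block, they would share one length, which matches "different or the same".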

Are the Block 1 and Block B independent of each other?

Yes.

Can we see Block 1 as a matrix?

It is.
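Viewed as a matrix, a block has n rows (bit positions) and one column per Bloom filter. The sketch below builds such a matrix with a single hash function and looks up a k-mer by reading one row. The hash function, helper names, and toy k-mers are assumptions for illustration, not KMCP's implementation.

```python
import hashlib

def bit_index(kmer: str, n: int) -> int:
    """0-based row for a k-mer in filters of n bits (stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n

def build_block(chunks, n):
    """Return an n x len(chunks) bit matrix; column j is the filter of chunk j."""
    matrix = [[0] * len(chunks) for _ in range(n)]
    for col, kmers in enumerate(chunks):
        for kmer in kmers:
            matrix[bit_index(kmer, n)][col] = 1
    return matrix

def query(matrix, kmer):
    """Read the single row for this k-mer: one bit per chunk in the block."""
    return matrix[bit_index(kmer, len(matrix))]

chunks = [["ACGTA", "CGTAC"], ["GGGCC", "TTTAA", "ACGTA"]]  # toy k-mer sets
block = build_block(chunks, n=32)
row = query(block, "ACGTA")   # bits for chunk 0 and chunk 1
```

Because all filters in a block share the same length, a k-mer hashes to the same row in every column, so a single row read answers the membership question for all chunks in that block at once. A set bit can still be a false positive.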
