Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

masking when averaging by window-length #24

Open
capoony opened this issue Jun 28, 2024 · 3 comments
Open

masking when averaging by window-length #24

capoony opened this issue Jun 28, 2024 · 3 comments

Comments

@capoony
Copy link

capoony commented Jun 28, 2024

Hi Lucas,

thanks for the great new features -praticulary the averaging options are very useful. However, I am wondering if you could also implement the possibility to mask when in averaging by window-length. As far as I understand this is not possible yet.

In our case, we have already called SNPs and want to average diversity stats in windows (by window-length) based on a SYNC file with SNPs only. In addition, we have BED files, which contain the regions that should be masked in the genomes.

When I am using --window-average-policy valid-loci in combination with --filter-mask-fasta ${PWD}/data/BED/${ID}.mask.gz I get a lot of missing positions, but (I guess) the correct number of masked sites per window.

When I am using --window-average-policy window-length in combination with --filter-mask-bed ${PWD}/data/BED/${ID}.bed.gz I get no (or a few??) missing positions, but also no masked sites per window.

Thanks, Martin

@lczech
Copy link
Owner

lczech commented Jul 22, 2024

Hi Martin,

thanks for your patience, finally circling back to working on grenedalf.

To make sure that I am understanding your suggestion correctly: Do you want to be able to provide a mask file that determines the denominator of each window? Such as: In a given window of length l, the mask file has x positions marked as masked out and y positions not masked, with x + y = l. Then, to compute the window average of a statistic, we simply divide the sum of per-site values of the statistic by y. In other words, a mask file that is taken as "ground truth" for the positions that shall and shall not be considered for the window averaging. That does indeed sound useful! I think that could be an additional type of mask, which would then only be used for the window averaging computation.

Would that be a solution to what you are thinking? If not, I am not understanding what exactly you would want to compute there - in that case, could you please clarify with an example?

Cheers
Lucas

@lczech
Copy link
Owner

lczech commented Aug 2, 2024

Hi Martin @capoony,

I just released grenedalf v0.6.0 which implements all of the above features. Let me know if this works for you, or if this does not solve your use case :-)

Cheers
Lucas

@lczech lczech closed this as completed Aug 2, 2024
@lczech
Copy link
Owner

lczech commented Aug 2, 2024

Hey Martin,

just to also answer your comments in a bit more detail. I think I might have been a bit quick to close this issue - reopening now, so that we can discuss this.

When I am using --window-average-policy valid-loci in combination with --filter-mask-fasta ${PWD}/data/BED/${ID}.mask.gz I get a lot of missing positions, but (I guess) the correct number of masked sites per window.

Yes, the missing positions are expected there, because when specifying a mask, we assume that this is to be taken as ground truth, and so in order to apply that, we need to fill in all positions for which there is no input as an empty missing position, to which then the mask can be applied. The alternative would be to keep the input without filling in the missing positions, but then the mask might specify positions as unmasked that are not there, and so they would still be ignored, despite the mask telling us that we should not.

Does that make sense? I could also deactivate that the missing positions are filled in when a mask is specified, and instead leave that to the user to decide (there is an option --make-gapless for that already anyway). Let me know what you think :-)

When I am using --window-average-policy window-length in combination with --filter-mask-bed ${PWD}/data/BED/${ID}.bed.gz I get no (or a few??) missing positions, but also no masked sites per window.

Interesting - you should see missing positions there as well, I think. Can you maybe test with the latest v0.6.0 again, and see if that changed? If not, can you maybe provide a minimal example for me to check?

As for the window average policy, I think that the valid-snps might be what you are looking for. If no other filters are applied, this will exactly take the positions of the mask as valid for the averaging.

Cheers
Lucas

@lczech lczech reopened this Aug 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants