masking when averaging by window-length #24

capoony · 2024-06-28T10:36:03Z

Hi Lucas,

thanks for the great new features -praticulary the averaging options are very useful. However, I am wondering if you could also implement the possibility to mask when in averaging by window-length. As far as I understand this is not possible yet.

In our case, we have already called SNPs and want to average diversity stats in windows (by window-length) based on a SYNC file with SNPs only. In addition, we have BED files, which contain the regions that should be masked in the genomes.

When I am using --window-average-policy valid-loci in combination with --filter-mask-fasta ${PWD}/data/BED/${ID}.mask.gz I get a lot of missing positions, but (I guess) the correct number of masked sites per window.

When I am using --window-average-policy window-length in combination with --filter-mask-bed ${PWD}/data/BED/${ID}.bed.gz I get no (or a few??) missing positions, but also no masked sites per window.

Thanks, Martin

The text was updated successfully, but these errors were encountered:

lczech · 2024-07-22T10:22:36Z

Hi Martin,

thanks for your patience, finally circling back to working on grenedalf.

To make sure that I am understanding your suggestion correctly: Do you want to be able to provide a mask file that determines the denominator of each window? Such as: In a given window of length l, the mask file has x positions marked as masked out and y positions not masked, with x + y = l. Then, to compute the window average of a statistic, we simply divide the sum of per-site values of the statistic by y. In other words, a mask file that is taken as "ground truth" for the positions that shall and shall not be considered for the window averaging. That does indeed sound useful! I think that could be an additional type of mask, which would then only be used for the window averaging computation.

Would that be a solution to what you are thinking? If not, I am not understanding what exactly you would want to compute there - in that case, could you please clarify with an example?

Cheers
Lucas

lczech · 2024-08-02T01:12:17Z

Hi Martin @capoony,

I just released grenedalf v0.6.0 which implements all of the above features. Let me know if this works for you, or if this does not solve your use case :-)

Cheers
Lucas

lczech · 2024-08-02T16:37:04Z

Hey Martin,

just to also answer your comments in a bit more detail. I think I might have been a bit quick to close this issue - reopening now, so that we can discuss this.

When I am using --window-average-policy valid-loci in combination with --filter-mask-fasta ${PWD}/data/BED/${ID}.mask.gz I get a lot of missing positions, but (I guess) the correct number of masked sites per window.

Yes, the missing positions are expected there, because when specifying a mask, we assume that this is to be taken as ground truth, and so in order to apply that, we need to fill in all positions for which there is no input as an empty missing position, to which then the mask can be applied. The alternative would be to keep the input without filling in the missing positions, but then the mask might specify positions as unmasked that are not there, and so they would still be ignored, despite the mask telling us that we should not.

Does that make sense? I could also deactivate that the missing positions are filled in when a mask is specified, and instead leave that to the user to decide (there is an option --make-gapless for that already anyway). Let me know what you think :-)

When I am using --window-average-policy window-length in combination with --filter-mask-bed ${PWD}/data/BED/${ID}.bed.gz I get no (or a few??) missing positions, but also no masked sites per window.

Interesting - you should see missing positions there as well, I think. Can you maybe test with the latest v0.6.0 again, and see if that changed? If not, can you maybe provide a minimal example for me to check?

As for the window average policy, I think that the valid-snps might be what you are looking for. If no other filters are applied, this will exactly take the positions of the mask as valid for the averaging.

Cheers
Lucas

lczech mentioned this issue Jul 22, 2024

Masking sample-wise #25

Closed

lczech added a commit that referenced this issue Aug 1, 2024

Add window averaging with a provided mask #24

079c128

lczech closed this as completed Aug 2, 2024

lczech reopened this Aug 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

masking when averaging by window-length #24

masking when averaging by window-length #24

capoony commented Jun 28, 2024 •

edited by lczech

Loading

lczech commented Jul 22, 2024 •

edited

Loading

lczech commented Aug 2, 2024

lczech commented Aug 2, 2024

masking when averaging by window-length #24

masking when averaging by window-length #24

Comments

capoony commented Jun 28, 2024 • edited by lczech Loading

lczech commented Jul 22, 2024 • edited Loading

lczech commented Aug 2, 2024

lczech commented Aug 2, 2024

capoony commented Jun 28, 2024 •

edited by lczech

Loading

lczech commented Jul 22, 2024 •

edited

Loading