G statistic request #35

Open
plunkert opened this issue Oct 16, 2024 · 2 comments

Comments

@plunkert

Hello! I'm wondering if you would consider implementing the G statistic as in Magwene et al. 2011 (attached)? It's widely used on pool-seq data to find loci underlying trait variation in both laboratory crosses and natural accessions (e.g. Gould et al., 2017, attached). It also includes read depth information so as to account for variation in the uncertainty of allele frequency estimates across the genome. I'm not an expert in these statistics and I don't know if the pool-sequencing corrected Fst from grenedalf accomplishes the same thing.

To calculate the G statistic for my current work, I've been calling variants with SNAPE-pooled, running Billie Gould's script from her paper (https://bitbucket.org/billiegould/genomics_tools/src/master/SNAPEtools/G_calcSNAPE.py), and then subsetting in R to filter by read depth criteria. SNAPE-pooled isn't being maintained anymore and needs to run on one chromosome at a time, so I'd love to have fewer steps to string together when calculating the G statistic.
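
For reference, the workflow above (a per-site G statistic followed by a read depth filter) can be sketched as follows. This is a minimal sketch, assuming G is the standard G-test statistic on a 2x2 table of allele read counts (two pools by two alleles), with expected counts taken from the row and column totals. It is not Billie Gould's script and not grenedalf code; the function names, the input tuple layout, and the depth thresholds are hypothetical and chosen only for illustration.

```python
import math


def g_statistic(ref_a, alt_a, ref_b, alt_b):
    """G-test statistic for a 2x2 table of allele read counts in two pools."""
    observed = [ref_a, alt_a, ref_b, alt_b]
    total = sum(observed)
    if total == 0:
        return float("nan")  # no reads at this site in either pool

    # Row totals are the per-pool depths, column totals the per-allele counts.
    depth_a, depth_b = ref_a + alt_a, ref_b + alt_b
    n_ref, n_alt = ref_a + ref_b, alt_a + alt_b

    # Expected counts under the null of equal allele frequencies in both pools.
    expected = [
        depth_a * n_ref / total,
        depth_a * n_alt / total,
        depth_b * n_ref / total,
        depth_b * n_alt / total,
    ]

    # Cells with zero observed reads contribute nothing to the sum.
    g = sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return 2.0 * g


def g_for_sites(sites, min_depth=10, max_depth=500):
    """Compute G per site, keeping only sites whose depth in BOTH pools
    lies within [min_depth, max_depth] (hypothetical thresholds)."""
    results = []
    for chrom, pos, ref_a, alt_a, ref_b, alt_b in sites:
        depth_a, depth_b = ref_a + alt_a, ref_b + alt_b
        if not (min_depth <= depth_a <= max_depth):
            continue
        if not (min_depth <= depth_b <= max_depth):
            continue
        results.append((chrom, pos, g_statistic(ref_a, alt_a, ref_b, alt_b)))
    return results
```

Because the statistic is built from raw read counts rather than frequencies, a site with higher coverage yields a larger G for the same frequency difference, which is how read depth enters the calculation.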

Magwene 2011.pdf
Gould - Molecular Ecology - 2016 - Pooled ecotype sequencing reveals candidate genetic mechanisms for adaptive.pdf

Thanks so much for your time! Of course I completely understand if you can't prioritize the G statistic, but I appreciate your consideration. Please let me know if there's any other information I can provide.

Best,
Madison

@lczech
Owner

lczech commented Oct 21, 2024

Hi Madison,

Thanks for the suggestion, that indeed seems interesting and relevant! I had a look at the original manuscript you shared, as well as the Python source, and it seems that this would fit well into grenedalf. However, we'd need to do a more thorough evaluation first. For instance, it is unclear to me where the pool size (or number of individuals in the bulk) is given to the script, although it seems to be needed for computing G (or G'; at first glance I am not sure which one you want).

However, I recently changed positions and am working mainly on other topics these days, so unfortunately it is a bit hard for me to find the time and justification to work on this. If you or someone in your group (Lowry lab, if I see that correctly?) is willing to collaborate on figuring out the statistics and other questions that might pop up, I could help get the code into grenedalf.

As for SNAPE-pooled: Is that program actually being used in practice? I knew of it, but it seemed so dysfunctional that I did not think it was useful any more. I just had another look at it and could not even run it on its own example file (the program did not terminate after waiting for a while, for a file with 6 positions). If their approach is being used though, it might be another addition to grenedalf that I would consider adding (at some point...).

Cheers and so long
Lucas

@plunkert
Author

Hi Lucas,

Billie implemented G and I've been using that, but it would be worth considering G'. I could ask the authors about it.
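
To illustrate the difference between the two: G' is usually described as a smoothed version of G, where each site's value is a kernel-weighted average of the per-site G values in a window around it. The following is a minimal sketch, assuming a tricube kernel over a fixed-width window on a single chromosome. The function name, the window size default, and the simple quadratic-time loop are assumptions for illustration, not the implementation from the paper or from any existing tool.

```python
def g_prime(positions, g_values, window=1_000_000):
    """Tricube-smoothed G for each site on one chromosome.

    positions: site coordinates; g_values: per-site G at those coordinates.
    window: total window width in bp, centred on the focal site (assumed default).
    """
    half = window / 2.0
    smoothed = []
    for focal in positions:
        weighted_sum = 0.0
        weight_total = 0.0
        for pos, g in zip(positions, g_values):
            # Distance to the focal site, scaled so the window edge is at 1.
            d = abs(pos - focal) / half
            if d < 1.0:
                w = (1.0 - d ** 3) ** 3  # tricube kernel weight
                weighted_sum += w * g
                weight_total += w
        # The focal site itself always has weight 1, so weight_total > 0.
        smoothed.append(weighted_sum / weight_total)
    return smoothed
```

Averaging over neighbouring sites trades positional resolution for lower noise, so the window size would need to be chosen with the expected extent of linkage in the cross or population in mind.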

Billie's Gstat code doesn't take number of individuals in the pool as input; my understanding is that it's using solely the read coverage and the SNP calls from SNAPE-pooled, and the number of individuals in the pool is part of SNAPE-pooled input.

My lab uses SNAPE-pooled and got it to run after some hair-pulling, mainly motivated by the paper below, which is quite recent and found that SNAPE-pooled performed better than other pool-seq variant callers. It does have 34 citations in the last 4 years, despite the challenges of using it.

Molecular Ecology Resources - 2021 - Guirao‐Rico - Benchmarking the performance of Pool‐seq SNP callers using simulated and.pdf

Yes, I'm in the Lowry lab - let me discuss the collaboration idea with David and I'll get back to you over email.

Thanks!
Madison
