Modify thinPoints to handle cases when extreme p-values exist #27

Open
amstilp opened this issue Mar 30, 2020 · 4 comments

Comments

@amstilp
Contributor

amstilp commented Mar 30, 2020

When there are variants with very extreme p-values, variants with more modest but still significant p-values can get thinned. I saw this in an actual analysis, and here's a reproducible example showing what happens:

library(dplyr)
library(TopmedPipeline)

set.seed(123)
# Random p-values representing the majority of variants.
dat <- data.frame(
  pval = runif(10000)
) %>%
  # A "hit" region with modest p-values.
  bind_rows(
    data.frame(
      pval = 10^(-1 * runif(5, min = 6, max = 8))
    )
  ) %>%
  # A really strong hit, possibly an artifact but not necessarily.
  bind_rows(
    data.frame(
      pval = 10^(-1 * runif(5, min = 100, max = 120))
    )
  ) %>%
  mutate(
    obs = pval
  ) %>%
  arrange(obs) %>%
  mutate(
    x = row_number(),
    exp = x / n(),
    upper = qbeta(0.025, x, rev(x)),
    lower = qbeta(0.975, x, rev(x)),
    logp = -log10(pval)
  )

# Thin points using standard function.
thinned <- thinPoints(dat, "logp", n = 100, nbins = 10)
dat$selected <- dat$x %in% thinned$x
dat$bonferroni <- dat$pval < 0.05 / nrow(dat)

# Look at the set of variants that were not selected by thinPoints but are bonferroni-significant.
# It's the hit region with modest p-values.
dat %>% as_tibble() %>% filter(!selected & bonferroni)
#> # A tibble: 5 x 9
#>         pval        obs     x      exp   upper   lower  logp selected bonferroni
#>        <dbl>      <dbl> <int>    <dbl>   <dbl>   <dbl> <dbl> <lgl>    <lgl>     
#> 1    1.82e-8    1.82e-8     6 0.000599 2.20e-4 0.00117  7.74 FALSE    TRUE      
#> 2    2.20e-7    2.20e-7     7 0.000699 2.81e-4 0.00130  6.66 FALSE    TRUE      
#> 3    2.24e-7    2.24e-7     8 0.000799 3.45e-4 0.00144  6.65 FALSE    TRUE      
#> 4    2.39e-7    2.39e-7     9 0.000899 4.11e-4 0.00157  6.62 FALSE    TRUE      
#> 5    5.61e-7    5.61e-7    10 0.000999 4.79e-4 0.00171  6.25 FALSE    TRUE

These variants are important, but they aren't selected because they fall into a bin with a lot of other variants with less significant p-values.

One suggestion for how to fix this is to modify thinPoints to keep all variants with p < p_threshold, and then apply the thinning only to those with p > p_threshold. I'm not sure what value p_threshold should be, though -- 5e-8? 5e-9? Bonferroni significance?
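
Roughly something like this (the wrapper name and the default p_threshold are just placeholders for illustration, not part of TopmedPipeline):

# Keep everything below p_threshold, and only thin the rest.
# thinPointsKeepHits and the default p_threshold = 1e-4 are hypothetical.
thinPointsKeepHits <- function(dat, pval_col = "pval", logp_col = "logp",
                               p_threshold = 1e-4, n = 100, nbins = 10) {
  keep <- dat[dat[[pval_col]] < p_threshold, ]
  rest <- dat[dat[[pval_col]] >= p_threshold, ]
  thinned <- TopmedPipeline::thinPoints(rest, logp_col, n = n, nbins = nbins)
  rbind(keep, thinned)
}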

@mconomos
Contributor

I like the idea of keeping all variants with p < p_threshold, but I think the threshold would need to be bigger than 5e-8 to avoid the weird "gaps" in the plots. I think we could actually use a value like 1e-4 and be OK. Under the null, the p-values are uniformly distributed on (0,1), so even with 40,000,000 variants we'd only expect about 4,000 to have p < 1e-4. Of course, it will be more if there is actual signal, so maybe we shouldn't go quite that high, but I don't think it should be genome-wide significance level.
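
For what it's worth, that expectation is just n_variants * threshold under the null (a trivial check, numbers as above):

# Under the null, p-values are Uniform(0, 1), so the expected number of
# variants below a threshold is simply n_variants * threshold.
40e6 * 1e-4
#> [1] 4000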

Another option would just be to increase thin_nbins from 10 to something like 50. I've seen that work OK in other analyses, but your suggestion may be the better solution.

@amstilp
Contributor Author

amstilp commented Mar 31, 2020

I remember thinking about this for OLGA analyses. It looks like we used p = 1e-2 as a threshold there, which seems large to me. We could also think about a dynamic threshold, something like 0.05 / n_variants * factor, and choose the default factor such that the threshold is ~1e-4 for a genome-wide (single variant?) analysis. A dynamic threshold might work better when running a single chromosome, for example.

@mconomos
Contributor

mconomos commented Apr 1, 2020

A dynamic approach seems reasonable to me. Based on your formula, if we assume 5e-8 comes from 1,000,000 variants, then the factor is 2000. Or we could just say 100/n_variants.

With 1,000,000 variants, the threshold would be 1e-4.
With 10,000,000 variants, it would be 1e-5.
With 100,000,000 variants (I think this is roughly TOPMed scale), it would be 1e-6.

Does this seem reasonable? Probably. It always corresponds to an expectation of 100 variants below the threshold under the null hypothesis. Of course there will be more when there are real signals, but that makes sense. We should probably test it on a couple of examples.
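
As a quick sketch (the function name thin_p_threshold is just for illustration):

# Dynamic threshold: ~100 variants expected below it under the null.
thin_p_threshold <- function(n_variants, expected_hits = 100) {
  expected_hits / n_variants
}
thin_p_threshold(1e6)
#> [1] 1e-04
thin_p_threshold(1e7)
#> [1] 1e-05
thin_p_threshold(1e8)
#> [1] 1e-06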

@DanielEWeeks

We encountered this thinPoints issue too. Have any steps been taken to resolve it?

Instead of a dynamic approach, I'd favor a controllable one, where we can set our own -log10(p) threshold in the config file, above which no thinning of the most significant results is done at all. That is, I want to be sure that the resulting plot shows us all signals above the chosen threshold.
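
For example, something along these lines, reusing dat from the example above (thin_logp_threshold is a hypothetical config parameter, not currently in the pipeline):

# thin_logp_threshold would come from the analysis config file (hypothetical name);
# everything with -log10(p) above it is kept, and only the remaining points are thinned.
thin_logp_threshold <- 5
keep <- dat[dat$logp > thin_logp_threshold, ]
rest <- dat[dat$logp <= thin_logp_threshold, ]
plotdat <- rbind(keep, TopmedPipeline::thinPoints(rest, "logp", n = 100, nbins = 10))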
