Skip to content

Tutorial 3: maxFDR Calculation

huilisabrina edited this page Jul 17, 2018 · 2 revisions

In this tutorial, we will walk through an example of calculating the approximate upper bound on the FDR of MTAG results. The theoretical framework and more detailed discussion can be found in Supplement Information Section 1.4 of Turley et al. (2017). Note that MTAG does not automatically filter the traits based on FDR.

We will demonstrate using a subset of the original GWAS results on neuroticisim and subjective well-being from Okbay et al. (2016). These sumstats have been preprocessed and formatted for running mtag. Please refer to the Sample GWAS Results and Data Format section in in the first tutorial for initial cleaning of the data inputs.

Runing maxFDR with MTAG

To run the maxFDR calculation, invoke --fdr when running MTAG in command line. With minimal options:

  python ../mtag/mtag.py  \
  --sumstats ../data/1_OA2016_hm3samp_NEUR.txt,../data/1_OA2016_hm3samp_SWB.txt \
  --out ../output/3_maxFDR/3.1_mtag_maxFDR_basic_NS \
  --stream_stdout \
  --fdr \
  --stream_stdout

Advanced Options

The following options are available for running maxFDR calculations. It's also helpful to review the comprehensive list of options via python mtag.py -h and the section "Max FDR calculation":

  • --n_approx: Depending on the number of traits and the fineness of the grid that are used to make the approximation, the computing time can be substantial. In this case, we recommend using the option --n_approx to speed up the FDR calculation by replacing the sample size of a SNP by the average sample size across SNPs (for each trait). Updates: We will make this a default option. Turning it off requires numpy version to be greater than 1.13.0.
  • --grid_file: Customized set of grid points to calculate FDR.
  • --intervals: Number of intervals to partition [0,1]. Used to adjust the coarseness of grid.
  • --cors: Number of threads/cores used to parallelize the computation.
  • --fit_ss: Another option to speed up the grid search process - If specified, mtag will estimate the prior probability that a SNP is null for each trait and then restrict the grid search to the set of probability vectors ("sum") that sum to the prior null for each trait. This is useful for restricting the search space of large-dimensional traits. Notes: This option is still under testing.

Updates: Alternatively, you can specify the path to the MTAG result files and skip the MTAG estimation step. This allows you to only perform maxFDR calculation without going through MTAG again. To do this, you need to repull the MTAG repository (updated since July.xx, 2018), and use the fdr_only.py as the following:

  python ../mtag/fdr_only.py  \
  --mtag_path ../data/1_OA2016_hm3samp_NEUR.txt,../data/1_OA2016_hm3samp_SWB.txt

The same set of "Max FDR Calculation" options are available in this setting. The output files will have a new prefix [MTAG_PREFIX].FDR to avoid confusion with the previously created [MTAG_PREFIX].log file.

maxFDR output

In addition to the standard mtag output as explained in Tutorial 1, specifying --fdr will make MTAG output two more files:

  • [PREFIX]_fdr_mat.txt records the maximum false discovery rate for each trait in each state.

    0.000000000000000000e+00        0.000000000000000000e+00
    0.000000000000000000e+00        2.412599223162292111e-03
    0.000000000000000000e+00        3.846085678952126680e-03
    0.000000000000000000e+00        4.369596250455918054e-03
    0.000000000000000000e+00        4.120088494741054597e-03
    6.910217361619816305e-04        0.000000000000000000e+00
    6.836514978170410255e-04        2.410876601205827474e-03
    6.724179301230901167e-04        3.776698888290617052e-03
    ...
    1.606344506405588525e-04        8.960390758686683941e-04
    1.636829652551818101e-04        1.066110509474288925e-04
    1.029613181802874888e-04        8.910495939743934735e-05
    7.577909678908985259e-05        7.238905719667969871e-04
    7.679184008810661379e-05        4.346938850577793643e-05
    4.120803680373065888e-05        3.306298969392538757e-05
    1.266147392770030122e-05        8.980852269155806025e-06
    
  • [PREFIX]_prob_grid.txtrecords the probability grid or the mixture weights of each state. # rows: number of gridpoints to search, which is a function of the --intervals specified and the number of traits. # columns: number of states which always equals 2^T.

    0.000000000000000000e+00        0.000000000000000000e+00        0.000000000000000000e+00        1.000000000000000000e+00
    0.000000000000000000e+00        0.000000000000000000e+00        1.000000000000000056e-01        9.000000000000000222e-01
    0.000000000000000000e+00        0.000000000000000000e+00        2.000000000000000111e-01        8.000000000000000444e-01
    0.000000000000000000e+00        0.000000000000000000e+00        2.999999999999999889e-01        6.999999999999999556e-01
    0.000000000000000000e+00        0.000000000000000000e+00        4.000000000000000222e-01        5.999999999999999778e-01
    0.000000000000000000e+00        1.000000000000000056e-01        0.000000000000000000e+00        9.000000000000000222e-01
    0.000000000000000000e+00        1.000000000000000056e-01        1.000000000000000056e-01        8.000000000000000444e-01
    ...
    5.999999999999999778e-01        1.000000000000000056e-01        0.000000000000000000e+00        2.999999999999999889e-01
    6.999999999999999556e-01        0.000000000000000000e+00        0.000000000000000000e+00        2.999999999999999889e-01
    6.999999999999999556e-01        0.000000000000000000e+00        1.000000000000000056e-01        2.000000000000000111e-01
    6.999999999999999556e-01        1.000000000000000056e-01        0.000000000000000000e+00        2.000000000000000111e-01
    8.000000000000000444e-01        0.000000000000000000e+00        0.000000000000000000e+00        2.000000000000000111e-01
    9.000000000000000222e-01        0.000000000000000000e+00        0.000000000000000000e+00        1.000000000000000056e-01
    

The summary of maxFDR results are reported at the last section of the MTAG log file. If fdr_only.py is used, the FDR log file will look like the following:

  T=2
  Number of gridpoints to search: 65
  Performing grid search using 1 cores.
  Grid search: 10.0 percent finished for . Time:  0.000 min
  Grid search: 20.0 percent finished for . Time:  0.001 min
  Grid search: 30.0 percent finished for . Time:  0.001 min
  Grid search: 40.0 percent finished for . Time:  0.002 min
  Grid search: 50.0 percent finished for . Time:  0.003 min
  Grid search: 60.0 percent finished for . Time:  0.003 min
  Grid search: 70.0 percent finished for . Time:  0.004 min
  Grid search: 80.0 percent finished for . Time:  0.004 min
  Grid search: 90.0 percent finished for . Time:  0.005 min
  Grid search: 100.0 percent finished for . Time:         0.005 min
  Saved calculations of fdr over grid points in ../output/3_maxFDR/3.2_mtag_maxFDR_approx_NS_fdr_mat.txt
  <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
  grid point indices for max FDR for each trait: [12  3]
  Maximum FDR
  Max FDR of Trait 1: 0.00109655427181 at probs = [ 0.   0.3  0.   0.7]
  Max FDR of Trait 2: 0.00436959625046 at probs = [ 0.   0.   0.3  0.7]
  <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
  Completed FDR calculations.
Clone this wiki locally