Full workflow to perform a Mendelian Randomization analysis with QTL using Nextflow

juliaapolonio/Causeway

Overview

juliaapolonio/Causeway is a pipeline for Mendelian Randomization (MR) and sensitivity analysis between phenotype GWAS summary statistics and QTL data.

Existing MR tools have mostly been applied to a small number of exposure-outcome combinations and are not optimized for a large number of combinations, as in a genome-wide QTL screening. Causeway was built to enable MR and sensitivity analysis in a user-friendly and computationally efficient way. The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses Docker/Singularity containers, which makes installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. As a future improvement, the local modules will, where possible, be submitted to and installed from nf-core/modules so that they are available to all nf-core pipelines and to everyone in the Nextflow community.

Pipeline summary

Generalized Summary Mendelian Randomization (GSMR)

This is the main part of the process. It runs GSMR for all Exposures against the Outcomes and returns the number of instrumental variables (IVs), betas, SEs and p-values for each exposure.

Significant gene calculation and filtering

Using the GSMR results, this module calculates the FDR-adjusted p-value for each gene and filters genes by that value and by the number of IVs. This step substantially reduces the number of tasks for the subsequent processes and, therefore, the execution time of the pipeline.
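The filtering step can be illustrated with a minimal Python sketch. It assumes the FDR correction is Benjamini-Hochberg, and the thresholds (FDR below 0.05, at least 3 IVs), the dictionary keys and the gene names are hypothetical placeholders chosen for illustration; the pipeline's actual method and cut-offs may differ.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def filter_genes(results, fdr_threshold=0.05, min_ivs=3):
    """Keep genes passing both filters.

    `results` is a list of dicts with the hypothetical keys
    `gene`, `p` and `n_ivs`; both thresholds are assumptions.
    """
    fdr = benjamini_hochberg([r["p"] for r in results])
    return [
        {**r, "fdr": q}
        for r, q in zip(results, fdr)
        if q < fdr_threshold and r["n_ivs"] >= min_ivs
    ]


# Made-up GSMR-style results, for illustration only.
gsmr_results = [
    {"gene": "GENE_A", "p": 0.0001, "n_ivs": 5},
    {"gene": "GENE_B", "p": 0.04, "n_ivs": 4},
    {"gene": "GENE_C", "p": 0.0005, "n_ivs": 1},
    {"gene": "GENE_D", "p": 0.8, "n_ivs": 6},
]
significant = filter_genes(gsmr_results)
print([r["gene"] for r in significant])
```

In this toy input only GENE_A survives: GENE_B has a nominal p-value below 0.05 but an adjusted value above it, and GENE_C has too few IVs.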

Two Sample MR (2SMR)

TwoSampleMR (2SMR) is an R package that performs Mendelian Randomization and sensitivity analysis. The workflow is configured to run the following 2SMR tests:

  • Inverse Variance Weighted regression;
  • Simple Median regression;
  • Simple mode regression;
  • MR Egger regression;
  • Heterogeneity Egger;
  • Heterogeneity Inverse Variance Weighted;
  • Steiger direction test;
  • Pleiotropy Egger intercept;
  • MR-PRESSO outlier analysis.
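To make the first test in this list concrete, the sketch below computes a textbook fixed-effect inverse variance weighted (IVW) estimate from per-SNP summary statistics: each SNP's Wald ratio (outcome beta divided by exposure beta) is weighted by its first-order inverse variance. This is a minimal illustration in Python, not the TwoSampleMR implementation, which runs in R and may handle heterogeneity differently. The function name and the input values are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect and its standard error.

    Weights use the first-order approximation bx^2 / se_y^2, which
    ignores uncertainty in the exposure betas.
    """
    weights = [bx ** 2 / so ** 2 for bx, so in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    return estimate, se


# Made-up summary statistics for three instruments.
bx = [0.1, 0.2, 0.4]      # SNP effects on the exposure
by = [0.05, 0.1, 0.2]     # SNP effects on the outcome
se_y = [0.01, 0.02, 0.02]  # standard errors of the outcome effects

estimate, se = ivw_estimate(bx, by, se_y)
print(f"IVW estimate = {estimate:.3f}, SE = {se:.4f}")
```

Because every instrument here has the same Wald ratio, the pooled estimate equals that ratio; with real data the ratios differ and the weights decide how much each SNP contributes.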

Coloc

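Coloc is an R package for colocalization analysis. For this workflow, the following information is retrieved from Coloc:

  • Posterior probability of H3 (both traits are associated, but with different causal variants);
  • Posterior probability of H4 (both traits are associated and share a single causal variant);
  • Most probable causal variant.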

Generate output report

This set of processes collects all results from the analysis and merges them into a single .csv file, then filters the results to a list of candidate drug targets. An HTML report with the analysis highlights is also generated.

Quick Start

  1. Install Nextflow (>=22.10.1)

  2. Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. Conda can be used both to install Nextflow itself and to manage software within pipelines, but please only use it within pipelines as a last resort (see docs).

  3. Download the pipeline and test it on a minimal dataset with a single command:

nextflow run juliaapolonio/Causeway -profile test,YOURPROFILE --outdir <OUTDIR>

This test run uses 4 genes from the eQTLGen cis-eQTL data, the 1000 Genomes phase 3 (GRCh37) genotype p-file, and custom Strict Depression summary statistics retrieved from MTAG. Note that some form of configuration is needed so that Nextflow knows how to fetch the required software. This is usually done with a config profile (YOURPROFILE in the example command above).

  • The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
  • Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
  • If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
  • If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
  4. Start running your own analysis:

nextflow run juliaapolonio/Causeway \
  --exposure <EXPOSURE_SAMPLESHEET> \
  --outdir <OUTDIR> \
  --ref <REFERENCE_FOLDER> \
  --outcome <OUTCOME_SAMPLESHEET> \
  -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>

Databases and references

The pipeline needs three inputs to run:

  • A reference folder;
  • An Exposure sample sheet;
  • An Outcome file.

Both Exposure and Outcome files should follow the GCTA-COJO format. Exposure data should be split into one file per gene. The reference files should be in PLINK bfile format. Neither the Exposure nor the Outcome files should contain multi-allelic SNPs, and the frequency column (freq) should hold the Minor Allele Frequency (MAF). If the Outcome has a small number of SNPs (fewer than 2M), a substantial share of the tasks is expected to fail because few or no IVs match between the Exposure and Outcome data. Even if the Outcome data has a large number of SNPs (more than 10M), around 10% of GSMR tasks are still expected to fail.
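As a pre-flight check, the requirements above can be sketched in a few lines of Python. The sketch assumes the standard GCTA-COJO column order (SNP A1 A2 freq b se p N) in a whitespace-delimited file, treats a repeated SNP ID as a sign of a possibly multi-allelic site, and applies the MAF convention stated above (freq no greater than 0.5). The function name `check_cojo` and the example rows are hypothetical and are not part of the pipeline.

```python
import io

# Standard GCTA-COJO header; assumed to match the pipeline's expectations.
COJO_COLUMNS = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def check_cojo(handle):
    """Return a list of problems found in a COJO-style text stream."""
    problems = []
    header = handle.readline().split()
    if header != COJO_COLUMNS:
        return [f"unexpected header: {header}"]
    seen = set()
    for line_no, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COJO_COLUMNS):
            problems.append(f"line {line_no}: expected 8 fields, got {len(fields)}")
            continue
        snp, freq = fields[0], float(fields[3])
        # A repeated SNP ID is used here as a proxy for a multi-allelic site.
        if snp in seen:
            problems.append(f"line {line_no}: {snp} appears more than once")
        seen.add(snp)
        if not 0.0 < freq <= 0.5:
            problems.append(f"line {line_no}: freq {freq} is not a valid MAF")
    return problems


# Made-up rows for illustration only.
example = "\n".join([
    "SNP A1 A2 freq b se p N",
    "rs1001 A G 0.30 0.012 0.004 0.0027 50000",
    "rs1002 C T 0.62 -0.020 0.005 0.00006 50000",
    "rs1001 A T 0.30 0.010 0.004 0.0124 50000",
])
problems = check_cojo(io.StringIO(example))
for problem in problems:
    print(problem)
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper. The check only flags problems; it does not modify or filter the data.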

Outputs

If successfully run, the workflow should give three files as the main output:

  • summary_report.html is an HTML report with all analysis highlights;
  • mr_merged_results.csv contains all analysis results for each GSMR-significant gene;
  • significant_genes.txt lists all genes that meet the criteria defined in the workflow's paper.

Other intermediate outputs are stored in a folder with the corresponding process name and are described in the output section.

Credits

juliaapolonio/Causeway was authored by Julia Apolonio with João Cavalcante and Diego Coelho's assistance, under Dr. Vasiliki Lagou's supervision.

Citations

Causal associations between risk factors and common diseases inferred from GWAS summary data.

Zhihong Zhu, Zhili Zheng, Futao Zhang, Yang Wu, Maciej Trzaskowski, Robert Maier, Matthew R. Robinson, John J. McGrath, Peter M. Visscher, Naomi R. Wray & Jian Yang

Nature Communications 2018 Jan 15. doi: 10.1038/s41467-017-02317-2

The MR-Base platform supports systematic causal inference across the human phenome.

Hemani G, Zheng J, Elsworth B, Wade KH, Baird D, Haberland V, Laurin C, Burgess S, Bowden J, Langdon R, Tan VY, Yarmolinsky J, Shihab HA, Timpson NJ, Evans DM, Relton C, Martin RM, Davey Smith G, Gaunt TR, Haycock PC, The MR-Base Collaboration.

eLife 2018 Jul. doi: 10.7554/eLife.34408

Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses

Chris Wallace

PLOS Genetics 2020 Apr 20. doi: 10.1371/journal.pgen.1008720

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
