restructure installation doc
pdiakumis committed Jun 15, 2024
1 parent 3600543 commit 3e10f35
Showing 1 changed file with 91 additions and 222 deletions.
313 changes: 91 additions & 222 deletions pcgrr/vignettes/installation.Rmd
@@ -1,13 +1,18 @@
---
title: "Installation"
output: rmarkdown::html_document

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = "", collapse = TRUE)
```


```{r load_pkgs, include=FALSE, echo=FALSE, message=FALSE, warning=FALSE}
require(glue, include.only = "glue")
```


```{r vars, echo=FALSE}
Sys.setenv(VEP_VERSION = "112")
Sys.setenv(PCGR_VERSION = "1.4.1.9014")
@@ -18,64 +23,60 @@ BUNDLE_VERSION <- Sys.getenv("BUNDLE_VERSION")
```

```{r funcs, echo=FALSE}
bundle_link <- function(v, hg) {
glue("[{hg} - {v}](https://insilico.hpc.uio.no/pcgr/pcgr_ref_data.{v}.{hg}.tgz)")
bundle_link <- function(hg) {
v <- BUNDLE_VERSION
glue("https://insilico.hpc.uio.no/pcgr/pcgr_ref_data.{v}.{hg}.tgz")
}
```


The PCGR workflow has several data requirements and software installation options.

- Data requirements:
- Sample-specific inputs (e.g. somatic variant calls in VCF format)
- Reference bundle (e.g. CIViC, CGI, TCGA)
- Ensembl VEP data cache
## Data

- Software options:
- Conda
- Docker
- Singularity/Apptainer
PCGR requires the following data:

## Data
- Sample-specific inputs (e.g. somatic variant calls in VCF format)
- Reference bundle (e.g. CIViC, CGI, TCGA)
- Ensembl VEP data cache

PCGR supports GRCh37 and GRCh38 sample-specific inputs. The reference bundle and
VEP data cache need to match the chosen human genome assembly.
PCGR supports the GRCh37 and GRCh38 human genome assemblies. All the data above
need to match the chosen assembly.

### 1. Reference Bundle

Reference bundles are generated semi-automatically by the author and versioned
based on their release date. Keep in mind that the bundles support only certain
Ensembl VEP versions. The genome-specific bundle is available from below (size: ~5G):

- `r bundle_link(v = BUNDLE_VERSION, hg = "grch37")`
- `r bundle_link(v = BUNDLE_VERSION, hg = "grch38")`
Reference bundles are generated semi-automatically (by the PCGR author) and
are versioned based on their release date. Keep in mind that the bundles support
only certain Ensembl VEP versions. The genome-specific bundles
(**v`r BUNDLE_VERSION`**) can be downloaded from the links below (size: ~5 GB):

**Tip**: The `data/grch3x/.PCGR_BUNDLE_VERSION` file indicates the bundle version.
| Assembly | Download Link |
|----------|---------------------------|
| GRCh38 | `r bundle_link("grch38")` |
| GRCh37 | `r bundle_link("grch37")` |

<details>
<summary>Bash example</summary>
**Tip**: The `data/grch3x/.PCGR_BUNDLE_VERSION` file within the downloaded bundle
indicates the bundle version for reporting purposes.

#### Bash Example

```{bash echo=FALSE}
echo "BUNDLE_VERSION=\"${BUNDLE_VERSION}\""
```

```{bash eval=FALSE}
```bash
GENOME="grch38" # or "grch37"
BUNDLE_VERSION="20240612"
BUNDLE="pcgr_ref_data.${BUNDLE_VERSION}.${GENOME}.tgz"

wget https://insilico.hpc.uio.no/pcgr/${BUNDLE}
gzip -dc ${BUNDLE} | tar xvf -

mkdir ${BUNDLE_VERSION}
mv data/ ${BUNDLE_VERSION}
```
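
After extraction, the bundle version can be sanity-checked against the `.PCGR_BUNDLE_VERSION` file mentioned in the tip above. The following is a minimal sketch, not part of PCGR: the function name `show_bundle_version` is hypothetical, and it only assumes the file is plain text at `data/<genome>/.PCGR_BUNDLE_VERSION`.

```bash
# Hypothetical helper (not part of PCGR): print the version recorded in an
# extracted reference bundle, or fail if the version file is missing.
# Usage: show_bundle_version <directory containing data/> <grch37|grch38>
show_bundle_version() {
  local bundle_dir="$1"
  local genome="$2"
  local version_file="${bundle_dir}/data/${genome}/.PCGR_BUNDLE_VERSION"
  if [ ! -f "${version_file}" ]; then
    echo "ERROR: ${version_file} not found" >&2
    return 1
  fi
  # Print the file as-is; its exact layout is not assumed here.
  cat "${version_file}"
}
```

With the variables from the example above, this could be called as `show_bundle_version "${BUNDLE_VERSION}" "${GENOME}"`, since `data/` was moved into the `${BUNDLE_VERSION}` directory.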

</details>

### 2. VEP Cache

Ensembl [VEP][vep-web] requires a data cache which is available from the Ensembl
[VEP][vep-web] requires a data cache which is available from the Ensembl
[FTP site][ensembl-ftp] (search there for files starting with `homo_sapiens_vep_`).
We currently support Ensembl VEP version `112`.
We currently support Ensembl VEP **v`r VEP_VERSION`**.

**Tip**: PCGR needs to be pointed to the _parent_ directory containing
the downloaded `homo_sapiens/xyz_GRCh3x/` cache, which is usually called `.vep` if
@@ -85,32 +86,80 @@ you've followed the VEP cache [download instructions][vep-cache].
[ensembl-ftp]: https://ftp.ensembl.org/pub/release-112/variation/indexed_vep_cache/
[vep-cache]: https://asia.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache
- Bash example:
#### Bash Example
```{bash echo=FALSE}
echo "VEP_VERSION=\"${VEP_VERSION}\""
```
```bash
GENOME="GRCh38" # or "GRCh37"
VEP_VERSION="112"
CACHE="homo_sapiens_vep_${VEP_VERSION}_${GENOME}.tar.gz"
wget https://ftp.ensembl.org/pub/release-${VEP_VERSION}/variation/indexed_vep_cache/${CACHE}
gzip -dc ${CACHE} | tar xvf -
```
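
Because PCGR must be pointed to the _parent_ directory of the extracted `homo_sapiens/xyz_GRCh3x/` cache, a quick layout check can catch a wrong path before a run. This is a sketch only: `check_vep_dir` is a hypothetical helper, not part of PCGR or VEP, and it assumes nothing beyond the directory naming described in the tip above.

```bash
# Hypothetical helper (not part of PCGR or VEP): confirm that a directory is
# the *parent* of the extracted cache, i.e. contains homo_sapiens/<xyz>_<GENOME>/.
# Usage: check_vep_dir <vep_dir> <GRCh37|GRCh38>
check_vep_dir() {
  local vep_dir="$1"
  local genome="$2"
  local d
  # If the glob matches nothing, "d" holds the literal pattern and -d fails.
  for d in "${vep_dir}"/homo_sapiens/*_"${genome}"; do
    if [ -d "${d}" ]; then
      echo "OK: found ${d}"
      return 0
    fi
  done
  echo "ERROR: no homo_sapiens/*_${genome} directory under ${vep_dir}" >&2
  return 1
}
```

If the command above was run inside a directory such as `.vep`, then `check_vep_dir /path/to/.vep "${GENOME}"` should succeed, while passing the `homo_sapiens` directory itself should fail.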
-----------------------------
## Software
The PCGR workflow can be installed using [Conda][conda-web], [Docker][docker-web],
or [Singularity/Apptainer][apptainer-web].
The PCGR workflow can be installed using [Docker][docker-web],
[Singularity/Apptainer][apptainer-web] or [Conda][conda-web].
[conda-web]: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
[docker-web]: https://docs.docker.com/
[apptainer-web]: https://apptainer.org/docs/user/latest/index.html
### Conda
### A. Docker
The Docker image is available on [DockerHub](https://hub.docker.com/r/sigven/pcgr/tags).
Pull the latest **v`r PCGR_VERSION`** image with:
```{r echo=FALSE}
glue("docker pull sigven/pcgr:{PCGR_VERSION}")
# might need to specify platform
# docker pull --platform=amd64 sigven/pcgr:${PCGR_VERSION}
```
#### Example Run
```bash
docker container run -it --rm \
-v /Users/you/projects/.vep:/mnt/.vep \
-v /Users/you/projects/bundle:/mnt/bundle \
-v /Users/you/projects/pcgr_inputs:/mnt/pcgr_inputs \
-v /Users/you/projects/pcgr_outputs:/mnt/pcgr_outputs \
sigven/pcgr:1.4.1.9014 \
pcgr \
--input_vcf "/mnt/pcgr_inputs/tumor_sample.BRCA.vcf.gz" \
--vep_dir "/mnt/.vep" \
--refdata_dir "/mnt/bundle" \
--output_dir "/mnt/pcgr_outputs" \
--genome_assembly "grch38" \
--sample_id "SampleB" \
--assay "WGS" \
--vcf2maf
```
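
Docker's `-v` option generally creates a missing host path as an empty directory instead of failing, so a typo in a host path can surface only later as missing inputs inside the container. The sketch below checks the host directories first; `require_dirs` is a hypothetical helper, not part of PCGR or Docker, and the paths in the usage note are the placeholder paths from the example above.

```bash
# Hypothetical helper (not part of PCGR or Docker): verify that every host
# directory to be mounted exists before starting the container.
# Usage: require_dirs <dir> [<dir> ...]
require_dirs() {
  local d
  local missing=0
  for d in "$@"; do
    if [ ! -d "${d}" ]; then
      echo "ERROR: missing host directory: ${d}" >&2
      missing=1
    fi
  done
  # Non-zero exit status if at least one directory was missing.
  return "${missing}"
}
```

For instance, `require_dirs /Users/you/projects/.vep /Users/you/projects/bundle /Users/you/projects/pcgr_inputs /Users/you/projects/pcgr_outputs && docker container run ...` would only start the container when all four directories exist.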
### B. Singularity/Apptainer
```{r echo=FALSE}
glue("apptainer pull oras://ghcr.io/sigven/pcgr:{PCGR_VERSION}.singularity")
```
There is conda support for both Linux and macOS machines:
### C. Conda
<details>
<summary>Linux</summary>
There is Conda support for both Linux and macOS machines.
The following process can take anywhere from 10 to 40 minutes when installing
from scratch, depending mostly on the user's and server's internet connections.
Most of the time is spent on downloading the `{BSgenome.Hsapiens.UCSC.hg19}` and
`{BSgenome.Hsapiens.UCSC.hg38}` R packages (which happens at the very end of the
conda environment creation).
#### Linux
```bash
# set up variables
@@ -127,12 +176,9 @@ conda activate ./pcgr_conda/pcgr
pcgr --version
```
</details>
<details>
<summary>macOS</summary>
#### macOS
For macOS M1 machines, you need to have `CONDA_SUBDIR=osx-64` before the
For macOS M1 machines, you need to include `CONDA_SUBDIR=osx-64` before the
`conda create` command - see
<https://github.com/conda-forge/miniforge/issues/165#issuecomment-860233092>:
@@ -150,180 +196,3 @@ conda activate ./pcgr_conda/pcgr
# test that it works
pcgr --version
```
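
The `CONDA_SUBDIR=osx-64` requirement can also be decided from the machine itself rather than by hand. The following is a sketch under assumptions: `conda_subdir_for_host` is a hypothetical helper, and it assumes that Apple silicon machines report `Darwin` and `arm64` from `uname -s` and `uname -m` (a shell running under Rosetta may report `x86_64` instead).

```bash
# Hypothetical helper (not part of PCGR or conda): print "osx-64" when the
# host looks like an Apple silicon Mac, and print nothing otherwise.
# Usage: conda_subdir_for_host [<os> <arch>]   (defaults to uname output)
conda_subdir_for_host() {
  local os="${1:-$(uname -s)}"
  local arch="${2:-$(uname -m)}"
  if [ "${os}" = "Darwin" ] && [ "${arch}" = "arm64" ]; then
    echo "osx-64"
  fi
}

# Export CONDA_SUBDIR only when needed, before running 'conda create'.
subdir="$(conda_subdir_for_host)"
if [ -n "${subdir}" ]; then
  export CONDA_SUBDIR="${subdir}"
fi
```

On Linux or Intel macOS the variable is left untouched, so the same script can precede the `conda create` command on any of the supported platforms.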
</details>
### Docker
See the [Docker setup](#dockersetup) section for more details.
```bash
PCGR_VERSION="1.4.1.9014"
docker pull sigven/pcgr:${PCGR_VERSION}
# might need to specify platform
# docker pull --platform=amd64 sigven/pcgr:${PCGR_VERSION}
```
### Singularity/Apptainer
```bash
PCGR_VERSION="1.4.1.9014"
apptainer pull oras://ghcr.io/sigven/pcgr:${PCGR_VERSION}.singularity
```
<br>
<hr>
<br>
<a name="step1"></a>
### STEP 2: Set up Conda or Docker
Step 2 depends on if you want to use Conda or Docker:
- For Conda, continue reading the [PCGR Conda setup](#condasetup).
- For Docker, skip to the [PCGR Docker setup](#dockersetup).
<a name="condasetup"></a>
### Option 1: Conda
#### a) Miniconda and conda
Download and install the Miniconda installer from <https://docs.conda.io/en/latest/miniconda.html>:
- Make sure to download the Linux or MacOSX script according to which platform you're currently on.
- Run `bash miniconda.sh` and follow the prompts (it should be okay to accept the defaults, unless you want to choose a different
installation location than the default `~/miniconda3`).
- Exit your current terminal session and open a new one. You should now notice something like a `(base)` string as a
prefix in your terminal prompt. This means that you're in the `base` conda environment, and you're ready to start
installing the conda environments for PCGR.

```bash
PLATFORM="MacOSX" # or "Linux"
MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda3-latest-${PLATFORM}-x86_64.sh"
wget ${MINICONDA_URL} -O miniconda.sh && chmod +x miniconda.sh
bash miniconda.sh
```

```text
# exit terminal and open new one - you should now see:

# as of May 2024
(base) $ conda --version
conda 24.5.0
```

#### b) Create PCGR conda environments

The `conda/env/lock` directory in the PCGR codebase contains two `.lock` files which
can be used to create the required conda environments for the Python component
(`pcgr`) and the R components (`pcgrr` (and `cpsr`)). We install the conda
dependencies for these two environments in a local `conda` directory in the
following example:

```bash
cd /Users/you/dir4/conda

PCGR_VERSION="1.4.1.9014"
PCGR_REPO="https://raw.githubusercontent.com/sigven/pcgr/v${PCGR_VERSION}/conda/env/lock"
PLATFORM="linux" # or "osx"

conda create --prefix ./pcgr --file ${PCGR_REPO}/pcgr-${PLATFORM}-64.lock
conda create --prefix ./pcgrr --file ${PCGR_REPO}/pcgrr-${PLATFORM}-64.lock

## Alternatively, for installing in your central conda directory, use the following:
# conda create --name pcgr --file ${PCGR_REPO}/pcgr-${PLATFORM}-64.lock
# conda create --name pcgrr --file ${PCGR_REPO}/pcgrr-${PLATFORM}-64.lock

## For MacOS M1, you need to have 'CONDA_SUBDIR=osx-64' before the conda command, i.e.:
# CONDA_SUBDIR=osx-64 conda create --prefix [...] --file [...]
```

The above process takes 20-30 minutes when installing from scratch. Most of the time
is spent on downloading the
{BSgenome.Hsapiens.UCSC.hg19} and {BSgenome.Hsapiens.UCSC.hg38} R packages
(and yes, for simplicity we download both packages).
In the end, confirm your conda environments have been installed correctly
(notice how the paths are different to the `base` env installation after using the
`--prefix` option above):

```text
$ (base) conda env list
# conda environments:
#
base * /Users/you/miniconda3
pcgr /Users/you/dir4/conda/pcgr
pcgrr /Users/you/dir4/conda/pcgrr
```

#### c) Activate pcgr conda environment

You need to activate the `conda/pcgr` conda environment, and test that it works
correctly with e.g. `pcgr --version`:

```text
$ cd /Users/you/dir4/conda
(base) $ conda activate ./pcgr
# note how the full path to the locally installed conda environment is now displayed

(/Users/you/dir4/conda) $ which pcgr
/Users/you/dir4/conda/pcgr/bin/pcgr

(/Users/you/dir4/conda) $ pcgr --version
pcgr X.X.X

(/Users/you/dir4/conda) $ which pcgrr.R
/Users/you/dir4/conda/pcgr/bin/pcgrr.R
```

You should now be all set up to run PCGR! Continue on to [an example run](running.html#example-run).

<a name="dockersetup"></a>

### Option 2: Docker

#### a) Install Docker

For installing Docker, follow the instructions at <https://docs.docker.com/engine/install/>
for your Linux or MacOSX machine.

#### b) Download PCGR Docker Image

- Pull the [PCGR Docker image](https://hub.docker.com/r/sigven/pcgr/tags) from
DockerHub with: `docker pull sigven/pcgr:X.X.X`

#### c) Run PCGR Docker Container

If you are familiar with working with Docker volumes (<https://docs.docker.com/storage/volumes/>)
you can run PCGR using Docker instead of conda using the `-v <host>:<container>` Docker option.
You'll need to map your PCGR inputs to Docker container paths.
For example, say you have the input VCF `sampleX.vcf.gz` stored in the
directory `/Users/you/project1`. You would need to supply Docker with a
`--volume` (or `-v`) option mapping the directory of that VCF with
a directory inside the Docker container, e.g. `/home/input_vcf_dir`.
That would become: `-v /Users/you/project1:/home/input_vcf_dir`
(note the `:` separating your directory from the container's directory).

Then your command would look something like this:

```bash
docker container run -it --rm \
-v /Users/you/dir0/vep:/root/vep \
-v /Users/you/dir1/data:/root/pcgr_refdata \
-v /Users/you/dir2/pcgr_inputs:/root/pcgr_inputs \
-v /Users/you/dir3/pcgr_outputs:/root/pcgr_outputs \
sigven/pcgr:1.4.1.9014 \
pcgr \
--input_vcf "/root/pcgr_inputs/tumor_sample.BRCA.vcf.gz" \
--vep_dir "/root/vep/.vep" \
--refdata_dir "/root/pcgr_refdata" \
--output_dir "/root/pcgr_outputs" \
--genome_assembly "grch38" \
--sample_id "SampleB" \
--assay "WGS" \
--vcf2maf
```
