BioMuta pipeline

Overview

The BioMuta pipeline gathers mutation data from various sources and combines them into a single dataset under a common field structure.

The sources included in the current version of BioMuta are:

cBioPortal

BioMuta gathers mutation data for the following cancers:

DOID:4045 / muscle cancer
DOID:10283 / prostate cancer
DOID:3565 / meningioma
DOID:3277 / thymus cancer
DOID:5041 / esophageal cancer
DOID:263 / kidney cancer
DOID:2394 / ovarian cancer
DOID:175 / vascular cancer
DOID:9256 / colorectal cancer
DOID:4606 / bile duct cancer
DOID:11934 / head and neck cancer
DOID:2531 / hematologic cancer
DOID:1319 / brain cancer
DOID:1785 / pituitary cancer
DOID:9253 / gastrointestinal stromal tumor
DOID:5158 / pleural cancer
DOID:184 / bone cancer
DOID:1612 / breast cancer
DOID:11239 / appendix cancer
DOID:1781 / thyroid cancer
DOID:2174 / ocular cancer
DOID:0060073 / lymphatic system cancer
DOID:10534 / stomach cancer
DOID:8618 / oral cavity cancer
DOID:3953 / adrenal gland cancer
DOID:1793 / pancreatic cancer
DOID:1192 / peripheral nervous system neoplasm
DOID:2998 / testicular cancer
DOID:1324 / lung cancer
DOID:3121 / gallbladder cancer
DOID:4159 / skin cancer
DOID:3571 / liver cancer
DOID:363 / uterine cancer
DOID:3070 / malignant glioma
DOID:4362 / cervical cancer
DOID:11054 / urinary bladder cancer

Features

The BioMuta pipeline comprises two steps:

Download

Downloads mutation lists from each source. TBA: cBioPortal fields, cBioPortal studies

Convert

Formats all resources to the BioMuta standard for both data and field structure.

Installation

Clone the Repository:

git clone https://github.com/GW-HIVE/biomuta-old.git

Set Up a Virtual Environment (Optional):

While in the root directory, run source env/bin/activate.

Install Dependencies: (to be implemented)

pip3 install -r requirements.txt

Configuration

Before running the scripts, please modify the paths in the config.json file to match your local machine setup. The file contains important directory paths used by the scripts, including paths for downloading data, storing results, and other resources.

Open the config.json file located in the root of the repository.
Modify the following fields to point to the correct directories on your machine:
- "downloads": Path to the directory where raw data will be saved.
- "generated_datasets": Path to the directory where processed datasets will be stored.
- "mapping": Path to the directory containing mapping files. You will only need to modify the path to the root directory since mapping files are already included in the repository.

Example config.json:

{
  "downloads": "/path/to/downloads",
  "generated_datasets": "/path/to/generated_datasets",
  "mapping": "/root/pipeline/convert_step2/mapping"
}
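As a quick sanity check before running the scripts, the configuration can be loaded and checked for the three fields listed above. This is a minimal sketch, not part of the pipeline: the function names `load_config` and `missing_directories` are hypothetical, and it assumes config.json contains exactly the keys shown in the example.

```python
import json
from pathlib import Path

# Keys taken from the example config.json above.
REQUIRED_KEYS = ("downloads", "generated_datasets", "mapping")


def load_config(config_path):
    """Read config.json and return its directory entries as Path objects."""
    with open(config_path, encoding="utf-8") as handle:
        raw = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return {key: Path(raw[key]) for key in REQUIRED_KEYS}


def missing_directories(config):
    """Return the keys whose configured directory does not exist yet."""
    return [key for key, path in config.items() if not path.is_dir()]
```

Calling `missing_directories(load_config("config.json"))` from the repository root would then list any paths that still need to be created or corrected.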

Usage

Step 1: Download

Go to pipeline/download_step1/cbioportal

Script execution order (some scripts will be moved into appropriate directories later):
1. fetch_mutations.sh
2. cancer_types.py | integrate_cancer_types.sh

UniProt Accession Numbers

Extract GRCh37 chromosomal positions and write them out in BED format.

script: /pipeline/convert_step2/liftover/1_chr_pos_to_bed.py
output: /biomuta/generated/datasets/2024_10_22/liftover/hg19positions.bed
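The BED conversion can be illustrated with a short sketch. It is not the code in 1_chr_pos_to_bed.py. It assumes the input is a sequence of (chromosome, 1-based position) pairs for single-nucleotide variants, and it adds a "chr" prefix because the UCSC chain files use that naming. The function names and the fourth "name" column holding the original coordinate are illustrative choices.

```python
def to_bed_line(chrom, pos):
    """Convert a 1-based single-nucleotide position to a BED line.

    BED intervals are 0-based and half-open, so position p becomes
    start = p - 1 and end = p.
    """
    chrom = str(chrom)
    pos = int(pos)
    if pos < 1:
        raise ValueError(f"position must be >= 1, got {pos}")
    label = f"{chrom}:{pos}"  # original coordinate, kept as the BED name
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # assumption: UCSC-style chromosome names
    return f"{chrom}\t{pos - 1}\t{pos}\t{label}"


def write_bed(positions, out_path):
    """Write unique positions to a BED file and return how many were written."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for chrom, pos in positions:
            line = to_bed_line(chrom, pos)
            if line not in seen:
                seen.add(line)
                out.write(line + "\n")
    return len(seen)
```

For example, `to_bed_line("17", 7577121)` yields the tab-separated fields `chr17`, `7577120`, `7577121`, and `17:7577121`.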

Use UCSC LiftOver to convert to GRCh38 chromosomal positions.
1. Run 2_liftover.sh on hg19positions.bed with the chain file ucscHg19ToHg38.over.chain. Successfully mapped positions are written to hg38positions.bed and unmapped positions to unmapped.bed.
2. Run grep -v '^#' unmapped.bed > unmapped_ucsc.bed to delete the comment lines generated by liftOver.
3. Run the same script again, using unmapped_ucsc.bed as the input and ensembl_GRCh37_to_GRCh38.chain as the chain file. Successfully mapped positions are written to hg38positions_unmapped_by_uscs.bed and unmapped positions to unmapped_ensembl.bed.
Grab the corresponding ENSEMBL transcript ID.
1. For every chromosomal position in the input BED file, determine which transcript it falls in by checking which range in the GFF3 file contains it, then grab the corresponding ENSP.
Map to ENSEMBL protein ID.
Map to UniProt Canonical Accession Numbers.
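The range lookup against the GFF3 file can be sketched as follows. This is a simplified illustration rather than the pipeline's implementation. It assumes an Ensembl-style GFF3 in which CDS features carry a `protein_id` attribute and sequence names have no "chr" prefix, and it goes straight from position to ENSP through the CDS ranges instead of resolving the transcript first. The function names are hypothetical.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def load_cds_ranges(gff3_lines):
    """Collect (seqid, start, end, protein_id) for every CDS feature."""
    ranges = []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        protein_id = parse_attributes(cols[8]).get("protein_id")
        if protein_id:
            ranges.append((cols[0], int(cols[3]), int(cols[4]), protein_id))
    return ranges


def proteins_at(ranges, chrom, bed_start):
    """Return the protein IDs whose CDS range contains a BED position."""
    pos = bed_start + 1  # BED start is 0-based; GFF3 is 1-based, inclusive
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # assumption: GFF3 seqids lack the 'chr' prefix
    return sorted(
        {pid for seqid, start, end, pid in ranges
         if seqid == chrom and start <= pos <= end}
    )
```

The linear scan keeps the sketch short. With a full genome annotation and many positions, grouping the ranges by chromosome and using a sorted or interval-based index would likely be needed.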

Project Structure

License

Acknowledgements

The liftover from GRCh37 to GRCh38 was performed with the LiftOver command line tool developed by UCSC (insert link).

Setting config parameters

After cloning this repo, you will need to set the parameters given in pipeline/config.json.

Deprecated documentation

Requirements

The following must be available on your server:

Node.js and npm
docker

Setting config parameters

After cloning this repo, you will need to set the parameters given in conf/config.json. The "server" parameter can be "tst" or "prd" for the test or production server respectively. The "app_port" is the port on the host that should map to the docker container for the app.

Creating and starting docker container for the APP

From the "app" subdirectory, run the provided Python script to build and start the container:

python3 create_app_container.py -s {DEP}
docker ps --all

The last command should list all docker containers, and you should see the container you created, "running_hivelab_app_{DEP}". To start this container, the best way is to create a service file (/usr/lib/systemd/system/docker-hivelab-app-{DEP}.service) and place the following content in it.

[Unit]
Description=Glyds APP Container
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStart=/usr/bin/docker start -a running_hivelab_app_{DEP}
ExecStop=/usr/bin/docker stop -t 2 running_hivelab_app_{DEP}

[Install]
WantedBy=default.target

This will allow you to start/stop the container with the following commands, and ensure that the container will start on server reboot.

$ sudo systemctl daemon-reload 
$ sudo systemctl enable docker-hivelab-app-{DEP}.service
$ sudo systemctl start docker-hivelab-app-{DEP}.service
$ sudo systemctl stop docker-hivelab-app-{DEP}.service

Mapping APP and API containers to public domains

To map the APP and API containers to public domains (e.g. www.hivelab.org and api.hivelab.org), add Apache VirtualHost directives. These VirtualHost directives can be placed in a new file (e.g. /etc/httpd/conf.d/hivelab.conf).

<VirtualHost *:443>
  ServerName www.hivelab.org
  ProxyPass / http://127.0.0.1:{APP_PORT}/
  ProxyPassReverse / http://127.0.0.1:{APP_PORT}/
</VirtualHost>

where {APP_PORT} and {API_PORT} are the ports for the APP and API set in the conf/config.json file. You need to restart Apache after these changes using the following command:

$ sudo apachectl restart
