The BioMuta pipeline gathers mutation data from various sources and combines them into a single dataset under common field structure.
The sources included in the current version of BioMuta are:
BioMuta gathers mutation data for the following cancers:
- DOID:4045 / muscle cancer
- DOID:10283 / prostate cancer
- DOID:3565 / meningioma
- DOID:3277 / thymus cancer
- DOID:5041 / esophageal cancer
- DOID:263 / kidney cancer
- DOID:2394 / ovarian cancer
- DOID:175 / vascular cancer
- DOID:9256 / colorectal cancer
- DOID:4606 / bile duct cancer
- DOID:11934 / head and neck cancer
- DOID:2531 / hematologic cancer
- DOID:1319 / brain cancer
- DOID:1785 / pituitary cancer
- DOID:9253 / gastrointestinal stromal tumor
- DOID:5158 / pleural cancer
- DOID:184 / bone cancer
- DOID:1612 / breast cancer
- DOID:11239 / appendix cancer
- DOID:1781 / thyroid cancer
- DOID:2174 / ocular cancer
- DOID:0060073 / lymphatic system cancer
- DOID:10534 / stomach cancer
- DOID:8618 / oral cavity cancer
- DOID:3953 / adrenal gland cancer
- DOID:1793 / pancreatic cancer
- DOID:1192 / peripheral nervous system neoplasm
- DOID:2998 / testicular cancer
- DOID:1324 / lung cancer
- DOID:3121 / gallbladder cancer
- DOID:4159 / skin cancer
- DOID:3571 / liver cancer
- DOID:363 / uterine cancer
- DOID:3070 / malignant glioma
- DOID:4362 / cervical cancer
- DOID:11054 / urinary bladder cancer
BioMuta pipeline comprises two steps:
- Download
Downloads mutation lists from each source. TBA: cBioPortal fields, cBioPortal studies
- Convert
Formats all resources to the BioMuta standard for both data and field structure.
- Clone the Repository:
git clone https://github.com/GW-HIVE/biomuta-old.git
- Set Up a Virtual Environment (Optional):
While in the root directory, run source env/bin/activate
.
- Install Dependencies: (to be implemented)
pip3 install -r requirements.txt
Before running the scripts, please modify the paths in the config.json
file to match your local machine setup. The file contains important directory paths used by the scripts, including paths for downloading data, storing results, and other resources.
- Open the
config.json
file located in the root of the repository. - Modify the following fields to point to the correct directories on your machine:
"downloads"
: Path to the directory where raw data will be saved."generated_datasets"
: Path to the directory where processed datasets will be stored."mapping"
: Path to the directory containing mapping files. You will only need to modify the path to theroot
directory since mapping files are already included in the repository.
Example config.json
:
{
"downloads": "/path/to/downloads",
"generated_datasets": "/path/to/generated_datasets",
"mapping": "/root/pipeline/convert_step2/mapping"
}
- Go to
pipeline/download_step1/cbioportal
Script execution order (some scripts will be moved into appropriate directories later) 1 - fetch_mutations.sh 2 - cancer_types.py | integrate_cancer_types.sh
- Extract GRCh37 chromosomic positions and write out in BED format.
- script:
/pipeline/convert_step2/liftover/1_chr_pos_to_bed.py
- output:
/biomuta/generated/datasets/2024_10_22/liftover/hg19positions.bed
- Use UCSC LiftOver to convert to GRCh38 chromosomic positions.
- Run
2_liftover.sh
onhg19positions.bed
with the chain fileucscHg19ToHg38.over.chain
. You will get successfully mapped positions inhg38positions.bed
and unmapped positions inunmapped.bed
. grep -v '^#' unmapped.bed > unmapped_ucsc.bed
to delete comments generated by liftOver.- Run the same script but use unmapped.bed as the input and
ensembl_GRCh37_to_GRCh38.chain
as the chain file. You will get successfully mapped positions inhg38positions_unmapped_by_uscs.bed
and unmapped positions inunmapped_ensembl.bed
.
- Run
- Grab the corresponding ENSEMBL transcript ID.
- For every chromosomal position in the input BED file, in which transcript does it fall? See which range each position falls into in the gff3 file. Grab the corresponding ENSP.
- Map to ENSEMBL protein ID.
- Map to UniProt Canonical Accession Numbers.
The liftover from GRCh37 to GRCh38 was performed with the LiftOver command line tool developed by UCSC (insert link).
After cloning this repo, you will need to set the parameters given in pipeline/config.json.
The following must be available on your server:
- Node.js and npm
- docker
After cloning this repo, you will need to set the parameters given in cof/config.json. The "server" paramater can be "tst" or "prd" for test or production server respectively. The "app_port" is the port in the host that should map to docker container for the app.
From the "app" subdirectory, run the python script given to build and start container:
python3 create_app_container.py -s {DEP}
docker ps --all
The last command should list docker all containers and you should see the container you created "running_hivelab_app_{DEP}". To start this container, the best way is to create a service file (/usr/lib/systemd/system/docker-hivelab-app-{DEP}.service), and place the following content in it.
[Unit]
Description=Glyds APP Container
Requires=docker.service
After=docker.service
[Service]
Restart=always
ExecStart=/usr/bin/docker start -a running_hivelab_app_{DEP}
ExecStop=/usr/bin/docker stop -t 2 running_hivelab_app_{DEP}
[Install]
WantedBy=default.target
This will allow you to start/stop the container with the following commands, and ensure that the container will start on server reboot.
$ sudo systemctl daemon-reload
$ sudo systemctl enable docker-hivelab-app-{DEP}.service
$ sudo systemctl start docker-hivelab-app-{DEP}.service
$ sudo systemctl stop docker-hivelab-app-{DEP}.service
To map the APP and API containers to public domains (e.g. www.hivelab.org and api.hivelab.org), add apache VirtualHost directives. This VirtualHost directive can be in a new f ile (e.g. /etc/httpd/conf.d/hivelab.conf).
<VirtualHost *:443>
ServerName www.hivelab.org
ProxyPass / http://127.0.0.1:{APP_PORT}/
ProxyPassReverse / http://127.0.0.1:{APP_PORT}/
</VirtualHost>
where {APP_PORT} and {API_PORT} are your port for the APP and API ports in conf/config.json file. You need to restart apache after this changes using the following command:
$ sudo apachectl restart