Explore cBioPortal #5

Open
5 tasks done
jeet-vora opened this issue Sep 17, 2024 · 4 comments
@jeet-vora

jeet-vora commented Sep 17, 2024

  • Explore cBioPortal
  • See what data is available
  • What is their license
  • How can we download the data
  • etc
@mariacuria

What data is in cBioPortal?

What goes into cBioPortal (ref: introductory slides)

  • Data sources
    • TCGA
      • NIH-sponsored research project containing mutation, copy number alteration, mRNA expression data from primary, untreated tumors
    • ICGC
    • Count Me In
    • CCLE (Cancer Cell Line Encyclopedia)
    • AACR (American Association for Cancer Research) Project GENIE (Genomics Evidence Neoplasia Information Exchange)
      • clinical sequencing data from 19 cancer centers worldwide containing mutation and some copy number alteration data from primary and metastatic tumors pre- and post-treatment
  • Background biological data
    • networks
    • 3D protein structure
  • Curated effect and therapy implications
    • OncoKB (Precision Oncology Knowledge Base)
    • CIViC
    • My Cancer Genome
  • Predicted functional effect
    • mutationassessor.org
    • PolyPhen-2
  • Variant recurrence
    • COSMIC
    • Cancer Hotspots

Available data types

  • Omic data
    • non-synonymous mutations
    • fusions
    • DNA copy-number data (putative, discrete values per gene, e.g. "deeply deleted" or "amplified", as well as log2 or linear copy number data)
    • mRNA and microRNA expression data
    • protein-level and phosphoprotein level data (RPPA or mass spectrometry based)
    • DNA methylation data
  • De-identified Clinical data
    • treatments
    • survival

Available cancer studies

  • Adrenal Gland Cancer (DOID:3953)
  • Brain Cancer (DOID:1319)
  • Bone Cancer (DOID:184)
  • Breast Cancer (DOID:1612)
  • Cardiovascular Cancer (DOID:176)
  • Cell Type Cancer (DOID:0050687)
  • Cervical Cancer (DOID:4362)
  • Colorectal Cancer (DOID:9256)
  • Endocrine Organ Benign Neoplasm (DOID:0060089)
  • Esophageal Cancer (DOID:5041)
  • Gastroesophageal Cancer (DOID:0080374)
  • Gastrointestinal System Benign Neoplasm (DOID:0050624)
  • Head and Neck Cancer (DOID:11934)
  • Hematologic Cancer (DOID:2531)
  • Hepatobiliary System Cancer (DOID:0080355)
  • Intestinal Cancer (DOID:10155)
  • Kidney Cancer (DOID:263)
  • Liver Cancer (DOID:3571)
  • Lung Cancer (DOID:1324)
  • Musculoskeletal System Cancer (DOID:0060100)
  • Nervous System Benign Neoplasm (DOID:0060115)
  • Oral Cavity Cancer (DOID:8618)
  • Ovarian Cancer (DOID:2394)
  • Pancreatic Cancer (DOID:1793)
  • Peripheral Nervous System Neoplasm (DOID:1192)
  • Prostate Cancer (DOID:10283)
  • Sensory System Cancer (DOID:0060116)
  • Skin Cancer (DOID:4159)
  • Stomach Cancer (DOID:10534)
  • Testicular Cancer (DOID:2998)
  • Thoracic Cancer (DOID:5093)
  • Thyroid Gland Cancer (DOID:1781)
  • Urinary Bladder Cancer (DOID:11054)
  • Uterine Cancer (DOID:363)

License

  • The cBioPortal software is available under an open source license via GitHub (ref).
  • It is licensed under the AGPL (GNU Affero General Public License)
    • Free to download and use
    • Modifications welcome

How to download the data (ref)

  • Datasets Page
    A zip file for each study on cbioportal.org can be downloaded from the Datasets Page. The R client cBioPortalData can also download all of these files programmatically.
  • Datahub
    The files for each study are also available from the datahub repository. This is essentially the extracted version of the zip files on the Datasets Page. Note that it is a Git LFS repository, so those familiar with git may prefer this option.
  • API and API Clients
    Besides downloading all the study data, one can request slices of the data through cBioPortal's REST API (ref). A slice could be, e.g., "give me all the mutation data for one patient" or "get me all EGFR mutations for a particular group of samples". API clients are available in a variety of languages, including bash, R and Python. See the API documentation for more information.
    • R clients
      • cBioPortalData (recommended)
      • cbioportalR (recommended)
    • Python clients
      • bravado
      • cbio_py
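
A minimal sketch of requesting a slice through the REST API directly, using only the Python standard library instead of one of the clients listed above. The base URL is the public cbioportal.org instance. The `<study_id>_mutations` and `<study_id>_all` identifiers follow a common naming convention but are assumptions here, as is the `projection` value, so check them against the API documentation before relying on them. The helper names (`build_url`, `get_json`, `fetch_mutations`) are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://www.cbioportal.org/api"  # public instance


def build_url(path, **params):
    """Join the base URL, an endpoint path and optional query parameters."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def get_json(path, **params):
    """GET an endpoint and decode the JSON body (performs a network call)."""
    request = urllib.request.Request(
        build_url(path, **params), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def fetch_mutations(study_id):
    """Fetch the mutations of all samples in one study.

    Assumption: the molecular profile is named "<study_id>_mutations" and
    the sample list "<study_id>_all"; verify both via the API first.
    """
    return get_json(
        f"molecular-profiles/{study_id}_mutations/mutations",
        sampleListId=f"{study_id}_all",
        projection="DETAILED",  # assumed to include gene-level details
    )
```

Calling `get_json("studies")` should return the list of available studies, and `fetch_mutations(<study id>)` the mutation records for one of them. Nothing is requested until one of these functions is called.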

@mariacuria

UPD: figure out download API

@mariacuria

UPD task list

  • Compile all cBioPortal mutation files into one file or find a file that compiles all studies.
  • Create an Excel sheet with all headers from the cBioPortal mutation data, with 4-5 example values each. Compare them with the BioMuta headers (ref: slides). We want to cover as many of the BioMuta headers as possible.
  • Make note of differences, e.g. additional mapping is needed from chromosome position to protein position, or cBioPortal gives the gene name but not the UniProt ID.
    • Pay attention to mutation type: we are only interested in non-synonymous mutations (those that lead to an amino acid change or a stop codon).
    • Look out for how the frequency is calculated (per patient or allelic). cBioPortal provides clinical data: number of patients, number of samples, and cancer type. From that we want to calculate the patient frequency: how many patients have the same mutation in a given cancer? E.g. for lung cancer and a mutation at position 119 in EGFR, how many patients carry it? We need to check whether this information is available.
  • The cancer types listed are based on Disease Ontology: skin cancer = melanoma (same disease, different name). We want to standardize cancer names using the Disease Ontology cancer slim (see paper).
  • Does cBioPortal have more data than BigQuery + dbGaP?
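
A rough sketch of the patient-frequency calculation described in the task list, assuming the mutation rows carry the MAF-style columns `Hugo_Symbol`, `Variant_Classification`, `Tumor_Sample_Barcode` and `HGVSp_Short`, and that the clinical data gives a sample-to-patient and a patient-to-cancer mapping. The set of classifications counted as non-synonymous is an assumption to be agreed on, and the toy records are invented purely to show the counting.

```python
from collections import defaultdict

# Assumption: Variant_Classification values treated as non-synonymous.
NON_SYNONYMOUS = {"Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation"}


def patient_frequency(mutations, sample_to_patient, cancer_of_patient):
    """Count distinct patients per (cancer, gene, protein change).

    Returns {(cancer, gene, change): (patients_with_mutation, patients_in_cancer)}.
    """
    carriers = defaultdict(set)
    for row in mutations:
        if row["Variant_Classification"] not in NON_SYNONYMOUS:
            continue  # skip silent and other out-of-scope variants
        patient = sample_to_patient[row["Tumor_Sample_Barcode"]]
        key = (cancer_of_patient[patient], row["Hugo_Symbol"], row["HGVSp_Short"])
        carriers[key].add(patient)  # a set, so a patient is counted once

    totals = defaultdict(int)
    for cancer in cancer_of_patient.values():
        totals[cancer] += 1

    return {key: (len(p), totals[key[0]]) for key, p in carriers.items()}


# Invented toy data: patient P1 has two samples with the same mutation.
sample_to_patient = {"S1": "P1", "S2": "P1", "S3": "P2", "S4": "P3"}
cancer_of_patient = {"P1": "Lung Cancer", "P2": "Lung Cancer", "P3": "Lung Cancer"}
mutations = [
    {"Hugo_Symbol": "EGFR", "Variant_Classification": "Missense_Mutation",
     "Tumor_Sample_Barcode": "S1", "HGVSp_Short": "p.L858R"},
    {"Hugo_Symbol": "EGFR", "Variant_Classification": "Missense_Mutation",
     "Tumor_Sample_Barcode": "S2", "HGVSp_Short": "p.L858R"},
    {"Hugo_Symbol": "EGFR", "Variant_Classification": "Missense_Mutation",
     "Tumor_Sample_Barcode": "S3", "HGVSp_Short": "p.L858R"},
    {"Hugo_Symbol": "EGFR", "Variant_Classification": "Silent",
     "Tumor_Sample_Barcode": "S4", "HGVSp_Short": "p.="},
]

freq = patient_frequency(mutations, sample_to_patient, cancer_of_patient)
print(freq)
```

Counting patients through a set rather than counting rows keeps a patient with several samples from inflating the frequency, which is the distinction between patient and allelic frequency raised above.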

@mariacuria

mariacuria commented Dec 6, 2024

  • Show how many mutation sites are common, unique to cBioPortal, and unique to BioMuta
  • Where do the unique sites come from, i.e. which source?
  • How many sites (and in which cancers) will we lose in the next release of GlyGen if we switch to the cBioPortal pipeline?
  • How many annotations will we lose? E.g. we retain a site at position 22 in EGFR, but BioMuta says it's a skin cancer while cBioPortal says it's an ovarian cancer. This is a change in annotation. Come up with a way to showcase this.
  • What are the new cancer terms I found?
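
One possible way to frame the comparison above, sketched with plain sets. It assumes each resource can be reduced to a mapping from a site, represented here as a hypothetical `(uniprot_ac, position)` tuple, to the set of cancer terms annotated at that site. The function name and the toy entries are invented for illustration and are not real BioMuta or cBioPortal records.

```python
def compare_sites(biomuta, cbio):
    """Compare two {site: set_of_cancer_terms} mappings."""
    common = biomuta.keys() & cbio.keys()
    return {
        "common": common,
        "only_biomuta": biomuta.keys() - cbio.keys(),  # sites we would lose
        "only_cbio": cbio.keys() - biomuta.keys(),     # sites we would gain
        # retained sites whose cancer annotation differs between resources
        "annotation_changed": {s for s in common if biomuta[s] != cbio[s]},
    }


# Invented toy data keyed by (UniProt accession, protein position).
biomuta = {
    ("P00533", 22): {"skin cancer"},
    ("P00533", 119): {"lung cancer"},
    ("P04637", 175): {"breast cancer"},
}
cbio = {
    ("P00533", 22): {"ovarian cancer"},
    ("P00533", 119): {"lung cancer"},
    ("P01116", 12): {"pancreatic cancer"},
}

report = compare_sites(biomuta, cbio)
for name, sites in report.items():
    print(name, len(sites), sorted(sites))
```

The sizes of the four sets could feed a Venn diagram or a per-cancer table. Grouping `only_biomuta` by cancer term would answer which cancers lose sites.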
