Skip to content

Latest commit

 

History

History
620 lines (515 loc) · 25.8 KB

README.md

File metadata and controls

620 lines (515 loc) · 25.8 KB

Table of contents

  1. Microarray
  2. RNA-Seq
  3. TCGA
  4. Integrating Differential Gene Expression Analyses
  5. Downstream Analyses of Mouse Differential Gene Expression
  6. CCLE
  7. Pan-Cancer Rank Aggregation
  8. ChIP-Seq
  9. Figures
  10. Appendix

Microarray

The microarray is deposited at GEO accession number: GSE143254 with following two subseries:

Normalizing Raw Data

The following command will obtain the raw and normalized microarray values and put them into a sub-directory, named processed_microarray, created within the microarray directory using the Mouse-WG6 annotation file supplied (to map probe IDs to gene IDs/symbols). The normalized values deposited on GEO will be reproduced and, separately, the normalized values used in the paper (where poorly-detected probes were filtered) will also be produced.

Rscript microarray/normalize.r microarray/processed_microarray microarray/annotation/MouseWG-6v2.csv

Downloading normalized data

The following command will download the normalized microarray values (from GEO) into a sub-directory, named processed_microarray, created within the microarray directory. (Use this if you only want to obtain the normalized data as deposited on GEO; keep in mind that the analysis on the paper doesn't use these values).

Rscript microarray/download.r microarray/processed_microarray

Differential Gene Expression (Tissue)

The following command will perform differential gene expression analysis on normalized tissue microarray data deposited in microarray/processed_microarray/tissue/ (described previously). The output of the analysis will be deposited in that same directory. Each tissue will have a set of files produced from the analysis. For example, for liver, we'll have a set of files prefixed with liver containing differential gene expression analysis and expression values. Both the differential gene expression analysis and expression values are done at probe-level and are aggregated at gene-level (for differential expression, Fisher's method is used to combine p-values and the probe with the highest absolute value of log2 fold change is used as effect size; for expression values, the mean of the probe intensities is used for gene-level aggregation).

Rscript microarray/diffexpr.r microarray/processed_microarray microarray/annotation/MouseWG-6v2.csv

RNA-Seq

Pipeline readme here: https://github.com/Yenaled/felsher/blob/master/rnaseq/rna-seq/README.md

TCGA

Obtaining TCGA data from UCSC Xena Toil

Let's download STAR-aligned RSEM-quantified TCGA data from UCSC Xena Toil and put them into the tcga/ directory.

We obtain log2(x+1)-transformed DESeq2-normalized RSEM values (for between-samples comparisons) as follows:

wget --continue https://toil.xenahubs.net/download/TCGA-GTEx-TARGET-gene-exp-counts.deseq2-normalized.log2.gz
zcat < TCGA-GTEx-TARGET-gene-exp-counts.deseq2-normalized.log2.gz|awk -v OFS='\t' 'NR==1{for (i=1; i<=NF; i++) if (!($i ~ /TCGA/)) a[i]; 
     else printf "%s%s", $i, OFS; print ""; next}
     {for (i=1; i<=NF; i++) if (!(i in a)) printf "%s%s", $i, OFS; print "" }' | gzip > tcga/TCGA.rsem.deseq2.log2.tsv.gz

We obtain gene-level log2(x+0.001)-transformed TPM values (for within-samples comparisons) as follows:

wget --continue https://toil.xenahubs.net/download/TcgaTargetGtex_rsem_gene_tpm.gz
zcat < TcgaTargetGtex_rsem_gene_tpm.gz|awk -v OFS='\t' 'NR==1{for (i=1; i<=NF; i++) if (!($i ~ /TCGA/)) a[i]; 
    else printf "%s%s", $i, OFS; print ""; next}
    {for (i=1; i<=NF; i++) if (!(i in a) || i==1) printf "%s%s", $i, OFS; print "" }' | gzip > tcga/TCGA.rsem.TPM.log2.tsv.gz

Alternately, the TCGA normalized RSEM and TPM files are posted here (which you can download and put into the tcga/ directory):

Obtaining TCGA batch identifier data (Tissue Source Site and Batch ID)

The final results of these steps are already in this github repository, but here is how those final results were obtained:

  1. Go to the GDC portal, navigate to the repository, add all the biospecimen files (bcr xml format) to your cart, then download the manifest file (save the manifest file as gdc_biospecimen.txt).
  2. Using NCI's gdc-client software, run the following command to download the file into the path: tcga/biospecimen:
gdc-client download -m gdc_biospecimen.txt -d tcga/biospecimen
  1. Then, run the following command to format the downloaded batch identifiers into tcga/tcga_batches.csv:
Rscript tcga/processgdc.r tcga/biospecimen tcga/tcga_batches.csv
  1. Finally, download TCGA sample information into the tcga/ directory (more details on TCGA sample information are described below).
wget --continue https://gdc.cancer.gov/files/public/file/tcga_code_tables.zip
unzip tcga_code_tables.zip bcrBatchCode.tsv
unzip tcga_code_tables.zip tissueSourceSite.tsv
mv bcrBatchCode.tsv tcga/
mv tissueSourceSite.tsv tcga/

Processing downloaded TCGA data

Now, we can process the TCGA data to only include samples of interest and to batch-correct batch ID and tissue source site information. Note that TCGA sample information (e.g. tissue source sites, tumor types, tumor vs. normal, etc.) is described at https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/ and how that information is encoded into TCGA barcodes can be found at https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/

Process TCGA data to get our final batch-corrected output in the tcga/ directory:

Rscript tcga/processTCGAbatches.r tcga/TCGA.rsem.deseq2.log2.tsv.gz tcga/bcrBatchCode.tsv tcga/tissueSourceSite.tsv tcga/TCGA.processed. tcga/tcga_batches.csv

The output files from the preceding command (i.e. the processed files associated with the batch-corrected data) will include:

  • TCGA.processed.tumor_normal_lfc.tsv.gz - Log2 fold changes of paired tumor versus matched normal
  • Batch correction:
    • TCGA.processed.info_corrected.csv - Information about cancer types, tumor/normal, batches, etc. for each sample
    • TCGA.processed.batch_effects_view.pdf - PCA plots of the result of batch correction
    • TCGA.processed.tumors_corrected.tsv.gz - Tumor sample expression data (n = 9354 primary cancers)
  • Non-batch corrected:
    • TCGA.processed.info.csv - Information about cancer types, tumor/normal, batches, etc. for each sample
    • TCGA.processed.tumors.tsv.gz - Tumor sample expression data

Note: The smaller files are in the tcga/ directory of this Github repository. However, the larger .tsv.gz files can be downloaded from:

Obtaining TCGA clinical data (optional)

Note: Not used in paper

For clinical data (for survival analysis), use the TCGA-CDR data file TCGA-CDR-SupplementalTableS1.xlsx which can be downloaded from the PanCanAtlas at https://api.gdc.cancer.gov/data/1b5f413e-a8d1-4d10-92eb-7c4ae739ed81 -- this file can also be found in the tcga/ directory of this GitHub repository.

TCGA survival analysis (Cox regression meta-analysis) (optional)

Note: Not used in paper

To obtain Cox regression z-scores and meta-z scores for TCGA data, run the following:

Rscript tcga/tcga_survival.r tcga/TCGA-CDR-SupplementalTableS1.xlsx tcga/TCGA.processed.tumors_corrected.tsv.gz tcga/TCGA.processed.info_corrected.csv 4 tcga/tcga_zscores.csv

The file tcga/tcga_zscores.csv will contain the pan-cancer survival z-scores (a gzip-compressed version of the file, tcga/tcga_zscores.csv.gz, is available in the tcga/ directory of this GitHub repository).

Integrating Differential Gene Expression Analyses

The following command will integrate information from multiple differential gene expression analyses: microarrays of liver, kidney, and lung cancers as well as RNA-seq of lymphomas. We control FDR at 0.05 and set the fold change threshold to 2 (i.e. a log2 Fold Change of 1). The output will be placed in output/mouse_de/.

Rscript overlap.r 2 0.05 output/mouse_de/ liver_myc,kidney_myc,lung_myc,lung_mycras,lung_ras,tall_myc,eumyc_myc "HCC,RCC,LAC (MYC),LAC (MYC+KRAS),LAC (KRAS),T-ALL,BCL" microarray/processed_microarray/tissue/liver_diffexpr_gene.csv,microarray/processed_microarray/tissue/kidney_diffexpr_gene.csv,microarray/processed_microarray/tissue/lung_diffexpr_gene.csv,microarray/processed_microarray/tissue/lung_mycras_diffexpr_gene.csv,microarray/processed_microarray/tissue/lung_ras_diffexpr_gene.csv,rnaseq/tall_rnaseq/kallisto_aligned/tallmycon_sleuth_results_genes.csv,rnaseq/eumyc_rnaseq/kallisto_aligned/eumycon_sleuth_results_genes.csv microarray/annotation/MouseWG-6v2.csv rnaseq/annotation/gencode_GRCm38_vM15.csv

Downstream Analyses of Mouse Differential Gene Expression

Mouse tissue enrichment analysis

Run the following script for mouse tissue enrichment analysis. Output folder: output/enrichr_mouse_tissue

Rscript analyze_mouse_tissue_enrichment.r

Mouse DE genes GO analysis

Run the following script to perform GSEA of GO terms for mouse DE genes. For the top enriched GO terms, the script will also perform GO overrepresentation analysis for relevant mouse gene atlas tissue gene sets. This is to assess whether enriched GO terms in mouse DE genes have tissue-lineage dependence. Output folder: output/enrichr_mouse_go

The script has two parts. The first part as follows and will generate the GSEA results as well as a list of top GO terms:

Rscript analyze_mouse_go_enrichment.r

The GSEA results will be outputted in go_gsea.xlsx and the top GO terms will be outputted in topterms.txt in the output folder. We then manually curate the top GO terms and assign them to broader categories for ease of display. The category assignment should be in a .csv file with each column name being a category name and with each GO term being assigned to a specific column. One column should be named "Other" and this contains GO terms that couldn't be assigned to a particular category (and therefore won't be displayed on the heatmap). Finally, we run the second part of the script -- supplying the .xlsx GSEA results (generated from the first part) and the .csv category assignment file (that we manually created):

Rscript analyze_mouse_go_enrichment.r output/enrichr_mouse_go/go_gsea.xlsx data/go_term_collapse.csv

CCLE

Obtaining CCLE data from Broad Institute

Let's download STAR-aligned RSEM-quantified CCLE data (n = 1019 cells) from The Broad Institute and put them into the ccle/ directory.

We obtain TPM values from RSEM quantification as follows:

wget --continue https://data.broadinstitute.org/ccle/CCLE_RNAseq_rsem_genes_tpm_20180929.txt.gz
mv CCLE_RNAseq_rsem_genes_tpm_20180929.txt.gz ccle/

We obtain log2(x+1)-transformed DESeq-normalized TPM values as follows:

Rscript ccle/ccle_normalize.r ccle/CCLE_RNAseq_rsem_genes_tpm_20180929.txt.gz ccle/CCLE_RNAseq_rsem_genes_tpm.deseq.log2.tsv
gzip ccle/CCLE_RNAseq_rsem_genes_tpm.deseq.log2.tsv

The normalized TPM file should be OK for either within-sample (length-normalized) or between-sample (size factor normalized) analyses.

Alternately, the TPM and normalized TPM files are posted here (which you can download and put into the ccle/ directory):

Pair-wise correlations

To generate pair-wise Pearson correlations with MYC, run the following:

Rscript ccle/ccle_pairwise_correlation.r ENSG00000136997 ccle/CCLE_RNAseq_rsem_genes_tpm.deseq.log2.tsv.gz ccle/CCLE_myc_cor.csv

The output file is ccle/CCLE_myc_cor.csv

Pan-Cancer Rank Aggregation

Rank Aggregation

To generate pair-wise correlations with MYC and to perform rank aggregation with Robust Rank Aggregation (RRA) of the pair-wise Pearson correlation coefficients across all TCGA cancer types, run the following:

Rscript rankaggregation.r tcga/TCGA.processed.tumors_corrected.tsv.gz tcga/TCGA.processed.info_corrected.csv output/tcga_correlation/

The output files will be located in output/tcga_correlation/ -- the file myc_rra.csv contains the RRA p-values and the file myc_correlations.csv (which you can compress to make myc_correlations.csv.gz since the file is rather large) contains the pair-wise Pearson correlation coefficients.

Combined Human + Mouse Signature

To combine the human and mouse signatures to generate a signature containing genes differentially expressed (in the same direction) in at least 4 out of 5 MYC mouse tumor models and with median Pearson's r > 0.30 (and RRA adjusted p-value < 0.05) across the TCGA cancer types, run the following:

Rscript integrative_signature.r

The output files will be found in output/integrative_signature/

Downstream Analysis of Combined Signature

To perform downstream analysis of the combined signature generated above (e.g. survival analysis, CCLE correlation, how it compares to the MYC gene sets which were processed in the genesets/ folder, etc.), run the following:

Rscript analyze_signatures.r

Note: To perform ONLY the analysis relevant to the paper (i.e. without survival analysis and comparisons to other MYC gene sets), run the following instead:

Rscript analyze_signatures2.r

The output files will be found in output/signature_analysis/

ChIP-Seq

Details are located at https://github.com/Yenaled/felsher/tree/master/chipseq

Figures

Figures in paper were generated as follows:

    • A: Self-created figure
    • B:
      • PDF file: output/mouse_de/venn.pdf
      • Raw output: output/mouse_de/
      • Link to method
    • C/D:
      • Graphpad Prism file: prism/mouse_tissue_enrichment.pzfx
      • Raw output: output/enrichr_mouse_tissue/
      • Link to method
    • A:
      • PDF file: output/enrichr_mouse_go/heatmap_go.pdf
      • Raw output:
        • output/enrichr_mouse_go/go_gsea.xlsx
        • output/enrichr_mouse_go/tissue_go.xlsx
        • output/enrichr_mouse_go/heatmap_go_terms.csv
      • Link to method
    • A/B:
      • PDF files: chipseq/output/figures
      • Raw output: chipseq/output/mat, chipseq/output/se_h3k27ac_data.xlsx
      • Link to method
    • A:
      • PDF file: output/integrative_signature/volcano.pdf
      • Raw output:
        • output/integrative_signature/correlation_data.csv
      • Link to method
    • B:
      • Self-created figure
      • Raw output:
        • output/integrative_signature/log.txt
      • Link to method
    • C:
      • PDF file: output/integrative_signature/barcode_top_5_pathways_and_embryonic.pdf
      • Raw output:
        • output/integrative_signature/correlation_data.csv
        • output/integrative_signature/signature_go_bp.xlsx
        • output/integrative_signature/signature_tumorigenesis_go_bp.xlsx
      • Link to method
    • D:
      • STRING: functional protein association networks - string-db.org
      • Procedure:
        • Copy and paste list of gene symbols from output/integrative_signature/signature.csv, select Homo Sapiens as organism, set "meaning of network edges" to be "confidence", select all for "active interaction sources", set "minimum required interaction score" to be 0.400, and set "max number of interactors to show" to be "query proteins only".
    • E:
      • PDF file: output/signature_analysis/ccle_clustering.pdf
      • Raw output:
        • output/signature_analysis/ccle_clustering.csv
      • Link to method
    • F:
      • Graphpad Prism file: qpcr/qpcr_validation.pzfx
      • Raw output:
        • qpcr/summary_p493.xlsx
        • qpcr/summary_ec4.xlsx
    • A-D generated by experiment (immunohistochemistry); E: See Supplementary Figure 7. All added during peer-review
    • Self-created figure

Tables in paper were generated as follows:

    • additional_tables/mouse_strains.docx
      • Self-created table

Supplementary figures in paper were generated as follows:

    • Self-created figure
    • output/mouse_de/upset_up.pdf output/mouse_de/upset_down.pdf
    • chipseq/output/figures/plot_eumyc_h3k27ac_dbsuper_using_hccgenes.pdf chipseq/output/figures/plot_hcc_h3k27ac_dbsuper_using_eumycgenes.pdf
      • Raw output: chipseq/output/mat/eumyc_h3k27ac_dbsuper_log2FC.mat.gz, chipseq/output/mat/hcc_h3k27ac_dbsuper_log2FC.mat.gz
      • Link to method
    • output/signature_analysis/tcga_lfc.pdf
    • qpcr/myc.pdf
      • Graphpad Prism file: qpcr/qpcr_validation.pzfx
      • Raw output: qpcr/summary_p493.xlsx, qpcr/summary_ec4.xlsx
    • Generated by experiment (immunohistochemistry); added during peer-review
    • output/signature_analysis/tcga_metaz.pdf
      • Raw output: output/signature_analysis/tcga_metaz.csv
      • This survival analysis was added during peer-review, and requires running the full `Rscript analyze_signatures.r` rather than the abridged `Rscript analyze_signatures2.r` (Link to method)

Supplementary tables in paper were generated as follows:

    • output/enrichr_mouse_tissue/tissue_enrichment_table.xlsx
    • output/integrative_signature/correlation_data_formatted.csv
    • output/integrative_signature/signature_myc_correlation.csv output/integrative_signature/signature_tumorigenesis.csv output/integrative_signature/signature.csv
    • output/integrative_signature/signature_go_bp.xlsx output/integrative_signature/signature_go_mouse_cells.xlsx output/integrative_signature/signature_tumorigenesis_go_bp.xlsx output/integrative_signature/signature_tumorigenesis_mouse_cells.xlsx
    • additional_tables/qpcr_primers.docx
      • Self-created table
    • tcga/tcga_zscores.csv.gz
      • This survival analysis was added during peer-review, and requires running the full `Rscript analyze_signatures.r` rather than the abridged `Rscript analyze_signatures2.r` (Link to method)

Appendix

Software Versions

  • Graphpad Prism 8.4.1
  • gdc-client 1.5.0
  • R version 3.6.1
  • GEOquery 2.54.0
  • lumi 2.38.0
  • limma 3.42.0
  • aggregation 1.0.1
  • UpSetR 1.4.0
  • VennDiagram 1.6.20
  • openxlsx 4.1.5
  • kallisto 0.44.0
  • sleuth 0.30.0
  • DESeq2 1.22.2
  • GenomicFeatures 1.38.0
  • fgsea 1.12.0
  • enrichR 2.1
  • XML 3.98-1.20
  • sva 3.34.0
  • BiocParallel 1.20.0
  • ComplexHeatmap 2.2.0
  • ggplot2 3.2.1
  • ggbeeswarm 0.6.0
  • survival 3.1-12
  • RobustRankAggreg 1.1
  • Rtsne 0.15

Gene identifiers ortholog mapping

To perform mapping between gene identifiers (e.g. NCBI ID, ENSEMBL, Symbol, etc.) or orthologs for annotation purposes, please see the files distributed here:

https://ftp.ncbi.nlm.nih.gov/gene/DATA/

Here are some files of note:

If the links above are broken, the NCBI annotation files (accessed in the year 2020) can be found here: https://github.com/Yenaled/felsher/releases/download/felsher/ncbi_annotations.zip

Loading files in Excel

When loading .csv files or files containing tab-separated values in Excel, use File -> Import to import the files into Excel and set "Column data format" for all columns containing gene names/symbols to "Text" rather than "General". This will allow you to avoid issues with gene names being reformatted (e.g. March7 becoming 7-Mar).