Despite the awful acronym (if it can even be called one), GOrilla is an excellent tool for discovering GO term enrichment in a ranked list of genes. Unfortunately, the tool is only available (to my knowledge) through a web interface, so I wrote this small script to query GOrilla programmatically from the command line. The script submits an enrichment query for a single target ranking of genes for H. sapiens, then scrapes the results for GO Biological Process, Function, and Component. It can easily be extended to cover GOrilla's other usage modes.
python gorilla_query.py (-g [GENE_LIST] | -i [RUN_ID]) -o [OUTPUT_DIR]
GENE_LIST
- A file containing a ranking of \n-delimited genes
- Genes are preferably given as official gene symbols, but Ensembl IDs are also acceptable (see the web interface for more details)
- Either this or -i must be specified, not both
RUN_ID
- The run ID for a known GOrilla run
- If specified, skips query submission and directly scrapes the results
- Either this or -g must be specified, not both
OUTPUT_DIR
- GOrilla Query outputs three files into this directory (which is created if it does not already exist), corresponding to the gene enrichment tables for Process, Function, and Component
- Files will be OUTPUT_DIR/process.tsv, OUTPUT_DIR/function.tsv, and OUTPUT_DIR/component.tsv
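The option rules above (exactly one of -g or -i, plus a required -o whose directory is created on demand) can be sketched with argparse. This is only an illustration of the described interface, not the script's actual code; the function names `build_parser` and `output_paths` and the `dest` names are hypothetical.

```python
import argparse
import os


def build_parser():
    """Build a CLI parser mirroring the usage line (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="gorilla_query.py")
    # -g and -i are mutually exclusive, and exactly one of them is required.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-g", dest="gene_list", metavar="GENE_LIST",
                        help="file of newline-delimited ranked genes")
    source.add_argument("-i", dest="run_id", metavar="RUN_ID",
                        help="run ID of a known GOrilla run")
    parser.add_argument("-o", dest="output_dir", metavar="OUTPUT_DIR",
                        required=True, help="directory for the output tables")
    return parser


def output_paths(output_dir):
    """Create output_dir if it is missing and return the three table paths."""
    os.makedirs(output_dir, exist_ok=True)
    return {name: os.path.join(output_dir, name + ".tsv")
            for name in ("process", "function", "component")}
```

With this sketch, passing both -g and -i (or neither) makes argparse print a usage error and exit, which matches the "not both" rule.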
Each output table consists of the following:
Line 1: A web link to the results on GOrilla's servers, which remains valid for about 1 month (the linked page also contains more information about the results)
Line 2: The total number of genes tested by GOrilla, N
Line 3: Header of the results table
Lines 4-n: Rows of the results table, in decreasing order of significance
The following is a description of each column:
- term: The GO-term ID (e.g. GO0051302)
- desc: The description of the GO-term (e.g. regulation of cell division)
- pval: p-value of the term enrichment, uncorrected for multiple hypothesis testing
- fdr: FDR of the term enrichment, corrected for multiple hypothesis testing
- enrichment: Enrichment value, calculated as (b/n) / (B/N)
- B: The total number of genes associated with this term, B
- n: The number of genes that appear at the top of the ranking (see the citation below for more details), n
- b: The intersection of genes that appear at the top of the ranking and the genes associated with the term, b
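For downstream analysis, an output table can be read back with a few lines of Python. The sketch below is not part of the script; it assumes the files are tab-delimited (as the .tsv extension suggests), that N is the first integer on line 2, and that the header on line 3 uses the column names listed above. The names `read_gorilla_table` and `enrichment` are hypothetical.

```python
import csv
import re


def read_gorilla_table(path):
    """Parse one output table into (link, N, rows)."""
    with open(path, newline="") as handle:
        link = handle.readline().strip()   # Line 1: link to the results
        total_line = handle.readline()     # Line 2: total genes tested, N
        # Assumption: N is the first integer that appears on line 2.
        total = int(re.search(r"\d+", total_line).group())
        # Line 3 is consumed as the header; remaining lines become dict rows.
        rows = list(csv.DictReader(handle, delimiter="\t"))
    return link, total, rows


def enrichment(b, n, B, N):
    """Enrichment value as defined in the column list: (b/n) / (B/N)."""
    return (b / n) / (B / N)
```

Each row is a dictionary keyed by the header names, with all values as strings, so counts such as B, n, and b need an int() conversion before being passed to `enrichment`.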
Eran Eden*, Roy Navon*, Israel Steinfeld, Doron Lipson and Zohar Yakhini. "GOrilla: A Tool For Discovery And Visualization of Enriched GO Terms in Ranked Gene Lists", BMC Bioinformatics 2009, 10:48.