For the GSEApy ssGSEA module, why should all genes in the gene sets be present in the gene expression table? #203
The GSEApy documentation for ssGSEA (Section 2.5) states, "Note: When you run ssGSEA, all genes names in your gene_sets file should be found in your expression table". Why? I don't understand why this would be necessary: surely the enrichment scores can just be calculated from the genes that are present.

In general, whenever I want to run ssGSEA, some genes in the gene sets will be missing from the data. This often happens as a result of removing genes that have expression values of 0 in all samples. Theoretically, too many genes having tied ranks due to 0 expression is more problematic for ssGSEA than missing genes. Also, when I combine data from different batches / data sources, I may use the intersection (rather than union) of genes expressed in each batch / data source to avoid some batch effects, which also results in missing genes.

I would like to understand why GSEApy wants all genes to be present. Mathematically, there is no reason for this, but I struggle to follow the Rust code for the ss_gsea function to see if there's a programmatic implementation reason.
Replies: 1 comment 1 reply
Hi @joshscurll, sorry for the late reply.

You are right. Mathematically, there is no requirement to input all expressed genes. It is not a programmatic implementation reason either. You can use whatever gene list you think is reasonable for the calculation.

What I meant is that all genes in the gene_sets file (GMT) would ideally be found in the gene expression table. Internally, a missing gene is simply thrown away. The downside is that some interesting genes get dropped because they are not found in the gene expression table.

Another reason to include all expressed genes is to "enrich" the given gene set against a reasonable background data distribution for the calculation. But again, it's not mandatory.

Hope this helps. I will update the docs to make this clearer.

Zhuoqing
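Since missing genes are dropped silently, it may be worth checking beforehand which gene set members are absent from the expression table. The following is only a minimal sketch in plain Python, not GSEApy's internal logic: the helper `split_by_presence`, the toy gene sets, and the toy gene index are all hypothetical, and it assumes gene identifiers match by exact string (same symbol type and case).

```python
from typing import Dict, Iterable, List, Tuple


def split_by_presence(
    gene_sets: Dict[str, List[str]],
    expressed_genes: Iterable[str],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split each gene set into genes present in / missing from the table."""
    expressed = set(expressed_genes)  # set lookup keeps membership tests fast
    present: Dict[str, List[str]] = {}
    missing: Dict[str, List[str]] = {}
    for name, genes in gene_sets.items():
        present[name] = [g for g in genes if g in expressed]
        missing[name] = [g for g in genes if g not in expressed]
    return present, missing


# Hypothetical toy inputs: a parsed GMT file and the row index
# (gene names) of an expression table.
gene_sets = {
    "SET_A": ["TP53", "BRCA1", "MYC"],
    "SET_B": ["EGFR", "KRAS"],
}
expression_index = ["TP53", "MYC", "EGFR", "KRAS", "GAPDH"]

present, missing = split_by_presence(gene_sets, expression_index)
for name, genes in gene_sets.items():
    # Fraction of the original gene set that will actually contribute
    coverage = len(present[name]) / len(genes) if genes else 0.0
    print(f"{name}: {coverage:.0%} covered, missing {missing[name]}")
```

A report like this makes the trade-off visible: a set that keeps most of its members is probably fine to score, while one that loses most of them may no longer represent the biology it was curated for.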