For the GSEApy ssGSEA module, why should all genes in the gene sets be present in the gene expression table? #203

joshscurll · 2023-04-27T21:54:57Z

The GSEApy documentation for ssGSEA (Section 2.5) states, "Note: When you run ssGSEA, all genes names in your gene_sets file should be found in your expression table". Why? I don't understand why this would be necessary -- surely the enrichment scores can just be calculated from the genes that are present.

In general, whenever I want to run ssGSEA, some genes in the gene sets will be missing from the data. This often happens as a result of removing genes that have expression values of 0 in all samples. Theoretically, too many genes having tied ranks due to 0 expression is more problematic for ssGSEA than missing genes. Also, when I combine data from different batches / data sources, I may use the intersection (rather than union) of genes expressed in each batch / data source to avoid some batch effects -- this also results in missing genes. I would like to understand why GSEApy wants all genes to be present. Mathematically, there is no reason for this, but I struggle to follow the Rust code for the ss_gsea function to see if there's a programmatic implementation reason.
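To make the two filtering steps concrete, here is a minimal pure-Python sketch of how gene-set genes end up missing from the expression table. The gene names, values, and helper functions (`drop_all_zero`, `intersect_genes`) are made up for illustration and are not part of GSEApy; in practice the tables would typically be pandas DataFrames.

```python
# Toy expression tables (gene -> per-sample values); all names and values are hypothetical.
batch_a = {"TP53": [5.0, 3.2], "MYC": [0.0, 0.0], "EGFR": [1.1, 0.0]}
batch_b = {"TP53": [4.1], "MYC": [2.0], "KRAS": [0.7]}


def drop_all_zero(table):
    """Remove genes whose expression is 0 in every sample."""
    return {gene: values for gene, values in table.items()
            if any(v != 0 for v in values)}


def intersect_genes(*tables):
    """Keep only genes present in every table, concatenating their samples."""
    shared = set.intersection(*(set(t) for t in tables))
    return {gene: [v for t in tables for v in t[gene]] for gene in sorted(shared)}


filtered_a = drop_all_zero(batch_a)              # MYC is all-zero in batch A, so it is removed
combined = intersect_genes(filtered_a, batch_b)  # only genes found in both batches survive

# A hypothetical gene set: some of its genes are no longer in the combined table.
gene_set = {"TP53", "MYC", "KRAS"}
missing = sorted(gene_set - set(combined))
print("genes in combined table:", list(combined))
print("gene-set genes missing from table:", missing)
```

Here both filtering steps remove genes that belong to the gene set, which is exactly the situation the question is about.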

Answered by zqfang

zqfang (Maintainer) · 2023-05-04T17:33:17Z

Hi @joshscurll, sorry for the late reply.

You are right. Mathematically, there is no requirement to input all expressed genes. There is no programmatic implementation reason either.

You can use whatever gene list you think is reasonable for the calculation.

What I meant is that, ideally, all genes in the gene_sets file (GMT) are found in the gene expression table. (Internally, a missing gene will be thrown away. However, this means some interesting genes will be dropped because they are not found in the gene expression table.)
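As an illustration of what "thrown away" means in practice, here is a minimal sketch that splits each gene set into the genes that would be kept and the genes that would be dropped for a given expression table. This is not GSEApy's internal code; the function `split_gene_sets`, the gene set names, and the gene lists are all hypothetical. Running a check like this before ssGSEA shows which genes of interest will not contribute to the scores.

```python
def split_gene_sets(gene_sets, expressed_genes):
    """Split each gene set into genes found / not found in the expression table.

    Hypothetical helper for illustration only; it mimics the described
    behaviour (missing genes are discarded), not GSEApy's implementation.
    """
    expressed = set(expressed_genes)
    kept, dropped = {}, {}
    for name, genes in gene_sets.items():
        kept[name] = [g for g in genes if g in expressed]
        dropped[name] = [g for g in genes if g not in expressed]
    return kept, dropped


# Made-up gene sets (as they might be read from a GMT file) and expression table index.
gene_sets = {"SET_A": ["TP53", "MYC", "KRAS"], "SET_B": ["EGFR", "BRAF"]}
expressed_genes = ["TP53", "KRAS", "EGFR", "GAPDH"]

kept, dropped = split_gene_sets(gene_sets, expressed_genes)
for name in gene_sets:
    print(name, "kept:", kept[name], "dropped:", dropped[name])
```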

Another reason to include all expressed genes is to "enrich" the given gene_set with a reasonable background data distribution for calculation.

But again, it's not mandatory.

Hope it helps.

I will update the docs to make this clearer.

Zhuoqing

1 reply

joshscurll May 4, 2023
Author

Thanks for this reply!

"internally, a missing gene will be thrown away."

That is the clarification that I needed and is precisely how I hoped genes in the gene sets that were missing in the expression table were handled.
