Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weights for Samples of LinDA Modelling #59

Open
DarioS opened this issue Jun 30, 2024 · 2 comments
Open

Weights for Samples of LinDA Modelling #59

DarioS opened this issue Jun 30, 2024 · 2 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@DarioS
Copy link

DarioS commented Jun 30, 2024

I notice that the top-ranked bacteria are sometimes caused by outliers of linda analysis. This happens when some samples have five species and other species have fifty species detected, for instance. I would like linda to have an option for weighting of samples inversely to the number of species detected for each sample. lm has weights parameter. linda should also support weights.

@DarioS DarioS added the enhancement New feature or request label Jun 30, 2024
@cafferychen777
Copy link
Owner

Dear Dario Strbenac,

Thank you for your feedback regarding the LinDA analysis in MicrobiomeStat. I appreciate you bringing this to our attention. To ensure we address your concern effectively, could you please provide some additional clarification?

  1. Could you elaborate on the specific scenarios where you've observed this issue with outliers affecting the top-ranked bacteria?

  2. Do you have any examples or sample datasets that demonstrate this problem? This would be incredibly helpful for us to understand and replicate the issue.

  3. Regarding your suggestion for weighting samples inversely to the number of species detected, could you explain in more detail how you envision this working? Are there any specific parameters or methods you had in mind?

  4. Have you encountered similar weighting options in other tools or packages that you think could serve as a good model for implementing this in LinDA?

Your insights will greatly help us improve MicrobiomeStat and ensure it meets the needs of our users. Thank you for your time and contribution to the project.

Best regards,
Caffery Yang

@DarioS
Copy link
Author

DarioS commented Jun 30, 2024

Oops, inverse weights would not make sense. Standard weights would. We noticed the issue when we identified the same species of bacteria but opposite fold change to published by another team who also did qPCR validation. The comparison was healthy volunteers versus cancer-adjacent normal tissue. Note that two Healthy group samples have high proportions.

> healthy[1, ] # Just the ten healthy volunteer samples subset of the whole matrix.
                        Oral_1-N Oral_2-N Oral_3-N Oral_4-N Oral_5-N Oral_6-N Oral_7-N Oral_8-N Oral_9-N Oral_10-N
Cutibacterium acnes            0        0    0.113      0.1     0.69    0.025     0.71    0.093        0     0.000

The high proportion is an artefact of only three species detected for samples Oral_5-N and Oral_7-N.

> colSums(healthy > 0)
 Oral_1-N  Oral_2-N  Oral_3-N  Oral_4-N  Oral_5-N  Oral_6-N  Oral_7-N  Oral_8-N  Oral_9-N Oral_10-N 
        2         0        14        14         3         5         3         7         6        15

Code to reproduce the Cutibacterium acnes contradictory result is:

library(MicrobiomeStat)
load("testLindaRobustness.RData")
normalsCancerFit <- linda(bacteriaMatrixNormals, clinicalTableNormals, "~ Age + Smoking + Gender + `Tissue Type`",
                          "proportion")
results <- normalsCancerFit[["output"]][["`Tissue Type`Normal"]]
results["Cutibacterium acnes", c("log2FoldChange", "lfcSE", "stat", "pvalue", "padj")]

                    log2FoldChange    lfcSE      stat       pvalue         padj
Cutibacterium acnes      -8.123449 1.859128 -4.369493 5.445916e-05 0.0009802648

If I plot the proportions, I see that there are two outliers in Healthy group and one outlier in Normal group.

library(ggplot2)
theme_set(theme_bw())
plotData <- data.frame(sampleID = colnames(bacteriaMatrixNormals),
                       proportion = bacteriaMatrixNormals["Cutibacterium acnes", ],
                       group = clinicalTableNormals[, "Condition"])
ggplot(plotData, aes(x = sampleID, y = proportion)) + geom_bar(stat = "identity") +
       facet_grid(cols = vars(group), scales = "free_x", space = "free_x") +
       geom_hline(yintercept = 1.00, linetype = "dashed", colour = "red") + 
       theme(axis.text.x = element_text(angle = 90)) + ggtitle("Cutibacterium acnes Proportion")

image

In statistics, it is popular to down-weight samples instead of discarding them.

Weighted regression is defined as “a generalization of linear regression where the covariance matrix of errors is incorporated in the model”. In simple terms, this means that not all samples are equal in the eyes of the data scientist, and this inequality should also reflect in the fitted model as well.

So, ideally, I would like to be able to do something like:

myWeights <- colSums(proportionsMatrix > 0)
sampleInfo$myWeights <- myWeights
linda(proportionsMatrix, sampleInfo, "~ status","proportion", weights = "myWeights")

Winsorisation makes almost no difference. I tried TRUE and FALSE. Would weights interfere with bias adjustment, though?

@cafferychen777 cafferychen777 added the good first issue Good for newcomers label Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants