Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gene dropout from modules and errors for alternative infer_grn methods #66

Open
simpakkk opened this issue Jul 29, 2024 · 1 comment
Open

Comments

@simpakkk
Copy link

simpakkk commented Jul 29, 2024

Hi @joschif !

Thank you for this helpful package :) I have a couple of questions!

To start, I will just say - I have already tested a lot of parameters in my multiome dataset and can use the default glm method without issue. I have been constraining my peaks to PhastCons 'Vert 35 El' + ENCODE CREs + any additional peak-gene linkages in my own dataset (I have also tried running versions without constraints). I previously tried using only my own peak-gene linkages, but this returned very small networks that were not useful, so now I am using all conserved regulatory regions again. This gives me a similar ballpark to the number of regions specified in the vignette for the test dataset (~140,000). I have been constraining my gene set to include only genes that are in Variable Features I am using for my analysis (~3000 genes). So I don't think any of the issues I am having below can be overcome by constraining anything further, as discussed in a few different GitHub posts. For reference as well, I am using a mouse dataset (so I am using mouse-compatible PhastCons, PWMs, etc. as well as the Signac peak-to-gene linkage method, as I believe there is still an issue with the GREAT method here).

  1. I have been using the default glm model to try to generate a network and predict modules in my dataset. I find that certain genes I know to be biologically truthfully correlated (via experimental KO validation; ie genes whose expression is lost when a TF is knocked out) come up as having relatively 'strong' correlations in my dataset (e.g. corr >0.15) with adjusted p values < 0.05 within the 'coef' table. But then when I move to the next step and find modules, these genes and correlations often do not make it into the module for that TF, even though the TF itself does appear as a module, and I am using what I would think would be low enough thresholds for those genes to be 'kept' (e.g. R sq > 0.05, p val < 0.05 cutoffs). Can you explain how modules are chosen based on the output of 'Infer GRN' and why genes that seem to have a relatively high correlation and p value below the cutoff I've specified in the function don't get included as part of a TF's module? Perhaps I am misunderstanding how those cutoffs are applied?

  2. Given that I am losing some key genes that come up in the coef tables from my TF modules, I wondered if this might change if I used a different method in infer grn. I tried to run GREAT instead of Signac for peak-gene links, and it was still erroring for me, presumably because of the mouse-to-human issue described here I also tried running the Bagging Ridge method with Signac peak method, which I was interested to try as I noticed the p values are much smaller in the vignette when this method is used. However, I couldn't get this working for me - I am using reticulate and have my python environment set up such that scikit-learn does load and is called successfully, but I get an error that 'fitting model failed for x gene' for every single gene in my dataset one-by-one when running the code (it prints like 3000 lines that fitting model has failed for each gene as it runs through, not the error message 'fitting model failed for all genes'). Inspecting the traceback() showed that this might also be related to human vs. mouse incompatibility? Here is the specific error I received:

12: makeRestartList(...)
11: withRestarts({
    signalCondition(cond)
    defaultHandler(cond)
  }, muffleMessage = function() NULL)
10: message(paste0(...))
9: log_message("Warning: Fitting model failed for ", g, verbose = verbose)
8: FUN(X[[i]], ...)
7: lapply(X[Split[[i]]], FUN, ...)
6: pbapply::pblapply(X = x, FUN = fun)
5: map_par(features, function(g) {
    if (!g %in% rownames(peaks2gene)) {
      log_message("Warning: ", g, " not found in EnsDb", verbose = verbose == 
        2)
      return()
    }
    gene_peaks <- as.logical(peaks2gene[g, ])
    if (sum(gene_peaks) == 0) {
      log_message("Warning: No peaks found near ", g, verbose = verbose == 
        2)
      return()
    }
    g_x <- gene_data[gene_groups, g, drop = FALSE]
    peak_x <- peak_data[peak_groups, gene_peaks, drop = FALSE]
    peak_g_cor <- as(sparse_cor(peak_x, g_x), "generalMatrix")
    peak_g_cor[[is.na](http://is.na/)(peak_g_cor)] <- 0
    peaks_use <- rownames(peak_g_cor)[abs(peak_g_cor[, 1]) > 
      peak_cor]
    if (length(peaks_use) == 0) {
      log_message("Warning: No correlating peaks found for ", 
        g, verbose = verbose == 2)
      return()
    }
    peak_x <- peak_x[, peaks_use, drop = FALSE]
    peak_motifs <- peaks2motif[gene_peaks, , drop = FALSE][peaks_use, 
      , drop = FALSE]
    gene_peak_tfs <- map(rownames(peak_motifs), function(p) {
      x <- as.logical(peak_motifs[p, ])
      peak_tfs <- colMaxs(motif2tf[x, , drop = FALSE])
      peak_tfs <- colnames(motif2tf)[as.logical(peak_tfs)]
      peak_tfs <- setdiff(peak_tfs, g)
      return(peak_tfs)
    })
    names(gene_peak_tfs) <- rownames(peak_motifs)
    gene_tfs <- purrr::reduce(gene_peak_tfs, union)
    tf_x <- gene_data[gene_groups, gene_tfs, drop = FALSE]
    tf_g_cor <- as(sparse_cor(tf_x, g_x), "generalMatrix")
    tf_g_cor[[is.na](http://is.na/)(tf_g_cor)] <- 0
    tfs_use <- rownames(tf_g_cor)[abs(tf_g_cor[, 1]) > tf_cor]
    if (length(tfs_use) == 0) {
      log_message("Warning: No correlating TFs found for ", 
        g, verbose = verbose == 2)
      return()
    }
    tf_g_corr_df <- as_tibble(tf_g_cor[unique(tfs_use), , drop = F], 
      rownames = "tf", .name_repair = "check_unique") %>% rename(tf = 1, 
      corr = 2)
    frml_string <- map(names(gene_peak_tfs), function(p) {
      peak_tfs <- gene_peak_tfs[[p]]
      peak_tfs <- peak_tfs[peak_tfs %in% tfs_use]
      if (length(peak_tfs) == 0) {
        return()
      }
      peak_name <- str_replace_all(p, "-", "_")
      tf_name <- str_replace_all(peak_tfs, "-", "_")
      formula_str <- paste(paste(peak_name, interaction_term, 
        tf_name, sep = " "), collapse = " + ")
      return(list(tfs = peak_tfs, frml = formula_str))
    })
    frml_string <- frml_string[!map_lgl(frml_string, is.null)]
    if (length(frml_string) == 0) {
      log_message("Warning: No valid peak:TF pairs found for ", 
        g, verbose = verbose == 2)
      return()
    }
    target <- str_replace_all(g, "-", "_")
    model_frml <- as.formula(paste0(target, " ~ ", paste0(map(frml_string, 
      function(x) x$frml), collapse = " + ")))
    nfeats <- sum(map_dbl(frml_string, function(x) length(x$tfs)))
    gene_tfs <- purrr::reduce(map(frml_string, function(x) x$tfs), 
      union)
    gene_x <- gene_data[gene_groups, union(g, gene_tfs), drop = FALSE]
    model_mat <- as.data.frame(cbind(gene_x, peak_x))
    if (scale) 
      model_mat <- as.data.frame(scale(as.matrix(model_mat)))
    colnames(model_mat) <- str_replace_all(colnames(model_mat), 
      "-", "_")
    log_message("Fitting model with ", nfeats, " variables for ", 
      g, verbose = verbose == 2)
    result <- try(fit_model(model_frml, data = model_mat, method = method, 
      ...), silent = TRUE)
    if (any(class(result) == "try-error")) {
      log_message("Warning: Fitting model failed for ", g, 
        verbose = verbose)
      log_message(result, verbose = verbose == 2)
      return()
    }
    else {
      result$gof$nvariables <- nfeats
      result$corr <- tf_g_corr_df
      return(result)
    }
  }, verbose = verbose, parallel = parallel)
4: fit_grn_models.GRNData(object = object, genes = genes, network_name = network_name, 
    peak_to_gene_method = peak_to_gene_method, upstream = upstream, 
    downstream = downstream, extend = extend, only_tss = only_tss, 
    parallel = parallel, tf_cor = tf_cor, peak_cor = peak_cor, 
    aggregate_rna_col = aggregate_rna_col, aggregate_peaks_col = aggregate_peaks_col, 
    method = method, alpha = alpha, family = family, interaction_term = interaction_term, 
    adjust_method = adjust_method, scale = scale, verbose = verbose, 
    ...)
3: fit_grn_models(object = object, genes = genes, network_name = network_name, 
    peak_to_gene_method = peak_to_gene_method, upstream = upstream, 
    downstream = downstream, extend = extend, only_tss = only_tss, 
    parallel = parallel, tf_cor = tf_cor, peak_cor = peak_cor, 
    aggregate_rna_col = aggregate_rna_col, aggregate_peaks_col = aggregate_peaks_col, 
    method = method, alpha = alpha, family = family, interaction_term = interaction_term, 
    adjust_method = adjust_method, scale = scale, verbose = verbose, 
    ...)
2: infer_grn.GRNData(whole_so_use, model = "bagging_ridge", peak_to_gene_method = "Signac", 
    genes = expanded_vf_no_hs, upstream = 1e+05, downstream = 1e+05, 
    parallel = F, tf_cor = 0.05)
1: infer_grn(whole_so_use, model = "bagging_ridge", peak_to_gene_method = "Signac", 
    genes = expanded_vf_no_hs, upstream = 1e+05, downstream = 1e+05, 
    parallel = F, tf_cor = 0.05)

Is this because of human-to-mouse genome incompatibility, or potentially some other issue? And do you have any other suggestions on how I could try to capture more of the gene-TF pairs in my network within predicted modules for downstream analysis?

Thank you so much for your help with this and for providing this tool!

@joschif
Copy link
Collaborator

joschif commented Aug 5, 2024

Hi @simpakkk,

  1. The p-values that find_modules is based on are derived from linear models with often very many terms, so it's hard to tell why some things are not significant. One guess might be that if there is a strong correlation on the GEX level, then it could be because of noise or inconsistent signal for the accessibility of the binding peaks. In that case you could try to use aggregate peaks in to highres clusters (aggregate_peaks_col) to make the accessibility signal more robust.

  2. I think the GREAT method is still only supported for human, but the error in the Bagging Ridge I would guess is due to something different. Not sure what though, maybe try verbose = 2 to print lower level errors.

I hope this helps!

Best,
Jonas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants