-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
getXrefList(...) method is broken #30
Comments
Hi @OliveIsLazy, we're running into a few issues ourselves. Thank you for our report. I hope we can fix things as soon as possible. |
@OliveIsLazy as an alternative, you can download the mouse pathway collection in GMT format which has all the entrez gene ids for every approved mouse pathway. should also be a lot faster. library(rWikiPathways)
wp.mm <- rWikiPathways::downloadPathwayArchive(date="20230210", organism="Mus musculus", format="gmt")
wp2gene <- readPathwayGMT(wp.mm) |
@mkutmon This method has helped! The dataframe had a lot of the information I needed. None of the pathways have no genes in them, though pathway WP4627 (" Lipids measured in liver metastasis from breast cancer") has only one gene it in. Here is the code block I used in case anyone else is interested. #' Generate a list of lists, where each sublist is a WikiPathway genesets.
#' The sublist names correspond to the genesets' name.
#' The sublist values correspond to the genes in the geneset.
#' bug report: https://github.com/wikipathways/rWikiPathways/issues/30#issuecomment-1460013132
#' #' DATE: 08-03-2023
get_wikipathways_two <- function() {
# Get metadata.
wp.mm <- rWikiPathways::downloadPathwayArchive(date="20230210", organism="Mus musculus", format="gmt")
wp2gene <- readPathwayGMT(wp.mm)
# Set up key variables.
pwid <- sort(unique(wp2gene$wpid))
gs <- list()
# For each pathway id, find the rows with a matching ID and save the associated genes in a list
for (i in pwid)
{
gs[[length(gs)+1]] <- wp2gene[(wp2gene$wpid==i), "gene"]
# Acts as a fail-safe in case there's an issue with the id-pathwayname mapping.
names(gs)[length(gs)] <- unique(wp2gene[(wp2gene$wpid==i), "name"])
}
return(gs)
} By the way, I checked the number of separate genesets in the function you sent me and it totals to 199, which is less than the amount I recorded from December 14th 2022, 238. And indeed, the function I posted initially still generate 238 genesets. I checked and the 199 current pathways appear to all be a subset of the previous 238. Do you know what happened to these missing pathways or how I could get access to them? On that note, I'd also like to know if any of the pathway definitions (aka the genes in them) has changed since this date, if that's something you're able to tell me. Thank you very much for your help and your quick response, it is much appreciated. |
Ah, the GMT file only contains approved pathways. Pathways under development and not finished pathways are not included in our monthly releases. I will double-check if that is indeed the case for those 39 mouse pathways. Hope that helps! |
I tried messing around a little and I haven't found any changes Dec 2022 and Feb 2023, but I really appreciate being able to confirm that! This next part doesn't really relate to the coding backend, so I apologise if it's a bit rude, but I went and checked for which pathways I was missing for my analysis I'm in need of these five pathways:
I hope to hear back soon on whether or not their gene lists can be made available again. |
WP2081 --> not Approved; Homology Converted
WP1271 --> Not Approved; Homology Converted
WP3415 --> Not Approved
WP4491 --> Not Approved
WP1396 --> Approved, so this one should be available in the GMT download @mkutmon ...? I checked, and the GMT release of Feb. 2023 includes this PW ID, so not sure why it's not available in R? And, how can someone make a GMT file for unapproved PWs? @mkutmon @egonw ? Maybe we should make that option available in the future? |
WP1396 is also in the R table. |
Agreed. GMTs serve a specific purpose for enrichment analysis. We really don't want to make or share GMTs with unapproved content. |
I can confirm that WP1396 is in the R table. I checked using the pathway names where the December 2022 version is called "GPCRs, non-odorant" and the February 2023 version is of the same name except with a space added at the end, hence the mismatch. My apologies for the confusion. |
Okay, we worked hard to solve a number of issues, including fixes in the BridgeDb Webservice, the ID mappings databases, uptime of the data.wikipathways.org server, and the indexer. I guess one of the last two was underlying your observation. @OliveIsLazy, can you confirm your original code is working again? I get: > getXrefList("WP3415", "L")
[1] "100038882" "100039796" "100702" "106759" "110454" "12675"
[7] "142980" "15945" "15957" "15959" "15962" "15975"
[13] "15977" "15979" "16150" "16151" "17067" "17858"
[19] "18033" "18035" "19765" "20304" "20558" "20684"
[25] "21353" "21822" "22031" "22034" "223672" "228607"
[31] "230073" "231655" "23960" "23961" "23962" "240327"
[37] "240328" "24110" "246727" "246728" "246729" "246730"
[43] "26409" "270151" "320377" "327959" "434341" "54123"
[49] "54131" "55932" "56066" "56417" "56480" "56489"
[55] "58185" "58203" "60440" "620913" "623121" "626578"
[61] "66513" "66724" "667370" "667373" "67775" "68652"
[67] "71586" "71898" "74498" "80861" "99899" But I cannot say the results are very stable. It seems that ID mapping does not take place: > getXrefList("WP2338", "En")
[1] "ENSG00000100697" "ENSG00000113360" "ENSG00000124571" "ENSG00000128191"
> getXrefList("WP2338", "L")
[1] "5901" The returned Ensembl and NCBI identifiers here are only those that are explicitly in the pathways and none of the mapped identifiers. We see this problem also on BioConductor: https://bioconductor.org/checkResults/release/bioc-LATEST/rWikiPathways/lconway-checksrc.html Basically, it seem the reboot of the Indexer solved some problems, but not all. The Indexer is not using the latest ID mapping file, which we hope to get from @tabbassidaloii soon. |
I created an issue in the indexer repo: wikipathways/lucene-indexer#4. It has nothing to do with the new mapping databases so I need more time to actually look at the code and check what is happening and why the ID mapping doesn't work anymore. |
I ran the code I originally posted, and all the genesets that I needed to match did based on the name (except the "GPCRs, non-odorant" but only due to the aforementioned naming difference). It generated 238 pathways, as it did in December. I tried checking the code again against another older file I had that saved 195 of the genesets' WikiPathway IDs, and all 195 genesets were accounted for. As @egonw pointed out, there are issues with the ID mapping. 71 of the genesets have either no or one EntrezIDs in them. Here's two examples:
Thank you all very much for helping me, I really appreciate the hard work. |
@OliveIsLazy ; I'm assuming you are running your code for the mouse species, right?
@egonw and @mkutmon : seems like the mapping to other databases is (still) not happening.
When I try out the batch mapping, I get no matches:
The same happens when I use the batch mapping, but just input 1 ID. |
Yes, all the code I've ran has been only for the mice pathways. |
For a few months, I've struggled with getting this method to work (first issues relating to getting geneset names that later resolved itself, and now issues with getting the actual genes in the genesets), which is strange because the issues started after the last commit.
The code I use is:
This code used to work for me, but now it produces incomplete returns like this (I only copied the first few outputs):
I was wondering if it had to do with calling bridgedb's datasources repo, as their datasources.tsv file has been moved from the path described in location in getXrefList(...)'s help page.
Please tell me if you need more information!
The text was updated successfully, but these errors were encountered: