parse_pubmed_xml erases sections/header labels in PMC abstracts #170
Comments
…aragraph breaks; but still bad because rentrez deletes section headers, filed bug: ropensci/rentrez#170
Hi @gwern, ... I can look into whether the ...
library(XML)
parsed_XML <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml", parsed=TRUE)
labels <- sapply(parsed_XML["//Abstract/AbstractText"], xmlGetAttr, "Label")
text <- sapply(parsed_XML["//Abstract/AbstractText"], xmlValue)
Thanks for the XML code. That works well for me, as I can combine it with the text nicely without too much formatting code: ...
library(XML)
library(tools)
parsed_XML <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml", parsed=TRUE)
labels <- sapply(parsed_XML["//Abstract/AbstractText"], xmlGetAttr, "Label")
labelsFormatted <- sapply(tolower(labels),
function(s) { paste0("<strong>", toTitleCase(s), "</strong>"); })
text <- sapply(parsed_XML["//Abstract/AbstractText"], xmlValue)
combined <- paste0(paste0(labelsFormatted, rep(": ", length(labelsFormatted)), text), collapse="\n\n")
...
abstract <- { if (length(labels) > 1) { combined; } else { fulltext$abstract; } }

which gives me the results I need.

You may not have noticed the section monkey business if you don't work with abstracts much, or don't check them against the PMC version, but sectioned abstracts seem to be reasonably common. (I had vaguely noticed the issue before but had put off thinking about it until a reader complained about how unreadable solid blocks of text were for some PMC links.) Watching my rebuild, there were at least 89 PMC links on gwern.net affected by the section omission, so those link annotations will be much more readable now.
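For anyone reusing the snippet above, here is a minimal sketch of the same label-plus-text merge wrapped in a function, with a fallback for sections that carry no label. The helper name `format_abstract` and the convention of marking unlabeled sections with `NA` are my own assumptions, not anything rentrez provides, and the input strings are placeholders.

```r
library(tools)  # for toTitleCase()

# Hypothetical helper (not part of rentrez): prefix each section with its label.
# `labels` and `text` are parallel character vectors; NA or "" means "no label".
format_abstract <- function(labels, text) {
  stopifnot(length(labels) == length(text))
  has_label <- !is.na(labels) & nzchar(labels)
  prefix <- character(length(text))  # empty prefix for unlabeled sections
  if (any(has_label)) {
    prefix[has_label] <- paste0("<strong>",
                                toTitleCase(tolower(labels[has_label])),
                                "</strong>: ")
  }
  # Sections are separated by a blank line, as in the snippet above.
  paste0(prefix, text, collapse = "\n\n")
}

# Placeholder input, not taken from a real abstract.
cat(format_abstract(c("SETTING", NA), c("First section.", "Second section.")))
```

Unlike the `length(labels) > 1` check above, this handles each section separately, so an abstract with only some sections labeled still comes out sensibly.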
When extracting abstracts from PubMed, the section headers/labels are erased entirely: they are neither present anywhere in the resulting objects nor inlined in the text. This makes abstracts substantially harder to read. They should be incorporated somehow (perhaps inlined as <h3>$Label</h3>, or as a separate object field which can be combined with the abstract text fields to reconstruct the original).

An example of this using a modafinil paper: the PMC abstract is fully sectionized, with section labels in <h3> on the website, and the semantics are present in the raw XML as elements like <AbstractText Label="SETTING" NlmCategory="METHODS"> (where the Label is what appears as "Setting"), but the rentrez object after parse_pubmed_xml is merely a list of strings, with the labels stripped away. Inspecting the object, I can't find them anywhere in it, and in the rentrez XML code it looks like they are just dropped (only querying for \\AbstractText or whatever): ...

That is, the abstract elements 1:7 are missing their corresponding c("Study Objectives", "Design", "Setting", "Participants", "Interventions", "Measurements and Results", "Conclusions") labels. The result then looks like: ...

Not good.
I didn't find anything in the docs or on Google about rentrez having alternative ways to parse the PMC XML which I am supposed to be using here; it seems to be parse_pubmed_xml or bust.
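To make the Label point concrete without needing network access, here is a small sketch that reads labels and text out of an XML fragment shaped like the AbstractText elements described above. The fragment is made up for illustration (it is not a real PubMed record), and it uses the XML package directly instead of anything in rentrez.

```r
library(XML)

# Made-up fragment in the shape described above; not a real PubMed record.
xml_text <- '<PubmedArticle><Abstract>
  <AbstractText Label="DESIGN" NlmCategory="METHODS">Placeholder design text.</AbstractText>
  <AbstractText Label="SETTING" NlmCategory="METHODS">Placeholder setting text.</AbstractText>
  <AbstractText>Placeholder unlabeled text.</AbstractText>
</Abstract></PubmedArticle>'

doc <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//Abstract/AbstractText")

# xmlGetAttr() returns `default` when the attribute is absent, so unlabeled
# sections come back as NA instead of NULL and the result stays a plain vector.
labels <- vapply(nodes,
                 function(n) xmlGetAttr(n, "Label", default = NA_character_),
                 character(1))
text <- vapply(nodes, xmlValue, character(1))

print(data.frame(label = labels, text = text))
```

Keeping labels and text as two parallel vectors is roughly the "separate object field" option suggested above: the caller can then decide how (or whether) to inline the labels.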