parse_pubmed_xml erases sections/header labels in PMC abstracts #170

Open
gwern opened this issue Apr 11, 2021 · 2 comments


gwern commented Apr 11, 2021

When extracting abstracts from PubMed, the section headers/labels are erased entirely: they are neither inlined in the text nor present anywhere in the resulting object. This makes abstracts substantially harder to read. They should be incorporated somehow (perhaps inlined as <h3>$Label</h3>, or stored in a separate field that can be combined with the abstract text fields to reconstruct the original).

An example, using a modafinil paper: the PMC abstract is fully sectionized, with section labels in <h3> on the website, and the semantics are present in the raw XML as elements like <AbstractText Label="SETTING" NlmCategory="METHODS"> (where the Label is what appears as "Setting"). But the abstract in the object returned by parse_pubmed_xml is merely a vector of strings, with the labels stripped away. Inspecting the object, I can't find them anywhere in it, and the rentrez XML code looks like it just drops them (it queries only for //AbstractText or similar):

library(fulltext)
library(rentrez)
library(pubchunks)

pmcidSearch <- "PMC2910532"
paper    <- entrez_search(db="pubmed", term=pmcidSearch)
rawXML   <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml")
fulltext <- parse_pubmed_xml(rawXML)
abstract <- fulltext$abstract

abstract
# [1] "Modafinil may promote wakefulness by increasing cerebral dopaminergic neurotransmission, which importantly depends on activity of catechol-O-methyltransferase (COMT) in prefrontal cortex. The effects of modafinil on sleep homeostasis in humans are unknown. Employing a novel sleep-pharmacogenetic approach, we investigated the interaction of modafinil with sleep deprivation to study dopaminergic mechanisms of sleep homeostasis."
# [2] "Placebo-controlled, double-blind, randomized crossover study."
# [3] "Sleep laboratory in temporal isolation unit."
# [4] "22 healthy young men (23.4 +/- 0.5 years) prospectively enrolled based on genotype of the functional Val158Met polymorphism of COMT(10 Val/Val and 12 Met/Met homozygotes)."
# [5] "2 x 100 mg modafinil and placebo administered at 11 and 23 hours during 40 hours prolonged wakefulness."
# [6] "Subjective sleepiness and EEG markers of sleep homeostasis in wakefulness and sleep were equally affected by sleep deprivation in Val/Val and Met/Met allele carriers (placebo condition). Modafinil attenuated the evolution of sleepiness and EEG 5-8 Hz activity during sleep deprivation in both genotypes. In contrast to caffeine, modafinil did not reduce EEG slow wave activity (0.75-4.5 Hz) in recovery sleep, yet specifically increased 3.0-6.75 Hz and > 16.75 Hz activity in NREM sleep in the Val/Val genotype of COMT."
# [7] "The Val158Met polymorphism of COMT modulates the effects of modafinil on the NREM sleep EEG in recovery sleep after prolonged wakefulness. The sleep EEG changes induced by modafinil markedly differ from those of caffeine, showing that pharmacological interference with dopaminergic and adenosinergic neurotransmission during sleep deprivation differently affects sleep homeostasis."
str(abstract)
# chr [1:7] "Modafinil may promote wakefulness by increasing cerebral dopaminergic neurotransmission, which importantly depe"| __truncated__ ...
rawXML
# [1] "<?xml version=\"1.0\" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\">\n<PubmedArticleSet>\n<PubmedArticle>\n    <MedlineCitation Status=\"MEDLINE\" Owner=\"NLM\">\n        <PMID Version=\"1\">20815183</PMID>\n        <DateCompleted>\n            <Year>2010</Year>\n            <Month>09</Month>\n            <Day>23</Day>\n        </DateCompleted>\n        <DateRevised>\n            <Year>2019</Year>\n            <Month>05</Month>\n            <Day>13</Day>\n        </DateRevised>\n        <Article PubModel=\"Print\">\n            <Journal>\n                <ISSN IssnType=\"Print\">0161-8105</ISSN>\n                <JournalIssue CitedMedium=\"Print\">\n                    <Volume>33</Volume>\n                    <Issue>8</Issue>\n                    <PubDate>\n                        <Year>2010</Year>\n                        <Month>Aug</Month>\n                    </PubDate>\n                </JournalIssue>\n                <Title>Sleep</Title>\n                <ISOAbbreviation>Sleep</ISOAbbreviation>\n            </Journal>\n            <ArticleTitle>Effects of modafinil on the sleep EEG depend on Val158Met genotype of COMT.</ArticleTitle>\n            <Pagination>\n                <MedlinePgn>1027-35</MedlinePgn>\n            </Pagination>\n            
# <Abstract>\n                
# <AbstractText Label=\"STUDY OBJECTIVES\" NlmCategory=\"OBJECTIVE\">Modafinil may promote wakefulness by increasing cerebral dopaminergic neurotransmission, which importantly depends on activity of catechol-O-methyltransferase (COMT) in prefrontal cortex. The effects of modafinil on sleep homeostasis in humans are unknown. Employing a novel sleep-pharmacogenetic approach, we investigated the interaction of modafinil with sleep deprivation to study dopaminergic mechanisms of sleep homeostasis.</AbstractText>\n                
# <AbstractText Label=\"DESIGN\" NlmCategory=\"METHODS\">Placebo-controlled, double-blind, randomized crossover study.</AbstractText>\n                <AbstractText Label=\"SETTING\" NlmCategory=\"METHODS\">Sleep laboratory in temporal isolation unit.</AbstractText>\n                
# <AbstractText Label=\"PARTICIPANTS\" NlmCategory=\"METHODS\">22 healthy young men (23.4 +/- 0.5 years) prospectively enrolled based on genotype of the functional Val158Met polymorphism of COMT(10 Val/Val and 12 Met/Met homozygotes).</AbstractText>\n                
# <AbstractText Label=\"INTERVENTIONS\" NlmCategory=\"METHODS\">2 x 100 mg modafinil and placebo administered at 11 and 23 hours during 40 hours prolonged wakefulness.</AbstractText>\n                <AbstractText Label=\"MEASUREMENTS AND RESULTS\" NlmCategory=\"RESULTS\">Subjective sleepiness and EEG markers of sleep homeostasis in wakefulness and sleep were equally affected by sleep deprivation in Val/Val and Met/Met allele carriers (placebo condition). Modafinil attenuated the evolution of sleepiness and EEG 5-8 Hz activity during sleep deprivation in both genotypes. In contrast to caffeine, modafinil did not reduce EEG slow wave activity (0.75-4.5 Hz) in recovery sleep, yet specifically increased 3.0-6.75 Hz and &gt; 16.75 Hz activity in NREM sleep in the Val/Val genotype of COMT.</AbstractText>\n                <AbstractText Label=\"CONCLUSIONS\" NlmCategory=\"CONCLUSIONS\">The Val158Met polymorphism of COMT modulates the effects of modafinil on the NREM sleep EEG in recovery sleep after prolonged wakefulness. The sleep EEG changes induced by modafinil markedly differ from those of caffeine, showing that pharmacological interference with dopaminergic and adenosinergic neurotransmission during sleep deprivation differently affects sleep homeostasis.</AbstractText>\n            </Abstract>\n ...

That is, the abstract elements 1:7 are missing their corresponding c("Study Objectives", "Design", "Setting", "Participants", "Interventions", "Measurements and Results", "Conclusions") labels. The result then looks like:

Modafinil may promote wakefulness by increasing cerebral dopaminergic neurotransmission, which importantly depends on activity of catechol-O-methyltransferase (COMT) in prefrontal cortex. The effects of modafinil on sleep homeostasis in humans are unknown. Employing a novel sleep-pharmacogenetic approach, we investigated the interaction of modafinil with sleep deprivation to study dopaminergic mechanisms of sleep homeostasis. Placebo-controlled, double-blind, randomized crossover study. Sleep laboratory in temporal isolation unit. 22 healthy young men (23.4 +/- 0.5 years) prospectively enrolled based on genotype of the functional Val158Met polymorphism of COMT(10 Val/Val and 12 Met/Met homozygotes). 2 x 100 mg modafinil and placebo administered at 11 and 23 hours during 40 hours prolonged wakefulness. Subjective sleepiness and EEG markers of sleep homeostasis in wakefulness and sleep were equally affected by sleep deprivation in Val/Val and Met/Met allele carriers (placebo condition). Modafinil attenuated the evolution of sleepiness and EEG 5-8 Hz activity during sleep deprivation in both genotypes. In contrast to caffeine, modafinil did not reduce EEG slow wave activity (0.75-4.5 Hz) in recovery sleep, yet specifically increased 3.0-6.75 Hz and > 16.75 Hz activity in NREM sleep in the Val/Val genotype of COMT. The Val158Met polymorphism of COMT modulates the effects of modafinil on the NREM sleep EEG in recovery sleep after prolonged wakefulness. The sleep EEG changes induced by modafinil markedly differ from those of caffeine, showing that pharmacological interference with dopaminergic and adenosinergic neurotransmission during sleep deprivation differently affects sleep homeostasis.

Not good.

I didn't find anything in the docs or on Google about rentrez offering an alternative way to parse the PMC XML I am supposed to be using here; it seems to be parse_pubmed_xml or bust.
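As a rough illustration of the <h3> suggestion above, here is a minimal sketch using the XML package directly. It is not rentrez code: the `xml_text` snippet is a shortened, made-up stand-in for the fetched record, and the variable names are assumptions chosen for the example.

```r
library(XML)

# Hypothetical, shortened stand-in for a fetched record (not the real abstract).
xml_text <- '<PubmedArticle><Abstract>
  <AbstractText Label="DESIGN" NlmCategory="METHODS">Crossover study.</AbstractText>
  <AbstractText Label="SETTING" NlmCategory="METHODS">Sleep laboratory.</AbstractText>
</Abstract></PubmedArticle>'

doc   <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//Abstract/AbstractText")

# Named character vector: names carry the labels, values carry the text.
sections <- vapply(nodes, xmlValue, character(1))
names(sections) <- vapply(nodes, xmlGetAttr, character(1), "Label")

# Inline each label as an <h3> heading ahead of its section text.
html <- paste0("<h3>", names(sections), "</h3>\n", sections, collapse = "\n\n")
cat(html, "\n")
```

Keeping the labels as names on the character vector would leave the existing text values untouched while still letting a caller rebuild the sectioned abstract.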

gwern added a commit to gwern/gwern.net that referenced this issue Apr 11, 2021
…aragraph breaks; but still bad because rentrez deletes section headers, filed bug: ropensci/rentrez#170
Member

dwinter commented Apr 12, 2021

Hi @gwern ,

parse_pubmed_xml is the only specialised parser (for anything at all, let alone PubMed XML) in rentrez. I haven't run into these sectioned abstracts before, and because the section names are stored only in the Label attribute, they will indeed be dropped.

I can look into whether parse_pubmed_xml can handle these better while keeping other records working OK. In the meantime, I would parse the XML directly. You could get the labels and the text separately, for instance:

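library(XML)
# parsed=TRUE returns a parsed XML document instead of a character string
parsed_XML <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml", parsed=TRUE)
# one Label attribute and one text value per <AbstractText> node
labels <- sapply(parsed_XML["//Abstract/AbstractText"], xmlGetAttr, "Label")
text   <- sapply(parsed_XML["//Abstract/AbstractText"], xmlValue)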
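One caveat with this approach is that unstructured abstracts have no Label attribute, and xmlGetAttr returns NULL for a missing attribute unless a default is supplied. The sketch below shows one way to keep the result a plain character vector in that case. The `xml_text` record is a made-up example, and pairing the columns in a data frame is just one possible layout.

```r
library(XML)

# Hypothetical unstructured record: a single AbstractText with no Label.
xml_text <- '<PubmedArticle><Abstract>
  <AbstractText>One unlabelled paragraph.</AbstractText>
</Abstract></PubmedArticle>'

doc   <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//Abstract/AbstractText")

# default = NA_character_ keeps the result a character vector even when the
# attribute is absent (without it, xmlGetAttr returns NULL for that node).
labels <- vapply(nodes, xmlGetAttr, character(1), "Label", default = NA_character_)
text   <- vapply(nodes, xmlValue, character(1))

# One row per abstract section; label is NA where the XML had none.
sections <- data.frame(label = labels, text = text, stringsAsFactors = FALSE)
print(sections)
```

Using vapply rather than sapply makes the expected type explicit, so a record with an unexpected shape fails loudly rather than silently returning a list.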

Author

gwern commented Apr 12, 2021

Thanks for the XML code. That works well for me, as I can combine it with the text nicely without too much formatting code:

    ...
    library(XML)
    library(tools)
    parsed_XML <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml", parsed=TRUE)
    labels <-  sapply(parsed_XML["//Abstract/AbstractText"], xmlGetAttr, "Label")
    labelsFormatted <- sapply(tolower(labels),
      function(s) paste0("<strong>", toTitleCase(s), "</strong>"))
    text <- sapply(parsed_XML["//Abstract/AbstractText"], xmlValue)
    combined <- paste0(labelsFormatted, ": ", text, collapse="\n\n")
    ...
    abstract <- { if (length(labels) > 1) { combined; } else { fulltext$abstract; } }

which gives the results I need, such as "<p><strong>Background</strong>: Cannabis from hemp (Cannabis sativa and C. indica) is one of the most common illegal drugs used by drug abusers. Indian cannabis contains around 70 alkaloids, and delta-9-tetrahydrocannabinol (delta-9-THC) is the most psychoactive substance. Animal intoxications occur rarely and are mostly accidental. According to the US Animal Poison Control Center, cannabis intoxication mostly affects dogs (96%). The most common cause of such intoxication is unintentional ingestion of a cannabis product, but it may also occur after the exposure to marijuana smoke.</p> <p><strong>Case Presentation</strong>: A 6-year-old Persian cat was brought to the veterinary clinic due to strong psychomotor agitation turning into aggression. During hospitalisation for 14 days, the cat behaved normally and had no further attacks of unwanted behaviour. It was returned to its home but shortly after it developed neurological signs again and was re-hospitalised. On presentation, the patient showed no neurological abnormalities except for symmetric mydriasis and scleral congestion. During the examination, the behaviour of the cat changed dramatically. It developed alternate states of agitation and apathy, each lasting several minutes. On interview it turned out that the cat had been exposed to marijuana smoke. Blood toxicology tests by gas chromatography tandem mass spectrometry revealed the presence of delta-9-tetrahydrocannabinol (THC) at 5.5 ng/mL, 11-hydroxy-delta-9-THC at 1.2 ng/mL, and 11-carboxy-delta-9-THC at 13.8 ng/mL. The cat was given an isotonic solution of NaCl 2.5 and 2.5% glucose at a dose of 40 mL/kg/day parenterally and was hospitalised. After complete recovery, the cat was returned to it’s owner and future isolation of the animal from marijuana smoke was advised.</p> <p><strong>Conclusions</strong>: This is the first case of a delta-9-tetrahydrocannabinol intoxication in a cat with both description of the clinical findings and the blood concentration of delta-9-THC and its main metabolites.</p>" etc.
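The label/text combination and the fallback for unlabelled abstracts could also be folded into a single helper. The following is only a sketch: `format_sections` is a hypothetical name, it assumes missing labels are represented as NA, and it leaves out the title-casing step.

```r
# format_sections() is a hypothetical helper, not part of rentrez or XML.
# labels: character vector, NA where a section has no label
# text:   character vector of section texts, same length as labels
format_sections <- function(labels, text) {
  labelled <- !is.na(labels) & nzchar(labels)
  # Prefix labelled sections with a bold heading; leave the rest as plain text.
  heading <- ifelse(labelled, paste0("<strong>", labels, "</strong>: "), "")
  paste0(heading, text, collapse = "\n\n")
}

cat(format_sections(c("Background", NA), c("First section.", "Second section.")), "\n")
```

Because the check is made per section, an abstract with no labels at all comes back as plain paragraphs with no special casing needed by the caller.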

You may not have noticed the section monkey business if you don't work with abstracts much, or check them against the PMC version, but they seem to be reasonably common. (I had vaguely noticed the issue before but had put off thinking about it until a reader complained about how unreadable solid blocks of text were for some PMC links.) Watching my rebuild, there were at least 89 PMC links on gwern.net affected by the section omission, so those link annotations will be much more readable now.
