A measurement relation extractor
Marve leverages grobid-quantities and Stanford CoreNLP to extract and normalize measurements and their related entities and descriptors from natural language text.
sample = "The patient returned to Europe at 28 weeks of gestation."
#Simplified output
#-----------------
value: "28"
unit: "weeks"
quantified: "gestation"
related: ["patient", "Europe"]
Marve employs grobid-quantities to find measurement values, units, and a limited set of "quantified" substances using linear CRF models trained on a labeled corpus of text. CoreNLP is then used to link measurements to related words in the sentence using word dependencies and POS tags. Common dependency/POS patterns relating measurements to other words/entities are specified in /marve/dependency_patterns.json and can be adjusted without modifying code.
Running Marve requires grobid-quantities and CoreNLP to be running:
Download and unzip CoreNLP:
curl -LOk "http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip" && unzip <path to CoreNLP>/stanford-corenlp-full-2016-10-31.zip
Run (requires Java 8):
cd <path to CoreNLP>/stanford-corenlp-full-2016-10-31 && java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000
Install grobid by following its installation instructions
Then follow grobid-quantities instructions to install, build, train, and run
Install Marve:
clone this repo or pip install marve
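Before calling Marve, it can help to confirm that both servers accept connections. The following is a minimal sketch using only the Python standard library; the helper name `is_listening` is hypothetical and not part of Marve, and the ports (9000 for CoreNLP, 8080 for grobid) are the ones used elsewhere in this README.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This only checks that something accepts connections on the port; it does
    not verify that the service behind it is CoreNLP or grobid.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

For example, `is_listening("localhost", 9000)` and `is_listening("localhost", 8080)` should both return True once the two servers are up.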
Once both CoreNLP and grobid-quantities are running, Marve can be used as such:
# coding: utf-8
from marve import Measurements as m
# Strings longer than a paragraph should be split before passing to Marve
_test = "The patient returned to Europe at 28 weeks of gestation."
coreNLP = "http://localhost:9000"
grobid = "http://localhost:8080"
patterns = "dependency_patterns.json"
write_to = "sample_output.txt"
m.extract(_test, coreNLP, grobid, patterns, write_to, show_graph=False, pretty=True, simplify=False)
(This example can be found in sample.py)
IMPORTANT: Text should be in sentence or paragraph chunks before passing to Marve.
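One way to prepare longer documents is to split them on blank lines into paragraphs and fall back to a naive sentence split for oversized paragraphs. This is only a sketch: the function name `split_into_chunks` and the `max_chars` threshold are assumptions for illustration, not part of Marve, and the regular-expression sentence split will mishandle abbreviations such as "e.g.".

```python
import re


def split_into_chunks(text, max_chars=1000):
    """Split text into paragraph chunks, breaking long paragraphs into sentences."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
        else:
            # Naive sentence split: break after ., ! or ? followed by whitespace
            chunks.extend(s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s)
    return chunks
```

Each returned chunk could then be passed to m.extract in turn.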
Note: The first time Marve is run, a timeout error might be thrown due to longer CoreNLP model loading times. If this happens, run again and CoreNLP should run properly.
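The "run again" advice can be automated with a small retry wrapper. This is a hedged sketch: `call_with_retry` is a hypothetical helper, and because the exact exception type raised on a first-run timeout is not documented here, it retries on any exception by default.

```python
import time


def call_with_retry(func, attempts=2, delay_seconds=5.0, retry_on=(Exception,)):
    """Call func(); on failure wait and try again, up to `attempts` calls in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except retry_on as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # give CoreNLP time to finish loading models
    raise last_error
```

A call might look like `call_with_retry(lambda: m.extract(_test, coreNLP, grobid, patterns, write_to))`, using the variables from the example above.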
Marve will only return words related to measurements if they meet criteria laid out in the dependency pattern file /marve/dependency_patterns.json.
Take the phrase "a spatial resolution of 10m". Marve uses a graph to represent each sentence, where edges are the dependencies between words (represented in green ovals below) and nodes are words and their part-of-speech (POS) labels (represented in blue).
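To make the graph idea concrete, the phrase can be written out as plain Python data. This is an illustrative sketch only: the POS tags and dependency labels are hand-written to mirror the description here rather than taken from actual CoreNLP output, and the `nodes`, `edges`, and `neighbors` names are not Marve's internal representation.

```python
# Nodes: word -> POS tag (hand-written for illustration, not parser output).
# Using words as keys only works because no word repeats in this phrase.
nodes = {"a": "DT", "spatial": "JJ", "resolution": "NN", "of": "IN", "10": "CD", "m": "NN"}

# Edges: (head, dependent, dependency label)
edges = [
    ("resolution", "a", "det"),
    ("resolution", "spatial", "amod"),
    ("resolution", "m", "nmod:of"),
    ("m", "of", "case"),
    ("m", "10", "nummod"),
]


def neighbors(word):
    """Return (other_word, label) for every edge touching `word`, ignoring direction."""
    found = []
    for head, dependent, label in edges:
        if head == word:
            found.append((dependent, label))
        elif dependent == word:
            found.append((head, label))
    return found
```

Starting from the unit "m", `neighbors("m")` reaches the value "10" through nummod and the quantified word "resolution" through nmod:of.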
There are a handful of general patterns that relate measurement units, values, and other related words or entities in a sentence. For instance, units are generally connected to values via the numerical modifier ("nummod") dependency (see above). The nominal modifier ("nmod") is then a common dependency linking units to the thing being quantified. Common patterns linking values, units, and related words have been defined in /marve/dependency_patterns.json, and the bit of JSON that would match "m" to "resolution" in our above example is:
"nmod": {
"enhanced": true,
"of":{
"measurement_types": ["space_between", "attached"],
"pos_in":{
"NN": null
}
}
}
Here's how this example matches:
- nmod is the dependency type between m and resolution.
- Since we utilized CoreNLP's enhanced dependency parser, we also see :of attached on the end of nmod. Since enhanced is set to true, the of must be attached to the dependency.
- Because the measurement is 10m, we identify it as being an attached measurement_type.
- "pos_in" forces the part of speech of the attached node to contain at least one of its keys. In this case, "NN" means the part of speech must be a noun (valid POS tags could be: NN, NNS, NNP, NNPS). Since NN's value is null, we are finished and can return resolution as a related word and add it to the output. For some POS tags such as VB, we might need to continue traversing edges in the graph, in which case the value could specify a function to be called (e.g. get_cousin()).
All such dependency patterns listed in the JSON will be evaluated, and the words matched by any of them will be added as related words for a measurement.
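The matching steps described above can be sketched as a small function. This is a simplified reading of the pattern format, not Marve's actual implementation: the `matches` function is hypothetical, it ignores non-null `pos_in` values (such as a function to call for further traversal), and the handling of patterns without "enhanced" is an assumption.

```python
import json

# The pattern fragment from the example above
PATTERNS = json.loads("""
{
  "nmod": {
    "enhanced": true,
    "of": {
      "measurement_types": ["space_between", "attached"],
      "pos_in": {"NN": null}
    }
  }
}
""")


def matches(patterns, edge_label, measurement_type, pos_tag):
    """Return True if an edge satisfies one of the patterns (simplified reading)."""
    base, _, suffix = edge_label.partition(":")  # "nmod:of" -> ("nmod", ":", "of")
    pattern = patterns.get(base)
    if pattern is None:
        return False
    if pattern.get("enhanced"):
        # Enhanced patterns require the suffix (e.g. "of") to be listed
        rule = pattern.get(suffix)
        if not isinstance(rule, dict):
            return False
    else:
        # Assumption: non-enhanced patterns keep their rule at the top level
        rule = pattern
    if measurement_type not in rule.get("measurement_types", []):
        return False
    # The POS tag must contain at least one pos_in key, so "NN" also matches "NNS"
    return any(key in pos_tag for key in rule.get("pos_in", {}))
```

For the example phrase, `matches(PATTERNS, "nmod:of", "attached", "NN")` is True, while a bare "nmod" edge or a "VB" node would not match this fragment.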
extract(content, corenlp, grobid, dependency_patterns_file, output_file=None, show_graph=False, pretty=False, simplify=False)
Returns extracted measurements from a sentence or paragraph.
Parameters: content: string
Sentence or paragraph to extract measurements from.
corenlp: string
CoreNLP server endpoint (e.g. "http://localhost:9000").
grobid: string
Grobid server endpoint (e.g. "http://localhost:8080").
dependency_patterns_file: string
Filepath to JSON file containing valid dependency/POS patterns for
extracting words and entities related to measurements.
output_file: string, optional
File to write extracted measurement output to.
show_graph: boolean, optional
If True, a visualization of the dependency and POS network graph
will be displayed for each sentence parsed.
pretty: boolean, optional
If True, JSON written to file will be indented. If False, one extraction
doc per line will be written to the output file.
simplify: boolean, optional
If True, only the measurement, unit, and related words of the extracted
output will be returned and written to the output file (see 'Output Options'
section for more detail).
Returns: dict: see "Output Options" below
{"value": 6, "unit": "year", "quantified": {}, "related": {"period": ["study"]}}
{
"type": "value",
"quantity": {
"parsedValue": 6,
"rawValue": "six",
"rawUnit": {
"offsetStart": 13,
"offsetEnd": 14,
"tokenIndices": [
"3"
],
"after": " ",
"name": "year"
},
"offsetEnd": 131,
"offsetStart": 128,
"tokenIndex": 24,
"type": "time"
},
"related": [
{
"rawName": "period",
"connector": "",
"offsetEnd": 21,
"relationForm": "compound",
"offsetStart": 15,
"tokenIndex": 5,
"descriptors": [
{
"rawName": "year",
"tokenIndex": "4"
}
]
}
]
}
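When pretty is False, the output file holds one extraction doc per line, so it can be read back line by line. A minimal sketch, assuming each non-empty line is a complete JSON object; the `read_extractions` name is hypothetical and not part of Marve.

```python
import json


def read_extractions(path):
    """Yield one extraction dict per non-empty line of a pretty=False output file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For instance, `list(read_extractions("sample_output.txt"))` would load every extraction written by the earlier example, provided it was run with pretty=False.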
If you use this work, please cite:
@inproceedings{hundmanmarve17,
title={Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science},
author={Hundman, Kyle and Mattmann, Chris A},
booktitle={Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
year={2017},
organization={ACM}
}
Marve is distributed under Apache 2.0 license.
Contact: Kyle Hundman
- Chris Mattmann, JPL
- Sonny Koliwad, JPL
- Jason Hyon, JPL
- Ian Colwell, JPL