For extracting measurements and related entities from text

khundman/marve


Marve

A measurement relation extractor

Marve leverages grobid-quantities and Stanford CoreNLP to extract and normalize measurements and their related entities and descriptors from natural language text.

	
	sample = "The patient returned to Europe at 28 weeks of gestation."

	#Simplified output
	#-----------------
	value: "28"
	unit: "weeks"
	quantified: "gestation"
	related: ["patient", "Europe"]

Marve uses grobid-quantities to find measurement values, units, and a limited set of "quantified" substances, relying on linear CRF models trained on a labeled corpus of text. CoreNLP then links each measurement to related words in the sentence using word dependencies and POS tags. Common dependency/POS patterns relating measurements to other words and entities are specified in /marve/dependency_patterns.json and can be adjusted without modifying code.

Installation

Running Marve requires grobid-quantities and CoreNLP to be running:

Download and unzip CoreNLP:

curl -LOk "http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip" && unzip <path to CoreNLP>/stanford-corenlp-full-2016-10-31.zip

Run (requires Java 8):

cd <path to CoreNLP>/stanford-corenlp-full-2016-10-31 && java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000

Install grobid by following its installation instructions

Then follow grobid-quantities instructions to install, build, train, and run

Install Marve:
clone this repo or pip install marve

Usage

Once both CoreNLP and grobid-quantities are running, Marve can be used as follows:

# coding: utf-8

from marve import Measurements as m

# Strings longer than a paragraph should be split before passing to Marve
_test = "The patient returned to Europe at 28 weeks of gestation."

coreNLP = "http://localhost:9000"
grobid = "http://localhost:8080"
patterns = "dependency_patterns.json"
write_to = "sample_output.txt"

m.extract(_test, coreNLP, grobid, patterns, write_to, show_graph=False, pretty=True, simplify=False)

(This example can be found in sample.py)

IMPORTANT: Text should be in sentence or paragraph chunks before passing to Marve.
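
One way to prepare a longer document is to split it on blank lines and pass each paragraph separately. The following is a minimal sketch: `split_paragraphs` is a hypothetical helper, not part of Marve, and it assumes paragraphs in the input are separated by blank lines.

```python
import re

def split_paragraphs(text):
    """Split text into paragraph chunks on one or more blank lines."""
    chunks = re.split(r"\n\s*\n", text)
    # Collapse line breaks inside a paragraph and drop empty chunks.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

document = "First paragraph, 10 m wide.\n\nSecond paragraph,\nlasting 6 years."

for paragraph in split_paragraphs(document):
    # Each paragraph could then be passed to m.extract(...) as in the usage example.
    print(paragraph)
```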

Note: The first time Marve is run, a timeout error might be thrown because CoreNLP takes longer to load its models on the first request. If this happens, run Marve again and CoreNLP should respond properly.
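
If rerunning by hand is inconvenient, the first call can be wrapped in a small retry loop. This is only a sketch: `call_with_retry` is a hypothetical helper, and because the exact exception type raised on a timeout is not documented here, it catches `Exception` by default (an assumption you may want to narrow).

```python
import time

def call_with_retry(func, attempts=3, delay=2.0, exceptions=(Exception,)):
    """Call func(); on failure wait `delay` seconds and retry, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay)

# Hypothetical usage with the variables from the usage example:
# result = call_with_retry(
#     lambda: m.extract(_test, coreNLP, grobid, patterns, write_to)
# )
```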

Dependency and Part-of-Speech Patterns

Marve will only return words related to measurements if they meet criteria laid out in the dependency pattern file /marve/dependency_patterns.json.

Take the phrase "a spatial resolution of 10m". Marve uses a graph to represent each sentence, where edges are the dependencies between words (represented in green ovals below) and nodes are words and their part-of-speech (POS) labels (represented in blue).

[Figure: example dependency and POS graph for "a spatial resolution of 10m"]

There are a handful of general patterns that relate measurement units, values, and other related words or entities in a sentence. For instance, units are generally connected to values via the numerical modifier ("nummod") dependency (see above). The nominal modifier ("nmod") dependency is then a common link between units and the thing being quantified. Common patterns linking values, units, and related words have been defined in /marve/dependency_patterns.json, and the bit of JSON that would match "m" to "resolution" in our above example is:

"nmod": {
    "enhanced": true,
    "of":{
        "measurement_types": ["space_between", "attached"],
        "pos_in":{
            "NN": null
        }
    }
}

Here's how this example matches:

  1. nmod is the dependency type between m and resolution

  2. Since we utilized CoreNLP's enhanced dependency parser, we also see :of attached on the end of nmod. Since enhanced is set to true, the of must be attached to the dependency

  3. Because the measurement is 10m we identify it as being an attached measurement_type

  4. "pos_in" forces the part of speech of the attached node to contain at least one of its keys. In this case, "NN" means the part of speech must be a noun (valid POS tags could be: NN, NNS, NNP, NNPS). Since NN's value is null, we are finished and can return resolution as a related word and add it to the output. For some POS tags such as VB, we might need to continue traversing edges in the graph, in which case the value could specify a function to be called (e.g. get_cousin())

All dependency patterns listed in the JSON are evaluated, and any words matched by a pattern are added as related words for the measurement.
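
The four matching steps above can be sketched in a few lines of Python. This is an illustration of the logic as described in this section, not Marve's actual implementation: `match_edge` is a hypothetical function, the layout assumed for non-enhanced rules (keys directly under the dependency name) is a guess, and non-null `pos_in` values (such as a function name for further traversal) are ignored.

```python
import json

# The pattern from the example above, loaded the way it might be read from the patterns file.
PATTERNS = json.loads("""
{
  "nmod": {
    "enhanced": true,
    "of": {
      "measurement_types": ["space_between", "attached"],
      "pos_in": {"NN": null}
    }
  }
}
""")

def match_edge(patterns, dependency, measurement_type, pos):
    """Return True if an edge satisfies one of the patterns.

    dependency: label such as "nmod:of" (enhanced) or "nummod"
    measurement_type: e.g. "attached" for "10m"
    pos: POS tag of the candidate related word
    """
    # Step 1: look up the base dependency type.
    base, _, specifier = dependency.partition(":")
    rule = patterns.get(base)
    if rule is None:
        return False
    # Step 2: enhanced rules require a known specifier such as "of".
    if rule.get("enhanced"):
        rule = rule.get(specifier)
        if not isinstance(rule, dict):
            return False
    # Step 3: the measurement type must be allowed by the rule.
    if measurement_type not in rule.get("measurement_types", []):
        return False
    # Step 4: the POS tag must contain at least one "pos_in" key (NN also matches NNS, NNP, ...).
    return any(key in pos for key in rule.get("pos_in", {}))

print(match_edge(PATTERNS, "nmod:of", "attached", "NN"))   # the edge between "m" and "resolution"
print(match_edge(PATTERNS, "nmod:of", "attached", "VBD"))  # a verb does not satisfy "pos_in"
```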

API

extract(content, corenlp, grobid, dependency_patterns_file, output_file=None, show_graph=False, pretty=False, simplify=False)

Returns extracted measurements from a sentence or paragraph.

Parameters: 	content: string
						Sentence or paragraph to extract measurements from.

				corenlp: string
						CoreNLP server endpoint (e.g. "http://localhost:9000").

				grobid: string
						Grobid server endpoint (e.g. "http://localhost:8080").

				dependency_patterns_file: string
						Filepath to JSON file containing valid dependency/POS patterns for 
						extracting words and entities related to measurements.

				output_file: string, optional
						File to write extracted measurement output to.

				show_graph: boolean, optional
						If True, a visualization of the dependency and POS network graph 
						will be displayed for each sentence parsed.

				pretty: boolean, optional
						If True, JSON written to file will be indented. If False, one extraction 
						doc per line will be written to the output file

				simplify: boolean, optional 
						If True, only the measurement, unit, and related words of the extracted 
						output will be returned and written to the output file (see 'Output Options' 
						section for more detail).

Returns:		dict: see "Output Options" below

Output Options

simplify=True

{"value": 6, "unit": "year", "quantified": {}, "related": {"period": ["study"]}}
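
As a rough illustration of consuming the simplified form, the sketch below turns such a dict into a one-line summary. `describe` is a hypothetical helper, and it assumes "related" is a dict mapping each related word to a list of associated words, as in the example above.

```python
simplified = {"value": 6, "unit": "year", "quantified": {}, "related": {"period": ["study"]}}

def describe(doc):
    """Return a one-line, human-readable summary of a simplified extraction."""
    summary = "{} {}".format(doc["value"], doc["unit"])
    related = doc.get("related") or {}
    if related:
        words = [
            "{} ({})".format(word, ", ".join(extra)) if extra else word
            for word, extra in related.items()
        ]
        summary += " -> " + "; ".join(words)
    return summary

print(describe(simplified))
```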

simplify=False, pretty=True

{
	"type": "value",
	"quantity": {
		"parsedValue": 6,
		"rawValue": "six",
		"rawUnit": {
			"offsetStart": 13,
			"offsetEnd": 14,
			"tokenIndices": [
				"3"
			],
			"after": " ",
			"name": "year"
		},
		"offsetEnd": 131,
		"offsetStart": 128,
		"tokenIndex": 24,
		"type": "time"
	},
	"related": [
		{
			"rawName": "period",
			"connector": "",
			"offsetEnd": 21,
			"relationForm": "compound",
			"offsetStart": 15,
			"tokenIndex": 5,
			"descriptors": [
				{
					"rawName": "year",
					"tokenIndex": "4"
				}
			]
		}
	]
}
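
When pretty=False, the output file holds one extraction document per line, so it can be read back line by line. The sketch below assumes that layout and the field names shown in the example above; `read_extractions` and `related_names` are hypothetical helpers, and an in-memory string stands in for the output file.

```python
import io
import json

def read_extractions(lines):
    """Parse one JSON extraction document per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

def related_names(doc):
    """Return (value, unit, [related rawName, ...]) from a full extraction document."""
    quantity = doc["quantity"]
    unit = quantity.get("rawUnit", {}).get("name")
    names = [rel["rawName"] for rel in doc.get("related", [])]
    return quantity["parsedValue"], unit, names

# Stand-in for open("sample_output.txt"); the content is a trimmed copy of the example above.
sample_file = io.StringIO(
    '{"type": "value", "quantity": {"parsedValue": 6, "rawValue": "six", '
    '"rawUnit": {"name": "year"}, "type": "time"}, '
    '"related": [{"rawName": "period", "relationForm": "compound"}]}\n'
)

for doc in read_extractions(sample_file):
    print(related_names(doc))
```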

Citation:

If you use this work, please cite:

@inproceedings{hundmanmarve17,
  title={Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science},
  author={Hundman, Kyle and Mattmann, Chris A},
  booktitle={Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year={2017},
  organization={ACM}
}

License

Marve is distributed under Apache 2.0 license.

Contact: Kyle Hundman

Acknowledgements
