Grounders
In Eidos, all grounding is done against an ontology. Internally, each ontology is represented as a DomainOntology. The DomainOntology still has the original yaml ontology content; that is, it still contains the String examples, etc. However, this format isn't particularly computationally efficient, so instead of interacting with it directly, we use an OntologyGrounder (note that, in practice, all grounders extend the EidosOntologyGrounder subclass). The EidosOntologyGrounder is initialized with a DomainOntology, and it also derives more efficient representations of the ontology node examples and patterns:
- conceptEmbeddings: these are the internal representations of the ontology nodes in terms of their examples. Each node has a ConceptEmbedding, which essentially has a name and a vector representation that is more or less the average of the example/definition terms. See the class definition for more detail.
- conceptPatterns: similar to the above, each node in the ontology is represented here in terms of its (optional) regex. Thus, each ConceptPattern essentially has a name and an optional array of Regex.
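For orientation, here is a minimal sketch of what these derived representations might look like; the names, fields, and helper below are illustrative assumptions, and the actual ConceptEmbedding and ConceptPatterns definitions in the Eidos source may differ.

```scala
import scala.util.matching.Regex

// Sketch: one ontology node reduced to a name plus the averaged embedding of
// the non-stopword terms from its examples and definition.
case class ConceptEmbedding(name: String, embedding: Array[Float])

// Sketch: one ontology node reduced to a name plus its (optional) regexes.
case class ConceptPattern(name: String, patterns: Option[Array[Regex]])

// Hypothetical helper that averages a set of word vectors (all the same length).
def averageEmbedding(vectors: Seq[Array[Float]]): Array[Float] = {
  require(vectors.nonEmpty, "need at least one vector to average")
  val dim = vectors.head.length
  val sums = new Array[Float](dim)
  for (v <- vectors; i <- 0 until dim) sums(i) += v(i)
  sums.map(_ / vectors.length)
}
```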
There are essentially two different Grounders in active use, selected here:
- Flat grounding is done with the InterventionSieveGrounder
- Compositional grounding is done with the SRLCompositionalGrounder
Flat grounding is way simpler than compositional grounding, but it pushes the complexity to the ontology. That is, more complex concepts must each have their own node in the ontology, and a given mention is aligned to one node at a time. (In practice we return the top k, as there is always some uncertainty involved in automated grounding.)
The WM flat ontology has two main "branches" -- the interventions and the main ontology. Since the interventions were so numerous and at such a different level of granularity, the grounding algorithm first considers the main branch of the ontology, and only looks at the intervention branch if certain conditions are met.
The grounding essentially does the following (see the sketch after this list):
- First tries to match a main-branch ontology node regex. If one matches, it is considered a perfect match and the node is selected.
- If there are no main-branch regex matches, then the word-embedding representation of the mention (the average of the embeddings of the tokens in the canonical name) is compared to all main-branch nodes (their embedding representation is similarly the average of all non-stopwords in the examples and definition).
- Then, the grounder checks whether the mention should also be grounded against the intervention branch. This is currently done through pattern matching, using these patterns plus any included in that branch of the ontology.
- If allowed to match against the intervention branch, then again the branch is first checked for node regex matches and then for embedding matches.
- If the algorithm has not yet returned a grounding, then finally all groundings are combined, sorted, and the top k are returned.
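A minimal sketch of that control flow, assuming hypothetical helper functions (regexMatchMain, embeddingMatchMain, shouldTryInterventions, and so on) that stand in for the actual Eidos methods; passing the matching steps in as functions is just a way to keep the sketch self-contained.

```scala
// Hypothetical types and helpers; the real flat grounder in Eidos differs in detail.
case class Grounding(nodeName: String, score: Float)

def groundFlat(
    canonicalName: String,
    k: Int,
    regexMatchMain: String => Seq[Grounding],         // perfect matches on the main branch
    embeddingMatchMain: String => Seq[Grounding],      // cosine-similarity matches on the main branch
    shouldTryInterventions: String => Boolean,         // pattern-based gate for the intervention branch
    regexMatchInterventions: String => Seq[Grounding],
    embeddingMatchInterventions: String => Seq[Grounding]
): Seq[Grounding] = {
  // 1. Regex matches on the main branch count as perfect matches.
  val mainRegex = regexMatchMain(canonicalName)
  if (mainRegex.nonEmpty) return mainRegex.take(k)

  // 2. Otherwise, fall back to embedding similarity against all main-branch nodes.
  val mainEmbeddings = embeddingMatchMain(canonicalName)

  // 3. Only consult the intervention branch when the gating patterns fire.
  val interventionGroundings =
    if (shouldTryInterventions(canonicalName)) {
      val interventionRegex = regexMatchInterventions(canonicalName)
      if (interventionRegex.nonEmpty) interventionRegex
      else embeddingMatchInterventions(canonicalName)
    } else Seq.empty

  // 4. Combine, sort by score, and return the top k.
  (mainEmbeddings ++ interventionGroundings).sortBy(-_.score).take(k)
}
```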
Compositional grounding is a much more complicated algorithm, as the complexity burden has shifted from the ontology to the grounding process. To understand the approach, you should first understand the compositional grounding representation. Each compositional grounding is a 4-tuple, where the slots represent (in order):
- the theme of the concept
- any property of that theme
- the process that applies to or acts on the theme, if any
- any property of that process
The case class that stores this representation is a PredicateTuple. This is the primary target of the compositional grounder: given an EidosMention, return one or more PredicateTuples (which are wrapped into a PredicateGrounding for consistency). Note that you can simplify this if desired! Currently, in addition to the individual groundings for the 4 slots each having scores, the overall tuple also has an aggregated score.
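As an illustration of the representation (the class and node names below are made up for this sketch; the real PredicateTuple wraps scored groundings rather than bare strings), a phrase like "reduced crop yields" might fill the slots roughly as follows.

```scala
// Purely illustrative stand-in for PredicateTuple, with hypothetical WM-style node names.
case class ExamplePredicateTuple(
  theme: Option[String],          // what the concept is about
  themeProperty: Option[String],  // a property of that theme
  process: Option[String],        // a process acting on the theme
  processProperty: Option[String] // a property of that process
)

// "reduced crop yields": the theme is the crop, its property is yield,
// and the process acting on it is a decrease; no process property here.
val reducedCropYields = ExamplePredicateTuple(
  theme = Some("wm/concept/agriculture/crop"),
  themeProperty = Some("wm/property/yield"),
  process = Some("wm/process/decrease"),
  processProperty = None
)
```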
In order to generate these compositional groundings, we rely heavily on the Semantic Role Labeling (SRL) provided through the CluLab processors library. The approach works as follows:
    if no predicatesAndArguments:
        // First try grounding to the Property branch
        groundToBranches(text, Property)
        // If we use up everything, just return a Property grounding
        if (nothing left):
            return groundings
            stop
        // Otherwise, continue trying to ground to the Concept and Process branches w/ remaining text
        else:
            groundToBranches(remainingText, Seq(Concept, Process))
            return groundings
            stop

    if predicatesAndArguments:
        // Try to ground the entire mention text before separating it into preds/args individually.
        // Match the mention text against the node names in the ontology.
        // For example, try to ground "climate change" as a whole before splitting it into "climate" and "change".
        // This will match the existing ontology node "climate_change", which is not the top grounding for either pred/arg separately.
        // If the entire mention text matches multiple nodes, keep the match that uses the most material from the mention.
        exactMatchGrounding = getExactMatches(text)
        // If we use up everything, just return the exact match grounding
        if (nothing left):
            return exactMatchGrounding
            stop
        // Otherwise, continue trying to ground the remaining text
        else:
            allGroundings = [exactMatchGrounding]
            for (item <- remainingText):
                isArgument = true if (incoming SRL edges AND no outgoing SRL edges)
                isPredicate = true if (no incoming SRL edges OR there are outgoing SRL edges)
                // Try to ground to the Property branch first
                groundToBranches(item, Property)
                if properties nonEmpty:
                    allGroundings.append(grounding)
                    continue to next item
                // If it's not a Property, move on
                else:
                    // If it's an argument, ground to Concept
                    if isArgument:
                        groundToBranches(item, Concept)
                        allGroundings.append(grounding)
                    // If it's a predicate, ground to Process
                    elif isPredicate:
                        groundToBranches(item, Process)
                        allGroundings.append(grounding)
            return allGroundings

    def groundToBranches(mentionText, branches):
        // Get exact matches based on node name
        val exactMatchGroundings = getExactMatches
        // Get pattern matches based on regex patterns
        val patternMatchGroundings = getPatternMatches
        // Proceed to w2v:
        // get the min edit distance for each node
        getExampleScores
        // get the w2v similarity score for each node
        getEmbeddingScores
        // Use a function to combine the edit distance and cosine similarity scores for each node
        val w2vGroundings = comboScore(getExampleScores, getEmbeddingScores)
        // Return all matched groundings; they will be filtered and sorted later
        return exactMatchGroundings ++ patternMatchGroundings ++ w2vGroundings
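The exact weighting inside comboScore lives in the Eidos source; the following is a rough sketch only, assuming a simple linear interpolation between a normalized edit-distance score and the embedding cosine similarity (the weight, the normalization, and the map-based signature are assumptions, not the real implementation).

```scala
// Hypothetical combination of a per-node edit-distance score and a per-node
// embedding (cosine) similarity; the real comboScore in Eidos may weight and
// normalize these differently.
def comboScore(
    exampleScores: Map[String, Float],   // node name -> normalized edit-distance score in [0, 1]
    embeddingScores: Map[String, Float], // node name -> cosine similarity
    editWeight: Float = 0.5f             // assumed weight, not taken from the Eidos source
): Map[String, Float] = {
  val nodes = exampleScores.keySet ++ embeddingScores.keySet
  nodes.map { node =>
    val edit = exampleScores.getOrElse(node, 0f)
    val embed = embeddingScores.getOrElse(node, 0f)
    node -> (editWeight * edit + (1 - editWeight) * embed)
  }.toMap
}
```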
Everything below is an older implementation of the compositional grounding algorithm, kept for historical reasons.
- Given a mention (which corresponds to a span from the sentence), the valid predicates are found. These are the tokens that (a) are identified by the SRL as predicates and (b) occur within the span of the mention.
- If no valid predicates are found, we consider the mention span to be essentially a theme, though we check for any properties that may be there and attach them if found. This results in a 4-tuple that has either only one slot filled (the theme) or two (when the theme has a property).
- However, if there are predicates, we consider each of them in order of shortest graph distance to the syntactic root of the sentence. For each we:
  - Generate distinct paths through the SRL graph from that predicate to each of its leaf themes.
  - Iterate through each of these paths and look for properties. If any are found, we convert the property to an "attachment" on its theme (in SRL, properties tend to be predicates) and pop it from the path list.
  - Iterate through the remaining nodes in the SRL graph path, again starting from the deepest node, and create the tuple. The deepest node is taken as the theme, and if there is a second deepest, it is taken as the process. Since properties were previously converted to attachments on the nodes, at this point any that exist are included in the 4-tuple. If there was only one thing in the list, then it is by default the theme.