Indexing documents
Index documents annotated with AlvisNLP/ML using the module AlvisIRIndexer.
<alvisnlp-plan id="test">
...
<index class="AlvisIRIndexer">
<indexDir>...</indexDir>
<tokenPositionGap>...</tokenPositionGap>
<fieldNames>...</fieldNames>
<relations>
<rel>arg1,arg2</rel>
...
</relations>
<propertyKeys>...</propertyKeys>
<documents>
...
</documents>
</index>
</alvisnlp-plan>
indexDir: Path to the directory where the index will be written. Any previous index in this directory will be cleared.
tokenPositionGap: Estimated annotation density.
By default this parameter has a value of 256.
If indexing fails, try raising this value.
fieldNames: Names of all indexed and stored fields, separated by commas. Fields whose name is not included in this list will not be indexed.
relations: Names of all indexed relations and their arguments. Relations whose name is not included in this list will not be indexed. Relations whose arguments do not match this declaration will not be indexed.
propertyKeys: Names of all indexed token properties. Properties whose name is not included in this list will not be stored.
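As an illustration, the module parameters might be filled in as follows. This is only a sketch: the directory path, the field names, the relation name `interaction` with its roles `agent` and `target`, and the property key `lemma` are hypothetical placeholders. It also assumes that, in `relations`, the tag name stands for the relation name and the content lists its argument names.

```xml
<index class="AlvisIRIndexer">
  <!-- hypothetical output directory; any previous index there is cleared -->
  <indexDir>output/index</indexDir>
  <!-- raised from the default of 256, for instance after a failed run -->
  <tokenPositionGap>512</tokenPositionGap>
  <!-- only fields named here are indexed -->
  <fieldNames>title,abstract</fieldNames>
  <relations>
    <!-- assumed reading: tag = relation name, content = argument names -->
    <interaction>agent,target</interaction>
  </relations>
  <!-- only properties named here are stored -->
  <propertyKeys>lemma</propertyKeys>
  <documents>
    ...
  </documents>
</index>
```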
<documents>
<instances>...</instances>
<fields>
...
</fields>
...
</documents>
instances: An Expression evaluated from the corpus as a list of elements. Each element is indexed as a document.
By default this expression is documents, meaning all documents in the corpus are indexed.
To select certain documents, use an expression like documents[COND].
Each fields property specifies a set of document fields to store and index.
At least one fields specification is mandatory.
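A minimal sketch of a document selection is shown below. The feature name `lang` and its value are hypothetical, and the `==` comparison is an assumption about the expression language; any boolean expression can take the place of COND.

```xml
<documents>
  <!-- hypothetical condition: keep only documents whose feature lang is "en" -->
  <instances>documents[@lang == "en"]</instances>
  <!-- at least one fields specification is mandatory -->
  <fields>
    ...
  </fields>
</documents>
```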
<fields normalization="...">
<instances>...</instances>
<field-name>...</field-name>
<field-value>...</field-value>
<tokens>
...
</tokens>
<annotations>
...
</annotations>
<keyword>...</keyword>
</fields>
normalization: Normalization filters to apply to each indexed token and annotation in this field.
By default the normalization is case-folding,ascii-folding,english-stemming.
instances: An Expression evaluated from the document element as a list of elements. Each element is indexed and stored as a field.
By default this expression is sections, meaning all sections in the document are indexed.
To select certain sections, use an expression like sections[COND] or sections:NAME.
field-name: An Expression evaluated from the section element as a string. The string result is the name of the field.
By default this expression is @name, meaning the value of the feature name.
To give the same name to all fields in this field specification, use a string literal like "...".
field-value: An Expression evaluated from the section element as a string. The string result is the value (contents) of the field.
By default this expression is contents, meaning the contents of the section.
Setting this property is not recommended, unless using the keyword property.
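The following sketch shows one possible field specification built from the properties described above. The section name `title` and the field name are hypothetical; the normalization value is simply the documented default written out explicitly.

```xml
<fields normalization="case-folding,ascii-folding,english-stemming">
  <!-- select only the sections named "title" (hypothetical section name) -->
  <instances>sections:title</instances>
  <!-- string literal: every selected section goes into the same field -->
  <field-name>"title"</field-name>
  <!-- field-value is left unset, so it defaults to contents -->
</fields>
```

A field name used here would presumably also need to appear in the fieldNames parameter, since fields not listed there are not indexed.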
tokens: Specifies the base tokens to index.
By default the base tokens are annotations in the words layer, using their coordinates in the section contents and their feature form.
Setting this property is not recommended.
annotations: Specifies the annotations to index.
<annotations>
<instances>...</instances>
<identifier>...</identifier>
<text>...</text>
<fragments>
...
</fragments>
<arguments>
...
</arguments>
<properties>
...
</properties>
</annotations>
instances: An Expression evaluated from the section element as a list of elements. Each element is indexed as an additional token.
To select annotations in a layer, use an expression like layer:NAME.
To select tuples in a relation, use an expression like relations:REL.tuples.
identifier: An Expression evaluated from the annotation element as a string. The result is used as the token's identifier.
By default the identifier is id:unique, a generated identifier unique for the element.
text: An Expression evaluated from the annotation element as a string. The result is used as the token's value; this is the text that is actually indexed.
By default the text is @form, the annotation's surface form.
To work properly in the search engine, the following form is recommended: "{entitytype}" ^ @id. Here entitytype is the type of entity, and id is the feature containing the entity concept identifier.
If concepts are organized in a hierarchy, then this feature should be the path from the root concept to the entity concept (do not forget to append the separator at the end).
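As a sketch of the recommended form, an annotation specification for entities might look like the fragment below. The layer name `genes` and the entity type `gene` are hypothetical; `@id` is assumed to hold the concept identifier (or the root-to-concept path with a trailing separator when concepts are hierarchical).

```xml
<annotations>
  <!-- hypothetical layer holding the entity annotations -->
  <instances>layer:genes</instances>
  <!-- the documented default, written out explicitly -->
  <identifier>id:unique</identifier>
  <!-- recommended form: entity type in braces, concatenated with the concept identifier -->
  <text>"{gene}" ^ @id</text>
</annotations>
```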
<fragments>
<instances>...</instances>
<start>...</start>
<end>...</end>
</fragments>
instances is an Expression evaluated from the annotation element as a list of elements. Each result element represents a fragment in the text.
By default instances is $, meaning a single fragment represented by the annotation element itself.
If the indexed annotation element is a tuple, set instances to sort:ival(args, start), the tuple arguments ordered by occurrence in the text.
start and end are expressions evaluated from the fragment element as numbers. They represent the position of the fragment in the field contents.
By default they are start and end, the annotation coordinates.
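For a relation tuple, the fragment settings described above could be combined as in this sketch. The relation name `interaction` is hypothetical; the start and end values are just the documented defaults written out.

```xml
<annotations>
  <!-- hypothetical relation: index each of its tuples as a token -->
  <instances>relations:interaction.tuples</instances>
  <fragments>
    <!-- one fragment per tuple argument, ordered by occurrence in the text -->
    <instances>sort:ival(args, start)</instances>
    <!-- coordinates of each argument annotation -->
    <start>start</start>
    <end>end</end>
  </fragments>
</annotations>
```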
<arguments>
<arg1>...</arg1>
<arg2>...</arg2>
</arguments>
Specifies the relation arguments if this annotation represents a relation.
The tags arg1 and arg2 must exactly match the argument names declared in the parameter relations.
The values are Expressions evaluated from the annotation element as a string.
This string is the identifier of the indexed annotation that represents the argument.
Usually it is args:ROLE.id:unique.
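The correspondence between the relations parameter and the arguments tags can be sketched as follows. The relation `interaction` and its roles `agent` and `target` are hypothetical names, and the reading of the relations parameter (tag = relation name, content = argument names) is an assumption.

```xml
<!-- module parameter: declares the argument names of the relation -->
<relations>
  <interaction>agent,target</interaction>
</relations>

<!-- inside a fields specification: the tags reuse the declared argument names -->
<annotations>
  <instances>relations:interaction.tuples</instances>
  <arguments>
    <!-- identifier of the indexed annotation that fills each role -->
    <agent>args:agent.id:unique</agent>
    <target>args:target.id:unique</target>
  </arguments>
</annotations>
```

This only resolves if the argument annotations are themselves indexed with id:unique as their identifier, which is the default.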
<properties>
<key1>...</key1>
<key2>...</key2>
...
</properties>
Specifies the annotation properties, if any.
The tags key1, key2, etc. must exactly match the property keys declared in the parameter propertyKeys.
The values are Expressions evaluated from the annotation element as a string. This string is the value of the property.
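A minimal sketch of a property declaration and its use is given below. The key `lemma` and the feature `@lemma` are hypothetical; any key works as long as it is listed in propertyKeys.

```xml
<!-- module parameter: properties not listed here are not stored -->
<propertyKeys>lemma</propertyKeys>

<!-- inside an annotations specification: the tag is the declared key -->
<properties>
  <!-- the property value is read from a feature of the annotation -->
  <lemma>@lemma</lemma>
</properties>
```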
keyword: An Expression evaluated from the field element as a string. The field will contain a single token specified by this expression.
Keyword fields are useful for storing information about the document that may be retrieved but not necessarily indexed and queried.
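A keyword field holding a document identifier might be sketched as follows. Everything here is an assumption made for illustration: the field name `docid`, the use of `$` to make the document itself the single field element (as `$` is used for fragments), and the presence of an `id` feature on the document.

```xml
<fields>
  <!-- assumed: $ designates the document element itself, giving one field -->
  <instances>$</instances>
  <!-- hypothetical field name; it would also need to be listed in fieldNames -->
  <field-name>"docid"</field-name>
  <!-- the field contains this single token -->
  <keyword>@id</keyword>
</fields>
```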