orfeo-treebank/orfeo-importer

orfeo-importer

This program imports texts with linguistic annotations and generates outputs based on selected features of the annotated text. It reads files in CoNLL 2007, Macaon and TEI formats, generally merging information from several files (e.g. dependency trees from CoNLL or Macaon, metadata from TEI, time alignment information from TEI or Macaon). It then produces output in three formats: relAnnis 3.2 for importing into ANNIS; HTML as stand-alone pages for each sample; and index values for Apache Solr for text search, best suited for use with the associated Solr-based web search interface.

This program was created within the project ANR ORFEO. (The project is unrelated to a number of similarly named projects such as the Orfeo ToolBox library.)

Dependencies

Metadata is handled by orfeo-metadata, a Ruby gem in a separate repository, which must be installed before running this importer. The gem contains a default metadata model, but new ones can be defined using a simple column-based text file. See the metadata repository for details. Note: The metadata definitions used by the importer must match those used by the text search interface; otherwise the search interface will not function at all.

The directory data/files includes JavaScript components by other authors:

Configuration files

Default values can be defined for all options. A file can be created for each corpus to define extra information to be displayed.

Default settings

Default values for settings can be defined in a YAML file named settings.yaml in the directory where the importer is run. These values can still be overridden on the command line.

It is particularly advisable to store values that seldom change, like the base URLs of ANNIS and the sample pages, in the YAML file, so that they need not be specified every time the script is invoked.
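
The precedence rule described above (file defaults first, command-line values on top) can be illustrated with a minimal Python sketch. It is not taken from the importer's code: the option names `annis_url` and `samples_url`, the URLs, and the helper `effective_settings` are all hypothetical, and loading the YAML file itself is left out.

```python
from collections import ChainMap


def effective_settings(file_defaults, cli_overrides):
    """Combine settings so that command-line values win over file defaults."""
    # Options not given on the command line are represented as None here
    # (an assumption of this sketch) and must not mask the file defaults.
    given = {key: value for key, value in cli_overrides.items() if value is not None}
    # ChainMap looks up keys in the first mapping, then falls back to the second.
    return dict(ChainMap(given, file_defaults))


# Values as they might be read from settings.yaml (hypothetical keys and URLs).
file_defaults = {
    "annis_url": "http://localhost/annis",
    "samples_url": "http://localhost/samples",
}

# Only one option was supplied on the command line in this example.
cli_overrides = {"annis_url": None, "samples_url": "http://example.org/samples"}

settings = effective_settings(file_defaults, cli_overrides)
```

In this sketch `settings` keeps the ANNIS URL from the file and takes the sample page URL from the command line, which is the behavior the text describes for seldom-changing values.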

Corpus information files

Corpus information files are stored in the directory data/corpora. Each file must be named after the directory from which the corresponding input files are read, e.g. example.txt for information about a corpus read in from a directory named example. The content is read line by line, and the meaning of each line is determined by its position. Currently there are four lines:

  1. Name of the corpus, formatted for readability (e.g. "C-Oral-Rom" rather than "coralrom").
  2. URL to the homepage of the corpus or project
  3. Filename of a logo to be displayed for the corpus
  4. An abstract describing the corpus

Unused lines may be left empty, but they must not be omitted, so that the line numbering is preserved.
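
To make the positional format concrete, here is a minimal Python sketch of how such a file could be read. It is an illustration under stated assumptions, not the importer's own parser: the names `CorpusInfo`, `parse_corpus_info` and `info_path` are hypothetical, as are the sample logo filename and abstract.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CorpusInfo:
    name: str = ""      # line 1: readable corpus name
    homepage: str = ""  # line 2: URL of the corpus or project homepage
    logo: str = ""      # line 3: filename of the logo
    abstract: str = ""  # line 4: abstract describing the corpus


def parse_corpus_info(text: str) -> CorpusInfo:
    """Map the first four lines of a corpus information file to fields."""
    lines = text.splitlines()
    # Pad short files so that missing trailing lines become empty fields.
    lines += [""] * (4 - len(lines))
    name, homepage, logo, abstract = (line.strip() for line in lines[:4])
    return CorpusInfo(name, homepage, logo, abstract)


def info_path(input_dir: str, base: str = "data/corpora") -> Path:
    """Derive the information file path from the input directory name."""
    return Path(base) / (Path(input_dir).name + ".txt")


# Line 2 (the homepage) is left empty but is still present in the file.
sample = "C-Oral-Rom\n\nlogo.png\nA hypothetical abstract.\n"
info = parse_corpus_info(sample)
```

Because the fields are identified only by line position, the empty second line in `sample` is what keeps the logo and abstract in their correct slots.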

It is not obligatory to define corpus information for every corpus, but in the absence of an information file, the corpus information panel in the sample page will be virtually empty.

License

GPL v3; see file LICENSE for full text of the license.
