orfeo-importer

This program imports texts with linguistic annotations and generates outputs based on selected features of the annotated text. It reads files in CoNLL 2007, Macaon and TEI formats, generally merging information from several files (e.g. dependency trees from CoNLL or Macaon, metadata from TEI, time alignment information from TEI or Macaon). It then produces output in three formats: relANNIS 3.2 for importing into ANNIS; HTML as stand-alone pages for each sample; and index values for Apache Solr for text search, best suited for use with the associated Solr-based web search interface.

This program was created within the project ANR ORFEO. (The project is unrelated to a number of similarly named projects such as the Orfeo ToolBox library.)

Dependencies

Metadata is handled by orfeo-metadata, a Ruby gem in a separate repository, which must be installed before running this importer. The gem contains a default metadata model, but new ones can be defined using a simple column-based text file. See the metadata repository for details. Note: The metadata definitions used by the importer must match those used by the text search interface; otherwise the search interface will not function at all.

The directory data/files includes JavaScript components by other authors:

jQuery is used for a lot of things
ProgressBar.js is used for the load progress indicator
Arborator is used to draw dependency trees; it in turn uses Raphaël
HTML5 Audio Read-Along is used for the aligned audio player

Configuration files

Default values can be defined for all options. A file can be created for each corpus to define extra information to be displayed.

Default settings

Default values for settings can be defined in a YAML file named settings.yaml in the directory where the importer is run. These values can still be overridden on the command line.

It is particularly advisable to store values that seldom change, like the base URLs of ANNIS and the sample pages, in the YAML file, so that they need not be specified every time the script is invoked.
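The precedence described above (defaults from the YAML file, overridden by the command line) can be sketched as follows. This is only an illustration, not the importer's actual code: the helper `load_settings` and the keys `annis_url` and `samples_url` are hypothetical names chosen for the example, and the real option names may differ.

```ruby
require 'yaml'

# Hypothetical helper: combine defaults parsed from YAML text with
# options given on the command line. Command-line values take precedence.
def load_settings(yaml_text, cli_options)
  defaults = YAML.safe_load(yaml_text) || {}  # an empty file parses to a falsy value
  defaults.merge(cli_options)                 # later hash wins on key conflicts
end

# Example content for settings.yaml; the key names are assumptions.
yaml_text = <<~YAML
  annis_url: http://localhost/annis
  samples_url: http://localhost/samples
YAML

# Pretend that only samples_url was given on the command line.
cli_options = { 'samples_url' => 'http://example.org/samples' }

settings = load_settings(yaml_text, cli_options)
```

In this sketch, `settings['annis_url']` keeps the value from the file, while `settings['samples_url']` takes the command-line value.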

Corpus information files

Corpus information files are stored in the directory data/corpora. Each file must be named after the directory from which the corpus's input files are read, e.g. example.txt for a corpus read in from a directory named example. The content is read line by line, and the meaning of each line is determined by its position. Currently there are four lines:

Name of the corpus, formatted for readability (e.g. "C-Oral-Rom" rather than "coralrom").
URL to the homepage of the corpus or project
Filename of a logo to be displayed for the corpus
An abstract describing the corpus

Unused lines may be left empty, but they must not be omitted, because the meaning of each line depends on its line number.

It is not obligatory to define corpus information for every corpus, but in the absence of an information file, the corpus information panel in the sample page will be virtually empty.
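To make the positional format concrete, here is a minimal sketch of how such a file could be located and read. It is not taken from the importer: `CorpusInfo`, `corpus_info_path` and `parse_corpus_info` are hypothetical names, and the sample text is invented for illustration.

```ruby
# Hypothetical container for the four positional fields.
CorpusInfo = Struct.new(:name, :url, :logo, :abstract)

# Derive the information file path from the input directory name,
# e.g. a directory named "example" maps to data/corpora/example.txt.
def corpus_info_path(input_dir)
  File.join('data', 'corpora', "#{File.basename(input_dir)}.txt")
end

# Parse the file content by line position. Empty or missing lines
# become nil, so the remaining fields keep their positions.
def parse_corpus_info(text)
  lines = text.lines.map(&:chomp)
  fields = Array.new(4) { |i| lines[i].to_s.strip }
  CorpusInfo.new(*fields.map { |f| f.empty? ? nil : f })
end

# Sample content with the URL and logo lines left empty.
sample = "C-Oral-Rom\n\n\nAn example abstract.\n"
info = parse_corpus_info(sample)
path = corpus_info_path('input/example')
```

Because the two middle lines are present but empty, the abstract is still read from the fourth line; dropping them would shift it into the URL position.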

License

GPL v3; see file LICENSE for full text of the license.
