CLARIN Standards Information System
Please note that this very page reflects mostly the state of the system in 2014. Some changes since then have been reflected below, but proceed with caution.
CLARIN Standard Guidance is a website providing general information about standards used particularly in the areas of linguistics and computer linguistics. Since a lot of standards have been developed for various purposes by many different parties, such a portal is useful to give users guidance in choosing a standard suitable for their needs and to compare standards of similar purposes or topics.
This document summarizes information about the system running the website and its content. It describes the installation steps, the system architecture and features, the definition of the schema used and the data contained in the system. In this documentation, the terms “standard” and “specification” are used interchangeably.
CLARIN Standards Information System (also referred to as "CLARIN Standards Guidance") is a deliverable of a CLARIN-D project conducted at the Institut für Deutsche Sprache (IDS) in Mannheim since 2011 under the supervision of Andreas Witt. The original roles in 2011-2016 were as follows:
- Maik Stührenberg: initial coding and XML schemas; input on some of the features/requirements of the website.
- Antonina Werthmann: content of the website (description of standards and specifications, topics and standard bodies), XML schemas documentation.
- Eliza Margaretha: most of the code, proof-reading of the content, contributed to building the XML schemas.
- Andreas Witt: supervision, advice on the content and features/requirements of the website.
The present documentation was prepared by Antonina Werthmann and Eliza Margaretha and published in several versions between May 2014 and September 2016. Since January 2017, it has become part of the CLARIN-ERIC GitHub repository, available in the space managed by the CLARIN Standards Committee. Since that point, all information on the individual credits and milestones is being recorded in the page history.
For more information, please contact us via the project facility or e-mail Piotr Bański at banski at ids-mannheim dot de.
The present documentation is copyright by its Authors and is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license. The Authors include Antonina Werthmann and Eliza Margaretha (2011-2016) as well as contributors listed in the history of changes to this page and pages dependent on it.
The Standards Information System is copyright by Institut für Deutsche Sprache (IDS) in Mannheim (2011-2016) as well as by contributors to the CLARIN Standards Committee GitHub repository, including the original Authors (Eliza Margaretha, Maik Stührenberg, Antonina Werthmann, Andreas Witt), and is licensed under the terms of the 2-Clause BSD License (BSD-2-Clause). The License file is part of the code distribution.
The SIS software includes Priscilla Walmsley's FunctX XQuery Function Library version 0.1 (2007), distributed under the terms of the GNU Lesser General Public License, version 2.1 (LGPLv2.1).
Up-to-date installation instructions are provided on a separate page.
The system is mainly written in XQuery. The architecture of the system is described in Section 3.1. The system provides features for different kinds of users, which are explained in Section 3.2.
A single XQuery file could contain all the functions for accessing and processing the XML data together with the user interface. Separating the functions from the user interface, however, gives the code a clearer structure and makes the user interface and the application layer independent of each other, so that changes in the application layer do not affect the user interface.
XQuery is a functional programming language in which functions are defined in XQuery modules. To separate the user interface from the application layer, the functions in the application layer are defined only to manage and generate content for the user interface, not to deal with the interface design. The user interface, in turn, contains an abstraction of what the web page should show and, in practice, calls the application layer functions.
The system is designed in an MVC-like (Model-View-Controller) architecture. The MVC components have different file extensions representing their functions: xqm (Model), xq (View), and xql (Controller).
Since the data is written in XML, it is already modeled as a tree structure that can be navigated with XPath. Thus, the Model component is not responsible for modeling the data, but only for direct interactions with the XML data comparable to database queries, such as selecting, storing and updating nodes. It also defines the paths to the data.
The View component is basically the user interface layer. It describes what the user interface should look like and what it should contain.
The Controller component deals with the nodes selected from the XML files by the Model component, navigates through them, and selects more detailed information from them. The Controller is also a mediator performing all the operations between the Model and the View. It manages and generates content for the View.
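The division of labour between the three components can be illustrated outside XQuery. The following Python sketch is only an analogy, not the system's code: the function names and the reduced sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced sample data.
SPEC_XML = """
<spec id="SpecExample" topic="TopicA TopicB">
  <titleStmt>
    <title>Example Standard</title>
    <abbr>EXS</abbr>
  </titleStmt>
  <scope>Illustrative scope text.</scope>
</spec>
"""

# Model: direct access to the XML data, comparable to a database query.
def model_get_spec(xml_text):
    return ET.fromstring(xml_text)

# Controller: navigates the selected node and prepares content for the view.
def controller_spec_summary(spec):
    return {
        "id": spec.get("id"),
        "title": spec.findtext("titleStmt/title"),
        "abbr": spec.findtext("titleStmt/abbr"),
    }

# View: decides only how the prepared content is presented.
def view_render(summary):
    return f"<h1>{summary['title']} ({summary['abbr']})</h1>"

page = view_render(controller_spec_summary(model_get_spec(SPEC_XML)))
print(page)
```

Because the view receives only prepared content, the way the controller gathers it can change without touching the rendering function.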
Users of the website fall into three categories based on their roles: guest, registered user and web-admin. Guests can perform basic operations on the website, such as browsing the standard descriptions or searching for standards by topic, standard body and so on. Registered users and web-admins have the privilege to submit a standard description, which is then turned into an XML file. A standard description submitted by a registered user is stored in the review/ folder (see Section 5) and is to be reviewed by a web-admin. A standard description submitted by a web-admin, however, is stored directly in the specifications/ folder. Since the generated XML file is not validated against its schema (see Section 17.1.1.1), the web-admin should check it personally. Additionally, web-admins can edit the standard descriptions.
The following functions have been implemented in the system:
- User registration
- Login
- Browsing (standards, standard bodies, standard topics)
- Searching for standards
- Submitting standard (including parts and versions)
- Editing standard description (including the standard parts and versions)
- Tag clouds of standards on the homepage
- Tag clouds of relevant keywords on the standard pages
- Standard relation graphs on the standard pages and the standard list page
- Standard body relation graphs
The system database contains all the code running the system, the XML data collection, and the XML schemas structuring the XML data. The database is run by eXist-db. Figure 1 shows the directory tree of the database opened in the Oxygen XML Editor. The root directory is /db/apps/clarin.
The data/ folder contains the collection of documents describing the standards, standard bodies, topics and user information, as well as other documents referred to in the website contents, such as examples of standard applications. The standards, standard bodies, topics and user information are written in XML. The schemas for the XML data are described in Section 14.1.
The doc/ folder contains any kind of document that adds extra information to the standard descriptions. For instance, it contains examples of standard applications, such as annotation in MAF, as well as journal articles, conference papers and reviews about the standards. All the documents in this folder must be legally available for public sharing, so it is important to check their licence or copyright beforehand.
The review/ folder contains the standard descriptions submitted by users. These documents are to be reviewed by an administrator. The administrator should verify the document content before it can be moved to the specifications/ folder.
Figure 1: Database Structure
The specifications/ folder contains various standard descriptions written in XML. The schema for the XML is described in detail in Section 17.1.1.1.
The standards body XML data provides information about each organization or group of experts that develop the standards listed in the specifications/ folder. The description of standard bodies in sbs.xml is defined according to the XML Schema Definition in Section 17.1.1.2.
Topic describes the conceptual subject or area of interest, in which a standard was/is developed, or in which areas its development and use are particularly important. Like keywords, topics help to find similar standards or standards in the same area.
The topics are listed in topics.xml according to the schema described in Section 17.1.1.3.
The users.xml contains information about registered users.
The edit/ folder contains the XQuery controller code for editing standard descriptions. The editing is done via AJAX. JavaScript on the client side handles the editing request and response. The controller receives the editing request from the JavaScript and sends the results back to it. The JavaScript then updates the web page according to the results.
The model/ folder contains the XQuery model code facilitating direct interactions (e.g. select, insert, update) with the XML data.
The modules/ folder contains the XQuery controller code for manipulating XML data and generating the contents of the web pages.
The resources/ folder contains the additional files the system needs, such as images, CSS stylesheets, JavaScript code and XQuery libraries.
The css/ folder contains Cascading Style Sheets for designing the web-pages and the tag-clouds.
The images/ folder contains all the image files used in the system.
The lib/ folder contains the libraries used by the XQuery code.
The scripts/ folder contains JavaScript files for the tag clouds, the graph visualizations and some general functions used in editing standard descriptions. The tag clouds use the TagCanvas library, and the graph visualizations use the D3 library. In addition, TinyMCE is used as the XML editor for writing the description element of the standards, and the Dijit ComboBox for choosing an existing standard body or adding a new organization for the responsible statement element of a standard.
The schemas/ folder contains all the XML schemas structuring the XML data collection. There are three schemas used in the system.
For more information about XML Schema (XSD), please visit the W3C website.
For more information about XML Catalogs, please visit the OASIS website.
The specification schema (spec.xsd) defines the general structure and the elements of the standard XML files, the list of standard topics (topics.xml), and the list of standard bodies (sbs.xml).
The purpose of this part of XML Schema is to define all the standard XML files in the specifications/ folder. The root node of a standard XML file is <spec>.
A standard XML file has the root node <spec>, which has four attributes: @id, @display, @topic and @standardSettingBody. The @id defines the identifier of the standard file and must start with “Spec”. The @topic designates the topic ids of the standard topics. Multiple topic ids are separated with a space. The @standardSettingBody designates the current standard body managing the standard.
The <spec> must contain a <titleStmt>, a <scope>, at least one <info> and at least one <part> or <version>. Optionally, it can also contain the following elements: <keyword>, <features>, <address>, <relation> and <asset>. The elements in <spec> must follow a certain order, namely <titleStmt>, <scope>, <keyword>, <info>, <features>, <address>, <relation>, <part>, <version>, <asset>.
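Since generated files are not validated automatically, a small script can help a web-admin check the top-level rules just described. The Python sketch below is an illustration under assumptions: the function name, the sample documents and the date value are hypothetical, and the check covers only the required children and their order, not the full schema.

```python
import xml.etree.ElementTree as ET

# Child order of <spec> as listed in the text above.
SPEC_CHILD_ORDER = ["titleStmt", "scope", "keyword", "info", "features",
                    "address", "relation", "part", "version", "asset"]

def check_spec(spec):
    """Return a list of problems found among the direct children of <spec>."""
    problems = []
    names = [child.tag for child in spec]
    for required in ("titleStmt", "scope", "info"):
        if required not in names:
            problems.append(f"missing <{required}>")
    if "part" not in names and "version" not in names:
        problems.append("needs at least one <part> or <version>")
    problems += [f"unexpected <{n}>" for n in names if n not in SPEC_CHILD_ORDER]
    ranks = [SPEC_CHILD_ORDER.index(n) for n in names if n in SPEC_CHILD_ORDER]
    if ranks != sorted(ranks):
        problems.append("children are not in the prescribed order")
    return problems

# Hypothetical sample documents with placeholder content.
good = ET.fromstring("""
<spec id="SpecExample">
  <titleStmt><title>Example Standard</title><abbr>EXS</abbr></titleStmt>
  <scope>Placeholder scope.</scope>
  <info type="description">Placeholder description.</info>
  <version id="SpecExample-2014">
    <titleStmt><abbr>EXS-2014</abbr></titleStmt>
    <date>2014</date>
  </version>
</spec>
""")
bad = ET.fromstring("""
<spec id="SpecExample">
  <titleStmt><title>Example Standard</title><abbr>EXS</abbr></titleStmt>
  <info type="description">Placeholder description.</info>
  <scope>Placeholder scope.</scope>
</spec>
""")

print(check_spec(good))
print(check_spec(bad))
```

The first call reports no problems; the second reports the missing <part> or <version> and the misplaced <scope>.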
The element <titleStmt> stands for title statement. The node <titleStmt> contains information about the title <title>, abbreviation <abbr>, and responsible statement <respStmt>. The title and abbreviation are obligatory, whereas the responsible statement is optional. The <title> node is obligatory in the <spec> and <part>, but optional in <version>, because a standard version does not always have a title.
The node <titleStmt> also appears in the <sbs> (see Section 17.1.1.2) and <topic> (see Section 17.1.1.3).
For the tag clouds and relation graphs, abbreviations of standards and versions are necessary. Therefore the <abbr> is obligatory in <spec> and <version> (but not in <part>). However, an abbreviation of a standard or a version is not always available. In this case, an abbreviation must be created for use in the website and must be marked with the attribute @internal set to “yes”. A created version abbreviation should follow the format [part-abbr]-[version-year].
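The naming convention for created abbreviations can be sketched as follows. This is a minimal Python illustration; the helper name and the sample values "EXS" and 2014 are hypothetical.

```python
import xml.etree.ElementTree as ET

def internal_version_abbr(part_abbr, version_year):
    """Build an <abbr> in the format [part-abbr]-[version-year], marked as internal."""
    abbr = ET.Element("abbr", {"internal": "yes"})
    abbr.text = f"{part_abbr}-{version_year}"
    return abbr

abbr = internal_version_abbr("EXS", 2014)
serialized = ET.tostring(abbr, encoding="unicode")
print(serialized)  # <abbr internal="yes">EXS-2014</abbr>
```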
A <titleStmt> can have more than one <respStmt>, so each <respStmt> must have an @id attribute. The @id is needed, for instance, to select which <respStmt> to update or remove. A <respStmt> contains a <resp> and at least one <name>. The <resp> designates the type of responsibility, and its value is restricted to editor, author, publisher, convenor, chair and secretary. The <name> designates the name of the responsible entity or entities and has the attribute @type, which can indicate an organization, with the element <org>, or a person, with the element <person>. If the responsible entity is a standard body listed in sbs.xml, the id of the standard body must be the value of the @id of the <name>. This is necessary to create a link to the standard body page.
A <scope> describes the purpose of a standard and to what extent it is useful. It may be similar to a standard topic, but it is not limited to a pre-defined set of areas.
The <keyword> nodes signify important hints about a standard. The standard abbreviation is not allowed to be a keyword because it creates redundancy in the standard tag cloud.
An <info> has different functions based on its @type. For instance:
- <info type="description"> contains general textual information about the standard.
- <info type="recReading"> contains a bibliography or references to related papers that are recommended reading, defined by the element <biblStruct>.
The <biblStruct> node defines a reference to a paper or a book about the standard. The bibliographic structure of the node is adopted from the TEI P5 Guidelines.
A <features> node describes the technical and formal aspects of a standard. It can specify which metalanguage (SGML vs. XML), grammar class or notation (inline vs. standoff) is used, which constraint language defines the markup language, or any other information relevant to a standard. The description of the feature set adopts the principles of the TEI feature structure representation.
The node <features> has an optional @name attribute for its features name and can contain the elements <fs> or <vColl>.
The <fs> stands for “feature structure” and can be used to represent different kinds of information. The <fs> element has an optional @type attribute, which indicates the type of feature structure it represents. An <fs> element groups a sequence of feature-value pairs together. A feature is defined as an element <f> with a @name attribute indicating the name of the feature and any number of associated values, such as <binary>, <symbol>, <numeric>, and <string>.
The <vColl> element stands for “collection of values”. It allows the encoding of lists, sets and bags (i.e., multisets) of the values.
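To make the feature-structure encoding more concrete, the Python sketch below reads a small <features> block into a dictionary. The sample is an assumption modeled on TEI conventions (a @value attribute on <symbol>, text content in <string>); the feature names and values are hypothetical and the actual SIS schema may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; feature names and values are illustrative only.
FEATURES_XML = """
<features name="technical">
  <fs type="markup">
    <f name="metaLanguage"><symbol value="XML"/></f>
    <f name="notation">
      <vColl><symbol value="inline"/><symbol value="standoff"/></vColl>
    </f>
    <f name="constraintLanguage"><string>XML Schema</string></f>
  </fs>
</features>
"""

def read_features(features):
    """Map each feature name to the list of its values."""
    result = {}
    for f in features.iter("f"):
        values = []
        for node in f.iter():
            # Skip the <f> itself and the <vColl> wrapper; keep the actual values.
            if node is f or node.tag == "vColl":
                continue
            value = node.get("value")
            values.append(value if value is not None else (node.text or "").strip())
        result[f.get("name")] = values
    return result

features = read_features(ET.fromstring(FEATURES_XML))
print(features)
```

A feature wrapped in <vColl> simply yields a list with several values, while a plain feature yields a list with one.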
An <address> element refers to a URL or a postal address.
A <relation> node describes an association between two standards or standard versions. A <relation> has two attributes: @type signifying the kind of relation such as isVersionOf and @target signifying the target of the relation. The relation types are defined in the XML Schema (see Section 17). The <relation> nodes are used to create the standard relation graphs.
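How <relation> nodes translate into graph edges can be sketched in a few lines. The Python example below is an illustration, not the system's D3 pipeline: the sample document and the function name are hypothetical, and only the isVersionOf type named above is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; other required children are omitted for brevity.
SPEC_XML = """
<spec id="SpecExample">
  <version id="SpecExample-2011">
    <relation type="isVersionOf" target="SpecExample"/>
  </version>
  <version id="SpecExample-2014">
    <relation type="isVersionOf" target="SpecExample"/>
  </version>
</spec>
"""

def relation_edges(spec):
    """Collect (source id, relation type, target id) triples in document order."""
    edges = []
    for node in spec.iter():
        if node.tag in ("spec", "version"):
            for rel in node.findall("relation"):
                edges.append((node.get("id"), rel.get("type"), rel.get("target")))
    return edges

edges = relation_edges(ET.fromstring(SPEC_XML))
for edge in edges:
    print(edge)
```

Each triple corresponds to one directed edge, which is the shape of input a graph visualization typically expects.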
A standard is sometimes divided into several parts. The <part> node describes a standard part. A part must have an <id> and a <title>. Besides, it can have the other elements that a <spec> can have. However, it cannot have a sub-part, so it must not contain any <part>. Instead, a part must have at least one version.
A standard is typically published more than once, because it may be improved over time. Each publication delivers a new standard version. A <version> node describes a standard version. A <version> has the attribute @id, whose value must start with “Spec” like a standard id, and the attribute @status, indicating the current status of the version, such as working draft, final and recommendation.
A <version> must have an <abbr> in its <titleStmt> and a published date <date>. Besides, it can contain one or more optional nodes: <versionNumber>, <info>, <features>, <address>, <relation> and <asset>. A <versionNumber> can have @type major or minor. A major version number usually corresponds to a major revision with significant changes in the standard version, whereas a minor one contains only small changes.
The <asset> node of a standard lists links referring to some standard documents, such as examples of standard applications.
A standard body XML file has the root node <sbs>, which stands for Standard Body Set. It describes all the standard bodies whose ids are referred to in the attribute @standardSettingBody in <spec>.
The <sbs> root element contains child elements <sb> describing information about each standard body individually.
The <sb> has three attributes: @id, @type and @display. The @id defines the identifier for the standard body. It must start with “SB” and in a normal case should be complemented with the short name or acronym of a standard organization, for example “SBISO” for International Organization for Standardization or “SBW3C” for World Wide Web Consortium.
The <sb> must contain a <titleStmt> and an <info>. Optionally it can also contain the elements <address> and <relation>. The elements in <sb> must follow a certain order, namely <titleStmt>, <info>, <address>, <relation>.
Not only a standard organization can be defined as an <sb>, but also a technical committee, a subcommittee or a working group within a standard organization. This can be indicated in the optional @type attribute. A relation between standard organizations can be specified in a <relation> element.
The @display attribute values “hide” and “show” determine whether the information about the standard body is shown on the web page. For instance, a standard body can be hidden while its information is still incomplete.
Each standard can be assigned to one or more topics. These topics should be listed in the <spec> element as the value of the attribute @topic. Multiple topics must be separated with a space character. By means of these topics, standards of similar topics can be grouped together.
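Grouping by topic amounts to splitting the space-separated @topic value. A minimal Python sketch, with hypothetical standard and topic ids:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical <spec> roots; children are omitted for brevity.
specs = [
    ET.fromstring('<spec id="SpecA" topic="TopicX TopicY"/>'),
    ET.fromstring('<spec id="SpecB" topic="TopicY"/>'),
    ET.fromstring('<spec id="SpecC" topic="TopicX"/>'),
]

def group_by_topic(spec_elements):
    """Map each topic id to the ids of the standards assigned to it."""
    groups = defaultdict(list)
    for spec in spec_elements:
        for topic_id in (spec.get("topic") or "").split():
            groups[topic_id].append(spec.get("id"))
    return dict(groups)

groups = group_by_topic(specs)
print(groups)
```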
The topic XML file has the root node <topics> with child elements <topic>. The element <topic> includes the information about each topic individually and has a mandatory attribute @id. The attribute defines the identifier for the topic, which must start with “Topic” and should be complemented with the short name or acronym of the topic name, for example “TopicSemAnn” stands for Topic Semantic Annotation. A <topic> must include the elements <titleStmt> and <info>.
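The identifier prefixes required for standards, versions, standard bodies and topics lend themselves to a quick consistency check. The following Python sketch is an assumption-laden illustration: the function name is hypothetical and the sample documents omit the mandatory child elements for brevity.

```python
import xml.etree.ElementTree as ET

# Required id prefix per element, as stated in the schema descriptions.
ID_PREFIX = {"spec": "Spec", "version": "Spec", "sb": "SB", "topic": "Topic"}

def invalid_ids(root):
    """Return (element name, id) pairs whose id lacks the required prefix."""
    bad = []
    for node in root.iter():
        prefix = ID_PREFIX.get(node.tag)
        if prefix is not None and not (node.get("id") or "").startswith(prefix):
            bad.append((node.tag, node.get("id")))
    return bad

topics = ET.fromstring('<topics><topic id="TopicSemAnn"/><topic id="SemAnn"/></topics>')
sbs = ET.fromstring('<sbs><sb id="SBISO"/><sb id="SBW3C"/></sbs>')

print(invalid_ids(topics))  # [('topic', 'SemAnn')]
print(invalid_ids(sbs))     # []
```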
The search/ folder contains the XQuery view code for searching standard descriptions.
The user/ folder contains the XQuery view code for user registration and login.
The views/ folder contains the XQuery view code defining the web pages.
The only controller XQuery needed, for example for URLRewriting, is controller.xql.
The index.xq manages the web homepage.
The website already has a web-admin account and a test user account. The credentials of the accounts can be obtained from Antonina Werthmann (werthmann at ids-mannheim dot de). Although the website provides a registration feature for new users, a new web-admin account has to be added manually: set the email address and the MD5-encoded password in the user XML data (see Section 4.1.6).
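When adding an account by hand, the MD5 value of the password has to be computed first. The Python sketch below shows one way to do so. That the system expects the lowercase hexadecimal form of the digest is an assumption, and the sample password is a placeholder.

```python
import hashlib

def md5_hex(password):
    """Return the MD5 digest of the UTF-8 encoded password as a hex string."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()

digest = md5_hex("example-password")  # placeholder password
print(digest)
```

The resulting 32-character string is the value to place in the user XML data, provided the stored format matches the assumption above.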
The following tasks are planned to be done in the future:
- Expansion of the standard collection in the specifications/ folder with descriptions of further standards or best-practice guidelines used in the CLARIN-D project and their relations to the project. For example, the collection lacks standards for linguistic annotation, metadata annotation, data retention, data archiving and so on.
- Extension and update of the existing standard descriptions including filling any information gaps that may exist.
- Expansion of the collection of the examples in the doc/ folder for the existing standards with direct links to them.
- Addition of actual bibliography entries for the standards.
- Description and completion of missing information for all existing topics.
- Addition of new topics.
- Continuous monitoring, maintenance and updating of links to external websites.
- Extensive testing to make sure that each function works properly.