Eberhard Beilharz edited this page May 30, 2017 · 13 revisions

SIL.WritingSystems contains classes for managing and persisting writing systems. It defines a writing system model that is based on the Locale Data Markup Language (LDML) format. It provides functionality for reading and writing the model to LDML. It also provides classes for managing a collection of writing system definitions. Writing systems are identified using IETF language tags. Lastly, the library provides an API for accessing the SIL Locale Data Repository (SLDR).

IETF Language Tags

An IETF language tag is an abbreviated language code that is defined by the IETF in BCP 47. A language tag consists of subtag codes separated by hyphens. In SIL.WritingSystems, language tags are represented as normal strings. The library provides the IetfLanguageTag class, which contains methods for parsing and generating language tags in a consistent manner. This design is similar to the way that file paths are handled in the .NET Class Library. The StandardSubtags class provides access to the lists of valid subtags that can be used to generate a language tag. The WellKnownSubtags class contains constants for some often-used subtag codes.

Subtags

SIL.WritingSystems contains classes for the four major types of subtags: LanguageSubtag, ScriptSubtag, RegionSubtag, and VariantSubtag. Subtag is the abstract base class for these classes. Subtag objects are implicitly converted to strings and vice versa. According to BCP 47, only subtags that are defined in the IANA Subtag Registry can be used to define the language, script, region, and variant for a language tag. SIL.WritingSystems implements a convention that allows private-use subtags to be used for these parts as well. To define a private-use language subtag, "qaa" is used as the language code and the custom language code is placed in the private-use area of the language tag. The same convention is used for script and region subtags: "Qaaa" is used as the script code and "QM" is used as the region code. Private-use variant subtags are simply placed at the end of the private-use area. An example of the convention in use is "qaa-Qaaa-QM-fonipa-x-kal-Fake-TT-bogus", where "kal" is the private-use language code, "Fake" is the private-use script code, "TT" is the private-use region code, and "bogus" is the private-use variant code.
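The private-use convention can be illustrated with a short sketch. The Python below is not the library's API (the library itself is a .NET assembly); the function name split_private_use and the simplified subtag detection are assumptions made for illustration. It assumes that the private-use area lists the language, script, and region codes in that order, as in the example above, and it ignores extended language and extension subtags.

```python
def split_private_use(tag):
    """Resolve the "qaa" / "Qaaa" / "QM" placeholders of a language tag.

    Hypothetical helper for illustration only; not part of SIL.WritingSystems.
    """
    # Everything after "-x-" is the private-use area.
    main, _, private = tag.partition("-x-")
    subtags = main.split("-")
    private_subtags = private.split("-") if private else []

    # Simplified positional detection: language, then an optional
    # four-letter script, then an optional two-letter region.
    language = subtags.pop(0)
    script = region = None
    if subtags and len(subtags[0]) == 4 and subtags[0].isalpha():
        script = subtags.pop(0)
    if subtags and len(subtags[0]) == 2 and subtags[0].isalpha():
        region = subtags.pop(0)
    variants = subtags  # whatever is left in the main part

    # Each placeholder consumes the next private-use subtag, in order.
    if language.lower() == "qaa" and private_subtags:
        language = private_subtags.pop(0)
    if script is not None and script.lower() == "qaaa" and private_subtags:
        script = private_subtags.pop(0)
    if region is not None and region.upper() == "QM" and private_subtags:
        region = private_subtags.pop(0)

    # Remaining private-use subtags are private-use variants.
    return {
        "language": language,
        "script": script,
        "region": region,
        "variants": variants + private_subtags,
    }
```

Applied to "qaa-Qaaa-QM-fonipa-x-kal-Fake-TT-bogus", the sketch yields "kal", "Fake", and "TT" as the language, script, and region, with "fonipa" and "bogus" as the variants.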

Language tags can be created from subtag objects using the IetfLanguageTag.Create method. Conversely, a language tag can be parsed into subtag objects using the IetfLanguageTag.TryGetSubtags method. The TryGetSubtags method also returns implied subtags. For example, if you call TryGetSubtags on "en-US", it will return "Latn" as the script subtag even though it is not explicit in the language tag. TryGetSubtags recognizes the private-use subtag convention mentioned earlier and will return a private-use language, script, or region subtag if the language tag includes one. For example, TryGetSubtags will return "kal" as the private-use language subtag for the language tag "qaa-x-kal". Language tags can also be created and parsed using strings instead of subtag objects. To create a language tag from strings, pass the parts of the language tag to the string-based overload of the IetfLanguageTag.Create method. Conversely, a language tag can be parsed into its constituent string parts by using the IetfLanguageTag.TryGetParts method.

Canonical Language Tags

It is possible for different language tags to have the same meaning. For example, "en-US" and "en-Latn-US" both represent American English. Language tags are also case insensitive. For example, "en-US" and "EN-us" are equivalent. In order to facilitate the comparison of language tags, the IetfLanguageTag.Canonicalize method provides the ability to canonicalize language tags. When a language tag is canonicalized, any implicit script codes are removed and standard capitalization is used. Implicit script codes are defined in the "alltags.txt" file that is provided by the SLDR. It should also be noted that any language tags created using IetfLanguageTag.Create will be automatically canonicalized.
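The two canonicalization steps described above (dropping an implicit script code and applying standard capitalization) can be sketched as follows. This is an illustrative Python sketch, not the IetfLanguageTag.Canonicalize implementation: the IMPLICIT_SCRIPTS table is a hypothetical one-entry stand-in for the data in alltags.txt, the subtag detection is simplified, and leaving private-use subtags unchanged is an assumption.

```python
# Hypothetical excerpt; the real implicit-script data comes from alltags.txt.
IMPLICIT_SCRIPTS = {"en": "Latn"}


def canonicalize(tag):
    """Return a simplified canonical form of a language tag (sketch only)."""
    subtags = tag.split("-")
    language = subtags[0].lower()          # language codes are lowercase
    result = [language]
    for index, subtag in enumerate(subtags[1:], start=1):
        if subtag.lower() == "x":
            # Assumption: copy the private-use area through unchanged.
            result.extend(subtags[index:])
            break
        if len(subtag) == 4 and subtag.isalpha():
            script = subtag.title()        # script codes are title case
            if IMPLICIT_SCRIPTS.get(language) != script:
                result.append(script)      # keep only non-implicit scripts
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())  # region codes are uppercase
        else:
            result.append(subtag.lower())  # variants and other subtags
    return "-".join(result)
```

With this table, both "EN-us" and "en-Latn-US" reduce to "en-US", so the two tags can be compared with an ordinary string comparison after canonicalization.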

Resources

All of the language code information is contained in a set of text files stored in the Resources folder. These files should be periodically updated to stay consistent with the source data. They can be updated using the LanguageData command line app.

alltags.txt

The alltags.txt file contains a list of all of the language tags that are available on the SLDR. The file also contains information about the relationships between the language tags. The SLDR API in the SIL.WritingSystems assembly will automatically download the latest version of the file from SLDR's GitHub repository. The copy in the Resources folder is provided for offline access. The latest version of the file can also be downloaded manually from the SLDR repository. The file format is documented on the SLDR wiki.

ianaSubtagRegistry.txt

The ianaSubtagRegistry.txt file contains the list of all valid BCP 47 subtags. The latest version of the file can be obtained from the IANA website.

LanguageIndex.txt

The LanguageIndex.txt file contains indexing information for looking up Ethnologue (ISO 639-3) codes. This data consists of the list of ISO 639-3 language codes, the countries in which they are spoken, and all alternate language names. The latest version of this file can be downloaded from the Ethnologue website.

LanguageDataIndex.txt

The LanguageDataIndex.txt file is generated by LanguageData from the other files. See LanguageData for more details.

TwoToThreeCodes.txt

This file contains the mapping between three-letter (ISO 639-3) codes and two-letter (ISO 639-1) codes. It is generated from the ISO 639-3 code table using the following procedure:

  1. Discard all columns except the first (three-letter codes) and the fourth (two-letter codes).
  2. Discard all rows where the fourth column is empty.
  3. Swap the two remaining columns.
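
The three steps above can be sketched in a few lines of Python. This is only an illustration of the procedure, not the tool used to produce the file. It assumes the code table is tab-delimited, and the function name two_to_three is hypothetical. Note that a header row, if present, has a non-empty fourth column and would need to be skipped separately.

```python
def two_to_three(lines):
    """Build (two-letter, three-letter) pairs from ISO 639-3 table rows."""
    pairs = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 4:
            continue  # malformed or blank row
        # Step 1: keep only the first and fourth columns.
        three_letter, two_letter = columns[0], columns[3]
        # Step 2: drop rows that have no two-letter code.
        if not two_letter:
            continue
        # Step 3: swap the columns so the two-letter code comes first.
        pairs.append((two_letter, three_letter))
    return pairs
```

For example, a row for English ("eng" in the first column, "en" in the fourth) becomes the pair ("en", "eng"), while a row whose fourth column is empty is discarded.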

Writing System Model

All of the classes in the writing system model, except keyboards, are observable (using the System.ComponentModel.INotifyPropertyChanged and System.Collections.Specialized.INotifyCollectionChanged interfaces), mutable (using the SIL.ObjectModel.IValueEquatable interface for value equality semantics), cloneable (using the SIL.ObjectModel.ICloneable interface), and change-tracking (using the System.ComponentModel.IChangeTracking interface). DefinitionBase is the base class for all of these classes.

Writing Systems

A writing system is defined by the WritingSystemDefinition class. It contains all of the data that defines a writing system, such as language tag, fonts, collations, keyboards, character sets, quotation marks, etc.

A writing system definition is uniquely identified by an IETF language tag. The LanguageTag property defines the writing system's current language tag. WritingSystemDefinition also contains an Id property, which is used to identify the writing system in a writing system repository. The Id property is a language tag as well, but does not necessarily correspond to the writing system's current language tag. The Id and LanguageTag properties can be different if the language tag has changed, but the writing system definition has not been reset in the writing system repository yet. The Id should be used when interacting with the writing system repository.

WritingSystemDefinition provides properties (Language, Script, Region, Variants, IsVoice, and IpaStatus) for easily updating the various parts of the language tag in a consistent manner. The language tag is not validated when these properties are changed. This allows the language tag to be freely changed without throwing exceptions. Once all changes are finished, the ValidateLanguageTag method should be called to determine if the language tag is valid.

Collations

A collation is defined by the CollationDefinition abstract base class. Collation definitions have a Collator property, which provides access to a class that can perform collation operations. Collation definitions can be validated by calling the Validate method. If the collation definition is invalid, an exception will be thrown when its collator is accessed. Collation definitions are uniquely identified in a writing system definition by the Type property. There are three concrete collation definition classes: SimpleRulesCollationDefinition, IcuRulesCollationDefinition, and SystemCollationDefinition.

SimpleRulesCollationDefinition implements support for Toolbox-style collation rules. When this class validates the simple collation rules, it will convert the rules to ICU rules using the SimpleRulesParser class.

IcuRulesCollationDefinition implements support for ICU-style collation rules. This collation definition class can import rules from an ICU rules collation defined in the same writing system or a different writing system. In order to support collation imports, an instance of this class requires a reference to the owning writing system definition and a writing system factory. An IcuRulesCollationDefinition instance with no rules specified will use the default (root) sort order.

SystemCollationDefinition implements support for collation using the system sort order for a particular locale. The locale is specified using a language tag. A writing system definition can only have one system collation defined and it must be set as the default collation. This is because system collations are considered an application setting and the current persistence mechanism only supports a single system collation definition.

Fonts

A font is defined by the FontDefinition class. Font definitions are uniquely identified in a writing system definition by the Name property.

Character Sets

A character set is defined by the CharacterSetDefinition class. Character sets are used to define the list of characters that are valid for a writing system. Character set definitions are uniquely identified in a writing system definition by the Type property.

Keyboards

A keyboard definition is different from the other definition classes in the writing system model. It does not extend the DefinitionBase class. Instead, it is defined by the SIL.Keyboarding.IKeyboardDefinition interface. Keyboard definitions are accessed and created using the keyboard controller assigned to the static Controller property on the SIL.Keyboarding.Keyboard class. There can be only one instance of any keyboard definition, so reference equality can be used for comparing keyboard definitions. The format and URLs for a keyboard definition are set when an application creates a keyboard definition. Keyboard definitions are uniquely identified in a writing system definition by the Id property.

Writing System Repositories

Collections of writing system definitions are managed by writing system repositories. A writing system repository is defined by the IWritingSystemRepository interface. Both generic and non-generic versions of the interface are provided. The generic version is used by applications that wish to extend the WritingSystemDefinition class and still have a strongly-typed repository API. The IWritingSystemRepository interface provides methods and properties for adding, removing, conflating, and getting writing systems. As mentioned earlier, writing system definitions are identified in a repository by the Id property. Even though the language tag for a writing system can be updated at any time, the Id is only updated when the IWritingSystemRepository.Set method is called on the writing system definition. This allows the repository to update its internal mapping of IDs to writing systems. An abstract base class, WritingSystemRepositoryBase, is provided to simplify the creation of new repository implementations. There is also an ILocalWritingSystemRepository interface, which is used to implement a repository that updates a global repository whenever it is saved. There are two primary writing system repository implementations: LdmlInFolderWritingSystemRepository and GlobalWritingSystemRepository.
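
The relationship between the Id and the language tag can be modeled with a toy example. The Python classes below are hypothetical stand-ins (ToyDefinition, ToyRepository), not the .NET types, and only mimic the behavior described above: the language tag can change freely, while the Id changes only when the definition is set in the repository.

```python
class ToyDefinition:
    """Stand-in for a writing system definition (illustration only)."""

    def __init__(self, language_tag):
        self.language_tag = language_tag  # can change at any time
        self.id = None                    # assigned by the repository


class ToyRepository:
    """Stand-in for a repository that maps IDs to definitions."""

    def __init__(self):
        self._by_id = {}

    def set(self, definition):
        # Remove the old mapping, then re-key under the current language tag.
        if definition.id is not None:
            self._by_id.pop(definition.id, None)
        definition.id = definition.language_tag
        self._by_id[definition.id] = definition

    def contains(self, definition_id):
        return definition_id in self._by_id

    def get(self, definition_id):
        return self._by_id[definition_id]


repo = ToyRepository()
ws = ToyDefinition("en")
repo.set(ws)

ws.language_tag = "en-US"          # the Id is still "en" at this point
stale_id = ws.id
found_by_old_id = repo.contains("en")

repo.set(ws)                       # now the Id catches up with the tag
```

Between the tag change and the second call to set, the definition must still be looked up by its old Id, which is why the Id rather than the language tag should be used when interacting with a repository.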

The LdmlInFolderWritingSystemRepository class is used to manage a folder of LDML files. Each LDML file represents a separate writing system. This implementation keeps a log of the changed writing system IDs in the "idchangelog.xml" file. Removed writing systems are not deleted, but moved to the "trash" folder in the repository folder.

The GlobalWritingSystemRepository class is used to manage a shared folder of LDML files. This provides a mechanism to keep writing systems in sync across multiple local repositories. For example, a WeSay project has a local writing system repository. Whenever a writing system is updated in the WeSay project, the changes are propagated to the global writing system repository on the machine. A FieldWorks project, which has its own local writing system repository, would see that the writing system has changed in the global repository and pull down a copy of the updated writing system to its local repository.

Writing System Factories

Writing system definitions are created using writing system factories. A writing system factory is defined by the IWritingSystemFactory interface. Similar to IWritingSystemRepository, there is both a generic and non-generic version of the interface. In addition, there is an abstract base class, WritingSystemFactoryBase, provided for convenience. There are three primary implementations of the interface: WritingSystemFactory, SldrWritingSystemFactory, and LdmlInFolderWritingSystemFactory.

WritingSystemFactory provides a simple implementation of the factory interface. It has the ability to create a writing system from an LDML template. The location of the LDML templates is specified using the TemplateFolder property.

SldrWritingSystemFactory extends WritingSystemFactory with the additional capability to download an LDML file from the SLDR to use as a template prior to looking in a template folder. This is the default writing system factory used by GlobalWritingSystemRepository.

LdmlInFolderWritingSystemFactory further extends SldrWritingSystemFactory with the ability to use LDML files from a local and a global repository as a template when creating a writing system definition. This is the default writing system factory used by LdmlInFolderWritingSystemRepository. This implementation looks for an LDML file to use as a template in the following order:

  1. Local writing system repository
  2. Global writing system repository
  3. SLDR (could get a cached LDML file if the SLDR is not available)
  4. Template folder
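
The lookup order above is a plain first-match fallback chain, which can be sketched as follows. The Python below is an illustration only: find_template and the dictionary-backed sources are hypothetical, and the real factory works with LDML files rather than in-memory strings.

```python
def find_template(language_tag, sources):
    """Return (source name, template) from the first source that has one."""
    for name, lookup in sources:
        template = lookup(language_tag)
        if template is not None:
            return name, template
    return None, None  # no source could supply a template


# Hypothetical in-memory stand-ins for the four sources, in priority order.
local_repo = {}
global_repo = {"fr": "<ldml>global fr</ldml>"}
sldr = {"fr": "<ldml>sldr fr</ldml>", "de": "<ldml>sldr de</ldml>"}
template_folder = {"en": "<ldml>template en</ldml>"}

sources = [
    ("local repository", local_repo.get),
    ("global repository", global_repo.get),
    ("SLDR", sldr.get),
    ("template folder", template_folder.get),
]
```

Here a request for "fr" is satisfied by the global repository even though the SLDR also has an entry, because the global repository is consulted first.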

Persistence

Writing system definitions are persisted using a combination of LDML and application settings files. SIL.WritingSystems provides a class for reading and writing LDML data and an interface for reading and writing writing system properties to application settings.

LDML

LDML is the standard XML format for representing locale data. The LDML format supports most of the data that an application would need for a locale, including localized display names for locales, number/currency formatting, date/time formatting, collations, and exemplar character sets. LDML also supports custom extensions. SIL has defined a set of extensions to LDML for information that is specific to SIL's needs. These extensions are defined in a special SIL namespace. This namespace is a unification of various legacy namespaces (fw, palaso, and palaso2) and is the preferred namespace for all future SIL applications. The SIL namespace defines extensions for supporting various collation rule styles and external resources (fonts, keyboards, spell checking dictionaries, and transforms).

The LdmlDataMapper class provides the capability to load a writing system definition from an LDML file or to save a definition as an LDML file. It supports in-place saving of LDML files so that elements of the LDML file that are not contained in a writing system definition are properly round-tripped. It also supports most of the extensions in the SIL namespace.
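
The idea behind in-place saving can be shown with a minimal sketch: parse the existing document, change only the elements the model knows about, and write everything else back untouched. The Python below uses the standard xml.etree.ElementTree module and is not how LdmlDataMapper is implemented; the function name save_in_place and the "unknown" element are made up for illustration, and only the language code in the identity element is updated.

```python
import xml.etree.ElementTree as ET


def save_in_place(existing_xml, language_code):
    """Update the identity language while keeping unrelated elements."""
    root = ET.fromstring(existing_xml)

    # Find or create the elements that the model owns.
    identity = root.find("identity")
    if identity is None:
        identity = ET.SubElement(root, "identity")
    language = identity.find("language")
    if language is None:
        language = ET.SubElement(identity, "language")
    language.set("type", language_code)

    # Elements the model does not know about are never touched,
    # so they survive the save.
    return ET.tostring(root, encoding="unicode")


existing = (
    '<ldml><identity><language type="en"/></identity>'
    '<unknown keep="yes"/></ldml>'
)
updated = save_in_place(existing, "fr")
```

Writing a fresh file from the model alone would drop the "unknown" element; updating the parsed document keeps it.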

Application Settings

Most of the data contained in a writing system definition can be represented in LDML and the SIL LDML extensions. There are some properties of a writing system definition that are considered application-specific and are not appropriate for storing in a standard format like LDML. SIL.WritingSystems provides an extensible mechanism for persisting these properties. An application must implement the ICustomDataMapper interface and pass an instance of the implementing class to a writing system repository in order to support persisting application-specific properties. The SIL.Lexicon assembly provides standard implementations of this interface that can be used by lexicon applications. The following are the writing system definition properties that are considered application-specific:

  • abbreviation
  • language name
  • script name
  • region name
  • spell checking ID
  • legacy mapping
  • local keyboard
  • system collation locale
  • known keyboards
  • default font name
  • default font size
  • Graphite enabled flag

SLDR API

SIL.WritingSystems provides an API for accessing the SLDR web service. For more information on the SLDR API, see this document.