% !TEX root = thesis.tex
\section{Low Resource Languages}
\label{sec:endlang}
In this section, I outline the state of low resource languages. First I define contrasting and distinct terms which are often used to describe these languages. Then, I talk about metrics used to judge a language's vitality, before moving on to discuss digital presence.
% TODO: Note, Alexis highlighted 'they', and I'm not sure why.
% Note: I removed linguistic diversity as a central concern
\subsection{Definitions}
Before going further, it makes sense to define what the terms \emph{endangered}, \emph{minority}, \emph{low} and \emph{under-resourced}, and other terms like \emph{threatened} mean when they refer to a language. Each has slightly different meanings in different contexts, and according to the scale and metric applied.
In this section, I generally define these terms: \textit{endangered}, \textit{moribund}, \textit{extinct}, \textit{dormant}, \textit{revitalised}, \textit{historic} and \textit{constructed} languages; \textit{minority}, \textit{low resource}, \textit{under resourced}, \textit{incident} and \textit{surprise} languages; and finally \textit{computer} or \textit{computational} languages. This helps inform why I have chosen to focus on low resource languages, and specifically low resource natural languages with living populations.
All of these terms could be controversial in certain contexts, and work within larger frameworks and ontologies. I cover some of these frameworks in Section~\ref{subsec:metrics}.
\subsubsection{Endangered, revitalised, and extinct languages}
\emph{Endangered} languages are human languages that are in danger of extinction. The term is borrowed from the scientific literature describing biological species; just as there exists a very real possibility that one day there will be no more Australasian Bittern specimens in the wilds of Australia, it is also possible that one day there may be no speakers of Guugu Yimithirr. The analogy is not complete: we can still read Tocharian texts, yet Tocharian is considered not a living language but \textit{extinct}, as there are no speakers who use it regularly (other than scholars of obscure dead languages).
A language would be considered {\it endangered} when it can be assumed that children will stop learning the language in the next hundred years (according to \citet{krauss92}). Endangered languages - as compared to critically endangered or moribund languages (see below) - are normally languages which have a large number of speakers, and crucially whose speakers are still teaching the language to children. Children ensure that the language will live on to the next generation, and when this chain breaks, it is almost impossible to resurrect a language. Endangerment can be difficult to judge, as the rate of deterioration can be high. For instance, Breton had over a million speakers in 1950, but today the numbers may be as low as 200,000. Its future is uncertain.
\emph{Moribund} languages are languages which are {\it critically endangered}, in that there are no children currently learning the language and using it frequently, although there are speakers. Ainu is a good example, with roughly ten native speakers still living, all of whom are over 80 years old,\footnote{\href{https://www.ethnologue.com/language/ain}{https://www.ethnologue.com/language/ain}. \last{May~2}} although there are some struggling efforts to revive it \citep{hanks2017policy}. On the other side of the northern Pacific, Haida has a similar number of native speakers, but because of the recent development of immersion programmes, government-funded schools, and new domains for the language, it is not considered moribund. An example of a new domain for Haida would be a recent motion picture filmed entirely in Haida with ethnically Haida actors who learned their lines from the elders.\footnote{\href{https://www.nytimes.com/2017/06/11/world/americas/reviving-a-lost-language-of-canada-through-film.html}{https://www.nytimes.com/2017/06/11/world/americas/reviving-a-lost-language-of-canada-through-film.html}. \last{May~2}}
\emph{Dormant} or \textit{sleeping} languages are a stage beyond moribund languages. They have no living fluent speakers. This does not mean that the language is extinct. An example would be Mutsun, an Ohlone or Costanoan language formerly spoken near San Juan Bautista, California, whose last known fluent speaker Ascensi\'on Sol\'orsano passed away in 1930. However, in the late 90s, the Mutsun people (recognised formally as the Amah Mutsun Tribal Band) began a revitalisation project using the extensive documentation left behind by linguists, anthropologists, and a Catholic mission priest, and now there are several conversational (albeit no fluent) speakers \citep{warner2007ethics}. Ethnologue defines `dormant' as a language which has no speakers, but there is still a community that attaches its ethnic identity to the language \citep{lewis2010assessing}.
Often, dormant languages only come to attention when they are considered a \textit{revitalised} language. As \citet{warner2007ethics} notes, ``Daryl Baldwin did indeed teach himself his then-dormant ancestral language, Myaamia, and is now raising his children largely in the language \citep{hinton2001sleeping, leonard2004acquisition}.'' Before Baldwin's work, Myaamia would have been considered a dormant language. Another example would be Manx, which lost all of its native speakers (the last being Ned Maddrell, who died in 1974 \citep{wilson2008revitalization}), but retained a score of second language speakers until today, when there are now immersion programmes for children and over a thousand speakers of the language \citep{clague2009manx}. Between 1974 and a vague point somewhere in the past couple of decades where a child could consider Manx as their first language, the language was dormant; now, however, it is revitalised.
The most famous example of a revitalised language is Hebrew, with a speaking population of over eight million,\footnote{\href{https://www.ethnologue.com/language/heb}{https://www.ethnologue.com/language/heb}. \last{May~2}} which was a {\it literary} language (used mainly in relation to written texts) until revitalisation efforts began in the late 19th and early 20th centuries, ahead of the creation of the Israeli state, where it is now an official language and not in a state of endangerment. Hebrew is a good example of why terms which are often deemed to be synonymous, such as `endangered' and `revitalised', should be considered as differentiable.
While on the subject of Hebrew, it is worth mentioning that the initial efforts to revitalise it were often maligned by both Jewish communities and linguists, for a variety of reasons. First, the Jewish faith had traditionally viewed Hebrew as a holy tongue, and many religiously conservative Jews objected to the sacrilegious use of it for day-to-day matters, preferring Aramaic or Yiddish. Many also objected on the grounds that its use was connected to Zionism (why is well beyond the scope of this thesis). But most pertinently, linguists objected because they viewed revitalisation as an impossibility. If the language was dead, then it would be impossible to accurately bring it back, as literary texts are not sufficient for adequately capturing all of the intricacies of a language and how it is used. Clearly, with millions of first language speakers, this is no longer a valid point. These critics now submit that modern Hebrew is an imperfect descendant of historical Hebrew, which remains extinct, and that it reflects creolisation rather than language revitalisation (as \citet{kornai2013digital} does, citing \citet{bickerton2016roots,izreel2003emergence}), and they are likely right to do so. Revitalisation is not always a clear process.
This is especially true for \textit{constructed} languages, which are \textit{a priori} languages invented by a linguist or a community without a historical speaking community or lineage. These may be created to be logically resistant to ambiguity (such as Loglan or Lojban \citep{okrent2009land}); for a specific artistic purpose (such as Na'vi or Klingon, meant to be spoken by aliens in science fiction \citep{schreyer2015digital, schreyer2011media}); for scientific study (such as those used by evolutionary linguists for language games with participants to discern how language might have evolved \citep{scott2010language}, or those used in the ubiquitous Wug test by scholars of language acquisition \citep{ratner2000beginning}); or for political aims (such as Esperanto or Ido \citep{okrent2009land}). Some of these may end up with thousands of speakers, including native speakers, and a huge surplus of computational resources. For instance, Na'vi has a multilingual dictionary \citep{navidictionary} that has been translated, using computational tooling and volunteer translators, into over a dozen languages, as well as other dictionaries \citep{wmannis}, grammars, spell checkers, a morphological parser, a Facebook translator,\footnote{\href{https://github.com/learnnavi}{https://github.com/learnnavi}. \last{May~2}} and a Garmin audio file for navigation apps.\footnote{\href{https://learnnavi.org/media/}{https://learnnavi.org/media/}. \last{May~2}} These languages are not normally considered as revitalised or dormant, but are instead mostly ignored or actively excluded (see \citet{gibson2016assessing} for an example of this) by the scientific community altogether.
Heading back to natural languages, Latin would largely not be considered a revitalised language either, although there are immersion schools and some daily usage by the Catholic liturgy. These domains are specific and do not extend into normal life, on the whole. This does not mean it does not have some computational resources, however - the ATMs in the Vatican use Latin as a user interface language,\footnote{\href{https://gizmodo.com/5905595/the-atms-in-vatican-city-speak-latin}{https://gizmodo.com/5905595/the-atms-in-vatican-city-speak-latin}. \last{May~2}} and there are many computational resources for Latin available online, such as Perseus.\footnote{\href{http://www.perseus.tufts.edu/hopper/}{http://www.perseus.tufts.edu/hopper/}. \last{May~2}} Old Swedish, likewise, has some computational resources (admittedly, from a single research group that is humorously aware of the lack of general global interest in the field).\footnote{\href{https://spraakbanken.gu.se/swe/forskning/diabase}{https://spraakbanken.gu.se/swe/forskning/diabase}. \last{May~2}} Latin would normally be considered a \textit{historic} language, like Ancient Greek or Old English. All of these languages, while extinct themselves, have direct descendants (the Romance languages, modern Greek, and English, respectively). Some extinct languages with resources do not have living descendants, such as Tocharian.
Gothic is considered \textit{extinct} today, as it has no direct descendants, although it is still studied, and although there is a small community of writers who continue to use the language, and at least one publishing company which publishes modern work in Gothic.\footnote{See \href{https://wordhoardpress.com}{https://wordhoardpress.com}. \last{May~2} Incidentally, this press is run by, of all people, me.} Not all languages have sufficient texts to be revitalised or used today: Etruscan, Minoan, and Pictish are good examples.
One could argue that some languages may be considered dormant even if there are native speakers alive, if they do not speak the language. For instance, there are a few cases where a couple of speakers are left of a language, but they do not speak it to each other due to interpersonal differences. Most famously, there is the apocryphal story of Ayapaneco, where a global meme ensued from an imagined feud between the last two speakers, to the point where Vodafone released a video claiming that they helped bring the men together to save the language (to the chagrin of actual linguists and anthropologists who had worked on the language for decades).\footnote{\href{http://stories.schwa-fire.com/who_save_ayapaneco\#chapter-113060}{http://stories.schwa-fire.com/who\_save\_ayapaneco\#chapter-113060}. \last{May~2}} This has actually happened elsewhere, such as with Nisenan \citep{snyder2004practice}. Another example might be Ishi, the last Yahi and a speaker of Yana, who explained that he had no name, because there was no other Yahi man to formally introduce him. Ishi means `man' in Yana, and is what Ishi consented to be called as a placeholder for his actual name \citep{kroeber1973ishi}.
Such cases are extreme, and there will be exceptions to almost any of these categories. Even for living languages, questions of identification can be difficult. For instance, \citet{gilRiau} points to at least a dozen different interpretations of what Riau Indonesian might technically be. Defining language is beyond the scope of this thesis - however, I would be remiss not to mention this problem here.
\subsubsection{Official, \textit{de facto}, \textit{de jure}, majority, and minority languages}
All of the former definitions were seen through the lens of language communities and vitality. However, there are other lenses through which languages as a whole can be viewed - for instance, politically and computationally.
Political definitions of language include \textit{official} and \textit{working} languages. Official languages are languages which are given a definitive status by a state, normally on the national level. On the supranational level (such as is the case with the EU or the ICC), they are generally termed working languages (which is different, in turn, from a \textit{lingua franca}, which is a trade, bridge or link language used informally between groups who speak different languages themselves). These languages can be broken down into {\it de facto} and {\it de jure} languages - the latter are given legal status in the law, while the former do not have official legal status but are considered culturally and for most intents and purposes as the legal language. An example would be in the United States, where there is no {\it de jure} legal language, but the {\it de facto} language is English. This means that most resources are provided in English, and other languages are often ignored or not allocated resources by the law.
\begin{quote}
These terms, as defined by \citet{johnson2013language}, distinguish policies from one another by virtue of their alignment between law and practice, respectively. Here, {\it de jure} policies are those disseminated in legal proclamations, typically being `officially documented in writing' (p. 10). By contrast, {\it de facto} policy describes those policies that exist in {\it practice} [sic], crucially, without legal provenance or even {\it in spite} of existing \textit{de jure} polices. \citep{hanks2017policy}
\end{quote}
An example given by \citet{hanks2017policy} is the case of boarding schools in the United States and Canada for indigenous children, often forcibly removed from their home, where the {\it de jure} goal was to provide the children with a working knowledge of English, but the {\it de facto} result was that they were heavily discouraged (often through direct physical abuse to students who spoke in their language) from speaking their native tongues in the classroom or in the schools, with the result that many languages were directly endangered or lost. This has happened in many other places as well: for instance, Gaelic was forbidden in the classroom by English teachers, and children were beaten (for instance, slapped across the knuckles with a ruler) for using Gaelic.
Within a state, the proportion of population of speakers compared to the entire population generally determines whether a language is considered a \textit{majority} or a \textit{minority} language. Not all minority languages are endangered languages; for instance, Catalan, spoken by around nine million people in Catalonia and southern France, is not endangered, although it is a minority language and is not an official language of any country. There are arguments that it is the majority language for a stateless state. The same could be said of Tibetan, which is officially the minority language in a region of China, but is considered to be the majority language of the region of Tibet itself, which many view as its own state currently under illegal occupation (as with Hebrew and Israel, further political discussion is beyond the scope of this thesis).
Some minority languages have legal status as minority languages. A good example would be in Canada, where minority languages in each province are given legal protection - for instance, English in Qu\'ebec, where a majority of the speakers are Francophone, or French in Ontario, where the majority of the speakers are Anglophone. Sometimes languages with very small populations are given legal status, too. In Nunavut, a territory in Canada, the two Inuit languages of Inuktitut and Inuinnaqtun are granted legal status, although they are nationally minority languages. This is particularly exceptional given that one of them, Inuinnaqtun, has only around a thousand speakers\footnote{\href{https://www.ethnologue.com/language/ikt}{https://www.ethnologue.com/language/ikt} \citep{lewis2009ethnologue}. \last{May~2}} and comprises fewer than 3\% of the population of Nunavut.\footnote{\href{http://stats.gov.nu.ca/en/home.aspx}{http://stats.gov.nu.ca/en/home.aspx}. \last{May~2}} Another example would be Hawai'ian, which has been the state language since 1978, although it only has around 2000 native speakers and is a minority language in Hawai'i \citep{lewis2009ethnologue}.\footnote{\href{https://www.ethnologue.com/language/haw}{https://www.ethnologue.com/language/haw}. \last{May~2}}
\subsubsection{Low resource, under resourced and incident languages}
\textit{Low resource languages} (LRLs) have fewer computational resources than the larger languages that dominate global discourse. There is no distinct cut-off for defining a low resource language versus a \textit{high resource}, \textit{resource-rich}, or just a \textit{resourced} language. A \textit{low resource} language can also be indiscriminately called an \textit{under resourced} or \textit{sparsely resourced} language, and occasionally can also be called a {\it non-central} language \citep{streiter2006implementing}. The lack of settled definitions reflects the focus of research, as generally researchers work with specific languages on computational models, and not on large databases where a precise definition is useful. Qualifiers are often included - for instance, \citet{agic2015if}'s paper, ``If all you have is a bit of the Bible: Learning POS taggers for {\it truly} low-resource languages'' (emphasis added). These qualifiers are generally not considered within a rigorous system of rank - for more on that, see Section~\ref{subsec:metrics} on metrics below.
In the context of LRLs, the majority of established work revolves around adapting existing systems from high resource languages to low resource languages. In such a case, the \textit{source} language is the one on which the original system was trained or built, while the \textit{target} language is the language upon which the system is being used, tested, or adapted. These terms are largely context dependent. Similarly, \textit{sparse} in particular is more often used to refer to a dataset, but can be used of a language when it is under resourced.
While hypothetically some languages could be defined as having no resources, there is no commonly used term such as `resourceless'. Languages without corpora of any kind would fit in this category. The most common approach towards building resources for these languages generally involves either writing down basic word lists, or recording audio files or videos and using these to bootstrap language resource development. Of course, as soon as there was one audio file or one word written in the language, then the nebulous category of `resourceless' could no longer be applied. Generally, the term used for this state is {\it undocumented}. The first steps towards documentation involve intensive work by field linguists to discern the phonemic inventory of the language, using specific tools such as dictionary applications or audio/video applications such as Praat \citep{boersma2009praat}, which allows users to view and annotate the waveforms for spoken corpora. These resources - unannotated corpora made by field linguists for a language - are, along with word lists and basic dictionaries, often the first resources for a given language, and are often not published but are accessible only through corresponding with the linguist or team doing the work. A new strategy involves using audio files directly, without a written stage, to describe phonemic inventories \citep{kempton2014discovering,bird2014aikuma,adams2017automatic}. In any event, a comparison with multimillion dollar projects such as Google Translate or the US Defense Advanced Research Projects Agency (DARPA) sponsored TIMIT corpus \citep{garofolo1993darpa} makes it clear that undocumented languages would be considered under resourced.
Another couple of terms often used in this general context are \textit{incident} or \textit{surprise} languages. The latter is generally used for challenges, and was first used to describe the DARPA ``Surprise Language Challenge'', run by their Translingual Information Detection Extraction and Summarization (TIDES) programme in 2003. The challenge's goal was to see if teams working on new languages they had not seen before (hence, `surprise') could develop sufficiently useful resources and machine translation systems within a constrained period of time \citep{oard2003surprise}. These sorts of challenges are not limited to DARPA; for instance, there was a Workshop on Statistical Machine Translation held at EMNLP 2011 \citep{callison2011findings}. This workshop focused on a few tasks, one of which was based on the successful efforts by the Microsoft Translation team in 2010 to build a machine translation system for Haitian Creole that used SMS messages, after an earthquake there precipitated the immediate need for a translation system between aid workers and speakers of Haitian Creole, previously a low resource language \citep{lewis2010haitian, lewis2011crisis}. Haitian Creole, here, would be an \textit{incident} language.
\subsubsection{Computer languages}
A \textit{computer} or \textit{computational} language is a formalised language used to communicate instructions to a machine. There are a large variety of names and variants, and the definition here may be construed as insufficient. For the purposes of this thesis, a computer language is for talking to a machine, and is demonstrably different from a human or \textit{natural} language, which is generally used for communicating with humans. This definition is important only insofar as it helps clarify that I am talking about human languages when I mean low resource or endangered languages, not computer languages. The relevancy, usage, or status of computer languages is largely irrelevant here, unless it touches on resources used on human languages. For instance, any grammar written in COBOL, a sixty year old language, may be less accessible to open source coders who write primarily in Python or JavaScript, two popular languages used on the web and in the FLOSS ecosystem today. This type of situation is covered in more depth in Section~\ref{subsec:digital-permanence}.
\subsection{Metrics for language vitality}
\label{subsec:metrics}
Language health or vitality is a topic of increasing scholarship and interest. Superficially, it makes sense to use a similar system to classify languages as one would classify biological species, using the metrics defined by the International Union for Conservation of Nature (IUCN).\footnote{\href{http://www.iucnredlist.org/}{http://www.iucnredlist.org/}. \last{May~2}} They have nine levels of classification: Extinct, Extinct in the Wild, Critically Endangered, Endangered, Vulnerable, Near Threatened, Least Concern, Data Deficient and Not Evaluated. However, the system is not directly transferable - how would a dormant language be classified? One can quickly see that there is a need for a language-specific rating system.
There are various popular metrics which can be used to classify the health of a language and its community. In this section, I explain these metrics in detail, focusing on the GIDS, EGIDS, UNESCO, and LEI measurements, as suggested by \citet{yang2017toward} as the main players in the field.
\subsubsection{The Graded Intergenerational Disruption Scale (GIDS)}
The Graded Intergenerational Disruption Scale (GIDS), developed by \citet{fishman1991reversing}, is the earliest and best known of the scales. It rates languages based on their domains of use, and on the amount of transmission and education which continues to the next generation through the parents. Figure~\ref{fig:gids} summarises the different stages. As a language ceases to be used in one domain, it becomes less likely to be used in that domain in the future, and more likely that parents will consider the language to be less useful than another. Over time, this causes the language to lose speakers (although the process is not inevitable; for example, language policy in Qu\'ebec helped secure and revitalise French over the past half century \citep{bourhis2001reversing}). Generally, as a language's usage deteriorates and the language becomes more imperilled, the language is assigned a higher classification in GIDS, with Level 8 being the least stable, and Level 1 being the most.
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{img/gids.png}
\caption{A summary of GIDS \citep{fishman1991reversing} from \citet[105]{lewis2010assessing}}
\label{fig:gids}
\end{figure}
\subsubsection{The UNESCO measurement scale}
\label{subsec:unesco}
Chronologically, the UNESCO rating was the next major scale in the field. The United Nations Educational, Scientific and Cultural Organization (UNESCO) is a specialised agency of the United Nations. In 2001, UNESCO officially recognised that biodiversity, cultural diversity, and linguistic diversity are related. This viewpoint is relatively recent, and reflects increasing appreciation that culturally diverse regions tend to collocate with biodiverse regions, and that saving diversity implies saving both \citep{nettle2000vanishing, maffi2001biocultural, maffi2004world, anderson2006language, krauss2007keynote, gorenflo2012co} (as discussed explicitly in \citet{maffi2001}, of which all of the authors were also members of the UNESCO Ad Hoc Expert Group on Endangered Languages). Encouragingly, UNESCO also clarified at this event that sustaining and encouraging linguistic diversity lies within their charter.
In their publication from that conference, \citet{brenzinger2003language} lay out nine different metrics for measuring language vitality: six evaluate general vitality, two language attitudes, and one the urgency of documentation. The UNESCO system is rigorous in its refusal to apply a single score to a language, as that would smooth over the complexities of language usage. The six factors for vitality are: intergenerational language transmission (as with GIDS), absolute number of speakers, proportion of speakers within the total population, trends in existing language domains, response to new domains and media, and materials for language education and literacy.
For each of these, \citet{brenzinger2003language} break down classification further into subcategories. For instance, when regarding intergenerational language transmission, they specify six different possible ratings - Safe, Unsafe, Definitively Endangered, Severely Endangered, Critically Endangered, and Extinct - and equate each rating with a score from zero to five, with zero being the least stable. Here one of the primary issues with the UNESCO rating can be seen (as pointed out by \citet{lewis2010assessing}) - namely, that `safe' is an incredibly large category that needs more fine-grained categories, as it would account for any GIDS-rated language above Level 6.
The three other factors they consider are: governmental and institutional language attitudes and policies including official status and use; community members' attitudes toward their own language; and the amount and quality of documentation. Each of these is also rated on a zero to five scale. For documentation, only a superlative rating of five would be considered to be more than low resourced, as a four rating would be given to a language where ``There are one good grammar and a number of adequate grammars, dictionaries, texts, literature, and occasionally updated everyday media; adequate annotated high-quality audio and video recordings.'' Although useful for linguists wishing to work in the language, this may not be enough to spur language resource development. For more on this, see Section~\ref{subsec:who-makes-resources}.
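
Spelled out for the intergenerational transmission factor, and assuming the ratings are listed above from most to least stable, the correspondence between rating and score is the following (the arrow notation is only a convenience for exposition, and is not taken from \citet{brenzinger2003language}):
\[
\begin{array}{lcl}
\mbox{Safe} & \mapsto & 5 \\
\mbox{Unsafe} & \mapsto & 4 \\
\mbox{Definitively Endangered} & \mapsto & 3 \\
\mbox{Severely Endangered} & \mapsto & 2 \\
\mbox{Critically Endangered} & \mapsto & 1 \\
\mbox{Extinct} & \mapsto & 0
\end{array}
\]
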
In Figure~\ref{fig:unesco}, an example rating using this system, from the appendix of \citet{brenzinger2003language} itself, is included to get some grasp of how these grades work in parallel.
Importantly, UNESCO clarifies that it does not suggest using one metric over another, and that adding up the numbers in the scales - however easy that might seem, as all of the measurements except speaking population are scalar and hold the same number of levels - would be insufficient and not ideal. ``\textbf{Languages cannot be assessed simply by adding the numbers}; we therefore suggest such simple addition \textit{not be done} [sic].''
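
A small worked illustration may help show why addition loses information. The notation and the two languages below are hypothetical, introduced only for this sketch, and are not drawn from UNESCO data. Write an assessment as a profile
\[
(n;\, g_1, g_2, \ldots, g_8), \qquad g_i \in \{0, 1, \ldots, 5\},
\]
where $n$ is the absolute number of speakers and $g_1, \ldots, g_8$ are the eight graded factors, taken here in the order: intergenerational transmission, proportion of speakers, existing domains, new domains and media, educational materials, governmental attitudes, community attitudes, and documentation. Consider two invented profiles for the graded factors:
\[
A = (5, 5, 5, 0, 0, 5, 5, 0), \qquad B = (3, 3, 3, 3, 3, 3, 3, 4).
\]
Both sum to the same total, since $5+5+5+0+0+5+5+0 = 25$ and $3+3+3+3+3+3+3+4 = 25$. Yet $A$ describes a language that is vigorously transmitted but undocumented and absent from new media, while $B$ describes one that is moderately endangered on every front but comparatively well documented. The two call for very different interventions, a difference that a single summed score would hide.
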
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{img/unesco.png}
\caption{The UNESCO grading for three Venezuelan indigenous languages \citep[23]{brenzinger2003language}. It is unclear why San\textipa{1}ma has dashes for the Response to New Domains factor. The absolute number of speakers for Mapoyo is in parentheses ``to indicate that they quantify `rememberers' rather than speakers'' \citep[22]{brenzinger2003language}.}
\label{fig:unesco}
\end{figure}
The UNESCO ratings for languages are listed in the \textit{UNESCO Atlas of the World's Languages in Danger} \citep{unesco2014unesco}.
\subsubsection{The Expanded GIDS (EGIDS)}
\citet{lewis2009ethnologue} in \textit{Ethnologue}\footnote{Also a website available at \href{https://www.ethnologue.com/}{https://www.ethnologue.com/}. \last{May~2}} pointed out some of the issues with GIDS which necessitated the creation of a new standard, and which could also eclipse or inform the UNESCO rating \citep{lewis2010assessing}. First, the levels are static, and do not account for directionality on the part of a language community up or down the strata. Second, there are language types which are not included - for instance, there is no supranational level for extremely stable languages, nor is there a level for extinct or dormant languages. Third, GIDS focuses on intergenerational disruption in Level 5 and down, but in Level 4 and higher it focuses more on institutions, and this is not accounted for well enough in the framework, which primarily focuses on parents as being the primary agents of language transmission. Finally, the lower levels are not granular enough to cover the many complexities needed for language revitalisation groups.
EGIDS - the Expanded GIDS - serves these needs by providing more granular definitions. It also draws on the extensive knowledge of languages and their usage provided not only by Ethnologue, but also by the UNESCO \textit{Atlas} and the community of linguists working with the Summer Institute of Linguistics (SIL), who fund and publish Ethnologue. Table~\ref{table:egids} shows the main categories, taken from the Ethnologue website.\footnote{\href{https://www.ethnologue.com/about/language-status}{https://www.ethnologue.com/about/language-status}. \last{May~2}} The table has been updated since \citet{lewis2010assessing}, in particular to also account for signed languages \citep{bickford2015rating}. The addition of a Level 0 and two levels beneath the scale are evident, as well as more granularity in the GIDS scale, such as can be seen with Level 6, which now has two levels, Level 6a Vigorous and Level 6b Threatened.
\begin{table}
\centering
\begin{tabular}{|p{1.5cm}|p{2.5cm}|p{9cm}|} \hline
\textbf{Level} & \textbf{Label} & \textbf{Description} \\ \hline
0 & International & {\small The language is widely used between nations in trade, knowledge exchange, and international policy.} \\ \hline
1 & National & {\small The language is used in education, work, mass media, and government at the national level. } \\ \hline
2 & Provincial & {\small The language is used in education, work, mass media, and government within major administrative subdivisions of a nation. } \\ \hline
3 & Wider Communication & {\small The language is used in work and mass media without official status to transcend language differences across a region. } \\ \hline
4 & Educational & {\small The language is in vigorous use, with standardization and literature being sustained through a widespread system of institutionally supported education. } \\ \hline
5 & Developing & {\small The language is in vigorous use, with literature in a standardized form being used by some though this is not yet widespread or sustainable. } \\ \hline
6a & Vigorous & {\small The language is used for face-to-face communication by all generations and the situation is sustainable. } \\ \hline
6b & Threatened & {\small The language is used for face-to-face communication within all generations, but it is losing users. } \\ \hline
7 & Shifting & {\small The child-bearing generation can use the language among themselves, but it is not being transmitted to children. } \\ \hline
8a & Moribund & {\small The only remaining active users of the language are members of the grandparent generation and older. } \\ \hline
8b & Nearly Extinct & {\small The only remaining users of the language are members of the grandparent generation or older who have little opportunity to use the language. } \\ \hline
9 & Dormant & {\small The language serves as a reminder of heritage identity for an ethnic community, but no one has more than symbolic proficiency. } \\ \hline
10 & Extinct & {\small The language is no longer used and no one retains a sense of ethnic identity associated with the language. } \\ \hline
\end{tabular}
\caption{Expanded Graded Intergenerational Disruption Scale \citep{ethnologuewebsite}}
\label{table:egids}
\end{table}
\citet{lewis2010assessing} also add another set of EGIDS levels which can be used to rate a language which is ascending in domains due to revitalisation efforts, which Figure~\ref{fig:egids-up} shows. This is useful, although it does suggest that a language uniformly descends or ascends, which may not be the case. The authors also spend time describing how to identify a language and decide which level best describes it.
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{img/egids-up.png}
\caption{A summary of EGIDS ascending levels for revitalisation \citep[117]{lewis2010assessing}}
\label{fig:egids-up}
\end{figure}
They end with a quote from \citet{fishman2001can}, which explains further the purpose of EGIDS, and clarifies the general intent of language analysts in building these metrics:
\begin{quote}
Thus, any theory and practice of assistance to threatened languages - whether the threat be a threat to their very lives, on the one hand, or a much less serious functional threat, on the other hand - must begin with a model of the functional diversification of languages. If analysts can appropriately identify the functions that are endangered as a result of the impact of stronger languages and cultures on weaker ones, then it may become easier to recommend which therapeutic steps must be undertaken in order to counteract any injurious impact that occurs. The purpose of our analyses must be to understand, limit and rectify the societal loss of functionality in the weaker language when two languages interact and compete for the same functions within the same ethnocultural community and to differentiate between life-threatening and non-life-threatening losses.
\end{quote}
\citet{simons2013world} presented a review of \citepos{krauss92} clarion call \emph{The world's languages in crisis}, twenty years on, and the picture is less alarming than it initially appeared, thanks in part to the efforts of linguists since Krauss's initial paper.
\begin{quote}
This analysis has enabled us to confirm that, as Fishman predicted, the largest number, fully two-thirds, of the languages of the world are safely maintained in everyday oral use in their communities (EGIDS 6a) or are at a stronger level of development and recognition (EGIDS 0 - 5). Nevertheless, the statistics also reveal that 29\% of the world's languages are in some stage of loss or shift (EGIDS 6b - 9). Most tellingly, this is more languages than the 25\% that are in some stage of development beyond oral use alone (EGIDS 0 - 5). \citep[17]{simons2013world}
\end{quote}
\subsubsection{The Language Endangerment Index (LEI)}
Just as EGIDS expanded on GIDS, the Language Endangerment Index (LEI) was formed to resolve some of the issues with EGIDS, as well as to respond to GIDS, the UNESCO rating, and the rating in \citet{krauss2007classification}, another metric which focused almost exclusively on different ages of speakers and classified all languages with child speakers as `stable', and all with over a million speakers as `safe'. \citet{lee2016assessing} describe LEI for its use in The Catalogue of Endangered Languages (ELCat), part of the Google-powered Endangered Languages Project.\footnote{\href{http://endangeredlanguages.com}{http://endangeredlanguages.com}. \last{May~2}} The project is sponsored not only by Google, but also by a grant from the US National Science Foundation (NSF),\footnote{\href{https://www.nsf.gov/awardsearch/showAward?AWD\_ID=1058096}{https://www.nsf.gov/awardsearch/showAward?AWD\_ID=1058096}. \last{May~2}} and is an ambitious project (like UNESCO and Ethnologue) to catalogue all languages and to provide specific metrics of language vitality.
The authors, in describing LEI, go into detail explaining how previous classifications, while they "highlight[s] the immensity of the problem at hand", cannot easily apply to certain languages, and that these exceptions are critical to understanding whether the metrics are useful as opposed to being exceptions which prove the rule. Unlike the other papers, they explicitly mention some languages. For instance, they mention how \citet{dwyer2012tools} points out that Wutun, a Chinese-Tibetan-Mongolic language, is endangered due to a variety of factors, even if transgenerational transmission is not at risk - thus, GIDS or EGIDS may not satisfactorily categorise the language. A similar case could be made for Naskapi (see Section~\ref{sec:naskapi-vitality-status} for more on this).
The LEI uses four factors: intergenerational transmission, absolute number of speakers, speaker number trends (whether increasing or decreasing), and domains of use. Each of these is rated, like the UNESCO rating, on a scale from zero to five - however, unlike UNESCO, they add these numbers up to produce a single rating. The higher the total, the more endangered the language is judged to be. The scales are also somewhat different; for instance, number of speakers runs on orders of magnitude, with 100,000 being the minimum number needed for a language to be considered safe (and not a million, as in \citet{krauss2007classification}).
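As a minimal sketch of this additive scheme (the symbols $s_i$ and $w_i$ are my own notation, not that of \citet{lee2016assessing}), let $s_1, \dots, s_4$ be the scores for the four factors, each an integer from 0 to 5, and let $w_i$ be a per-factor weight:
\begin{equation}
\mathrm{LEI}_{\mathrm{raw}} = \sum_{i=1}^{4} w_i s_i, \qquad s_i \in \{0, 1, \dots, 5\}.
\end{equation}
With every $w_i = 1$ this is plain addition, giving a total between 0 and 20; the weights and any normalisation actually used in ELCat should be checked against \citet{lee2016assessing}. For a purely hypothetical language scored $(3, 4, 2, 3)$ on the four factors, the unweighted total would be $3 + 4 + 2 + 3 = 12$ out of a possible 20.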
\subsubsection{A response to qualitative metrics}
\label{subsubsec:response}
\citet{lee2016assessing} point out further issues with some of the other assessments - most notably that "while the UNESCO framework is broad and its factors comprehensive, it does not give an overall vitality score to the language being assessed, making it difficult to compare accurately across different language" and that "while an assessment of the type and quality of documentation is doubtlessly important because it helps indicate the potential for revitalization and the urgency of further research, it is not clear that the type and quality of documentation directly affects the vitality of a language." These two points are interesting, because they reflect how the situation of \citet{lee2016assessing} influences their judgement and their decision in making LEI at all. The authors were aware that they were being overtly quantitative in their approach:
\begin{quote}
Some may prefer a more nuanced examination of a language's vitality, with the view that the factors responsible for a language's endangerment are too complex to be compared across languages. Researchers of this view would rally against quantitative measures, stating that quantitative measures can hardly be accurate. ... ELCat researchers, while sympathetic to these points of view, maintain that without understanding and investigating fundamental common factors responsible for language endangerment, very little pro\-gress will be made in assessing language vitality and, consequently, less can be done to help communities preserve their languages. ELCat strikes a balance between these different perspectives. \citep[279]{lee2016assessing}
\end{quote}
As \citet{grenoble2016response} points out, this misses the point of qualitative rebuttals, by claiming that accuracy is the most salient argument. It does not have to be, as there are more pressing concerns. For instance, all of the metrics were built on the assumptions that quantifying language endangerment is useful, and that assessment directly leads to empowering communities to revitalise their language - indeed, \citet{lee2016assessing} directly state this in the quote above. Neither of these assumptions is directly backed up by empirical research \citep{grenoble2016response}.
On another note, language itself is not indisputably something that is countable or measurable, and to think so is to reflect Western, modernist ideologies surrounding language, viewing a language as a distinct entity which is formalised in writing and education. Language could be viewed alternatively as inextricable from the speaker and the utterance, and this view is more likely to be taken by language groups which view themselves as separate from a nation-state or an ethnographic group \citep{bodo2017language}. To view language otherwise is to confine language to a countable, commodifiable entity in a post-colonial sense, which affects how the language is viewed and can have real effects on language communities. Even viewing linguistic biodiversity as something to be `saved' raises ideological concerns, as Haspelmath (one of the main editors of the \textit{World Atlas of Language Structures} \citep{wals}) notes.\footnote{\href{https://dlc.hypotheses.org/195}{https://dlc.hypotheses.org/195}. \last{May~2}} Indeed, post-colonial attitudes towards language endangerment may be endemic in the field of academic linguistics; \citet{newman1998we} certainly suggests that non-Western linguists cannot adequately document or revitalise their own languages without Western training, which presupposes that to be an informed researcher one must also conform to Western ideologies. Against this backdrop, \citepos{lee2016assessing} claim that accuracy is something that can be attained seems to miss the mark; rather, the canonical approach to metrics is in itself a flawed approach that carries with it certain uncomfortable presumptions.
This thesis cannot hope to resolve these issues, nor is it meant to be an overview of the field of language vitality or endangerment as ideology. However, it is worth noting that metrics of language vitality do not exist in a vacuum, and that documentation and computational efforts are also a part of a wider conversation. Literacy is not necessarily a domain into which a language has to ascend to be seen as `safe' or `vital'. Further, `technological progress' as a concept in itself is controversial, and coming to a consensus of what `progress' means is fundamental to holding meaningful discussions involving language development, community planning, or digital ascent.
% What does it mean that I don't define it here, then?
% I don't feel like this is the right place to define this, largely because my views on progress are almost luddite.
Some actions can be taken in this thesis, however. Terminologically, `low resource' is intentionally somewhat neutral, as compared to `minority', `endangered', or other terms that reflect Western viewpoints. Similarly, using the term \textit{language vitality} as opposed to \textit{language endangerment} "represents a significant shift in the representation of attitudes toward the rhetoric of indigenous languages to one away from dire predictions about endangerment to action-oriented attitudes about vitality and sustainability" \citep{grenoble2016response}. These terms are used for the rest of this thesis, and any statements about resource development should be viewed as part of a narrower question of digital development (in the sense of building resources) for a specific, almost na\"ively countable view of language, unless otherwise specified.
\subsection{Digital presence}
Digital presence can be thought of as the amount of language data available for a specific language through digital sources. A looser definition could be `the amount of written text on the web', but this would miss out on several important considerations. First, linguistic data does not have to be written to be digitally encoded; videos and audio data are both examples of content which is often digitally encoded. In some cases, pictures are also relevant, especially for signed languages or for examples of written text, such as in the millions of scans of papyrus from the Egyptian city of Oxyrhynchus, which are being translated using a crowd-sourced system by thousands of volunteers \citep{williams2014computational}, or for other language mediums, such as the khipu knot system used by the pre-Columbian Incan civilisation \citep{quilter2002narrative}. Secondly, the web is not the only corpus of knowledge, nor is it the only network through which data can be accessed. Trivial examples of other corpora would be local files collected by individual field researchers that are backed up on hard drives; a similar example of another network would be a local area network in offline areas, or a university intranet.
However, the digital sphere can best be thought of schematically as a new domain for language use, and today it is overwhelmingly represented on the web. Ten years ago, it was fashionable to include references to the web "as a corpus" (as \citet{scannell2007crubadan}, for instance, cited \citet{resnik1999mining, ghani2001mining, kilgarriff2001web}, although the latter two were in reference to low resource languages); today, it is more common to cite studies on digital natives such as the 20,000 citation-strong \citet{prensky2001digital} paper,\footnote{This number is from Google Scholar (\href{https://scholar.google.com}{https://scholar.google.com}) accessed April 9, 2018.} or to assume that the web, and occasionally phone networks, are the main locations for digital communication. The web is ubiquitous; not only is more than half of the global population connected to the internet,\footnote{\href{https://www.internetworldstats.com/stats.htm}{https://www.internetworldstats.com/stats.htm}. \last{May~2}} but the internet, in developed countries, is used for all levels of communication, such as education, work, mass media, and in the home and local communities. Digital presence, then, is functionally the amount of usage on the web.
\subsubsection{Finding resources on the web}
\label{subsec:finding-resources}
Before defining metrics, it is worth noting briefly how to find out whether a language has any digital content on the web. In addition to using a search engine to look for content that coincides with the language's name, and hoping that it happens to be written in the target language, there are several resources which can be used to judge the number of corpora for a language on the web. The main resource for low resource languages is almost certainly the Cr\'ubad\'an project, developed by \citet{scannell2007crubadan}.\footnote{\href{http://crubadan.org/}{http://crubadan.org/}. \last{May~2}} This is a massive crawler which looks for documents whose trigram frequencies match those of a seed corpus for under-resourced languages, developed from Wikipedia, the Jehovah's Witness translations, and translations of the Universal Declaration of Human Rights (UDHR) \citep{assembly1948universal}. It is often the only corpus for a low resource language on the web, as is the case with Naskapi (see Section~\ref{sec:naskapi}). A similar project, Indigenous Tweets,\footnote{\href{http://indigenoustweets.com/}{http://indigenoustweets.com/}. \last{May~3}} collects tweets from speakers of LRLs \citep{scannell2013endangered}.
Often, a translated Bible is the next best place to look for digital content. Biblical translations are so common as a first resource that there is a body of research that uses partial or full translations of the Bible for training natural language processing (NLP) systems as a result \citep{chew2006evaluation, agic2015if}. When finding the Bible or UDHR in a target language is difficult, the next best bet is to look for resources in large aggregators of linguistic data. There are large projects which hold resources for linguists - for more, see Section~\ref{subsec:resource-aggregators}. However, these resources are not always directly reflective of a language's digital presence, but rather of the scope of resources available to computational linguists and natural language processing experts. They satisfy a different need, and tools such as Perseus\footnote{\href{http://www.perseus.tufts.edu/hopper/}{http://www.perseus.tufts.edu/hopper/}. \last{May~2}} might show that there is work done on Latin, but it does not mean that there is a large Latin-speaking community that could be measured. Instead, organic corpora - such as those collected from the web by Cr\'ubad\'an - are most likely the best way of measuring a language's foothold on the web.
Wikipedia,\footnote{\href{https://en.wikipedia.org/wiki/Main_Page}{https://en.wikipedia.org/wiki/Main\_Page}. \last{May~3}} a collaborative online encyclopaedia, is often the first port of call for speakers of low resource languages wishing to develop content in their language on the web. As of April 2018, there were over three hundred different languages with their own versions of Wikipedia.\footnote{\href{https://en.wikipedia.org/wiki/List_of_Wikipedias}{https://en.wikipedia.org/wiki/List\_of\_Wikipedias}. \last{May~3}} \citepos{kornai2013digital} study (see Section~\ref{sec:metrics-for-digital-presence} below) showed this especially:
\begin{quote}
The reason is that children, as soon as they start using computers for anything beyond gaming, become aware of Wikipedia, which offers a highly supportive environment of like-minded users, and lets everyone pursue a goal, summarizing human knowledge, that many find not just attractive, but in fact instrumental for establishing their language and culture in the digital realm. To summarize a key result of this study in advance: \emph{No wikipedia, no ascent} [sic]. \citep{kornai2013digital}
\end{quote}
This may be an overstatement; while Wikipedia presence may be heavily correlated with digital presence for a language, this does not imply that it is a necessary factor. This also ignores that Wikipedia presence may merely show enthusiastic hobbyists (as \citet{soria2017digital} note), and that bilingual speakers may not be interested in translating entries, although they may use their language digitally elsewhere. However, in any event, it is a useful resource for LRL research.
\subsubsection{Metrics for digital presence}
\label{sec:metrics-for-digital-presence}
\paragraph{\citepos{kornai2013digital} metric}
\citet{kornai2013digital} outlined the first major metric for describing digital presence for a language. Such metrics are needed because conventional vitality metrics are not directly transferable to digital presence, as digital linguistic data is decoupled from speakers (it can survive beyond them), and because the digital domain is only one of a variety of domains for language usage. He divided languages into four possible categories: Thriving, Vital, Heritage, and Still. These can be thought of as a gradient, with digital ascent being the process of a language moving up the scale. Only 16 languages would be considered Thriving, all of which would be rated at 1 or higher on the EGIDS scale. Vital languages are those which may be in danger in the next hundred years, or show few signs of digital ascent - but they have a large population of speakers and at least some resources, such as a Bible or the UDHR; Heritage languages are dead or historic languages such as Latin which have large online presences that do not relate directly to a living language community; and Still languages show little to no presence on the web at all (although note that this does not mean that they are endangered or moribund outside of the web). A table from \citet{kornai2013digital} for the languages sampled is provided in Figure~\ref{fig:kornai}.
\citet{kornai2013digital} looks at five confluent factors: demographics, prestige, the identity function of the language, the level of software support, and Wikipedia presence for a language. Demographics and community size can be gathered by doing a quantitative analysis of all public data available in a language on the web, and by using this data size as a proxy for the number of speakers of a language using the digital space. This has obvious limits, which Kornai points out, in that the data may not accurately reflect the number of users, in that it is limited to public data accessible by researchers, and in that it does not give an accurate representation of passive consumption of multilingual data. It would be worth adding that this also does not give an accurate count of multilingual usage of a language. Prestige is an obvious factor for digital ascent; when a language community views one language as more useful or relevant than another, it is more likely to create digital content in one than the other, regardless of social policies and to some extent speaker populations. Identity function marks whether speaking populations identify themselves with a language, and is used largely to weed out certain historical languages, like Latin and Classical Chinese, which have large corpora online but should not be considered in the same grouping as more vibrant, living languages.
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{img/kornai.png}
\caption{Summary characteristics of language by class \citep[9]{kornai2013digital}}
\label{fig:kornai}
\end{figure}
Software support as a factor in digital presence could be identified with a variety of different metrics. Kornai lists various stages for a language on the road of digital ascent. First, localisation or internationalisation (often expressed using the shorthand l10n or i18n, where the numbers refer to the number of letters omitted between the first and last letter of each word) of the language script is the major milestone that separates languages which are ascending from Still languages. While many languages use the more common Roman, CJK (a shorthand for Chinese, Japanese, and Korean languages), Cyrillic, or Arabic scripts, there are hundreds which do not, and these languages have specific Unicode considerations which need to be met for the language to be used adequately. The next step would be word-level tools, such as dictionaries, stemmers, and spellcheckers - all of which depend, at some point, on standardisation of the language. Finally, sentence level tools such as automatic translators can be used. Regarding support, the question of a language's status is straightforward: is there language support for an operating system provided by Apple or Microsoft? If so, then it is likely that the language is thriving or vital. If not, there is almost zero chance of it being so. Kornai also used the Cr\'ubad\'an Project, UDHR and biblical presence, and presence on Omniglot and OLAC (see Section~\ref{subsec:resource-aggregators}).
The best indicator of a language's digital presence was its EGIDS rating. "The next best set of features indicated the quality of the wikipedia, followed by the number of L1 speakers, the size of the Cr\'ubad\'an crawl, the existence of FLOSS spellcheckers, and the number of online texts listed in OLAC." \citep[6]{kornai2013digital} Overall, only 5\% of the world's languages were seen as digitally ascending; like most results from this field, this is a dire statistic. As \citet[10]{kornai2013digital} writes:
\begin{quote}
Unfortunately, at a practical level heritage projects (including wiki\-pedia incubators) are haphazard, with no systematic programmes of documentation. Resources are often squandered, both in the EU and outside, on feel-good revitalization efforts that make no sense in light of the pre\"{e}xisting functional loss and economic incentives that work against language diversity \citep{ginsburgh2011many}.
\end{quote}
However, others have noted that the prediction that most languages will not digitally ascend may be overly pessimistic \citep{gibson2016assessing}.
In a follow-up paper, \citet{kornai2015new} proposed adding a single number scale to assess digital ascent (\`a la LEI): "For the assessment we propose a simple log-linear formula that derives a single number {\emph D} (digital vitality index) as a weighted sum of well-understood components such as the EGIDS ranking, (log) number of L1 speakers, (log) size of wikipedia, adjusted for quality, (log) crawl size, the existence of FLOSS spellcheckers, etc." % Removed because it wasn't clear
% The EGIDS ranking, although not a measurable ranking was considered objective, given that SIL linguists are generally interested in longer term work with communities as opposed to relatively short-lived or quantitative studies done by computational linguists.
This log-linear formula was innovative in how it treats Wikipedia data in particular, as the quality adjustment reduces the likelihood that large Wikipedias built by hobbyists with bots are taken as indicative of large language communities.
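Schematically, and only as a sketch of the general shape of such a formula, the index could be written as follows. The symbols are my own labels for the components named in the quotation: $E$ for the EGIDS level, $N$ for the number of L1 speakers, $W$ for the quality-adjusted size of the Wikipedia, $C$ for the crawl size, and $S \in \{0, 1\}$ for the existence of a FLOSS spellchecker. The weights $w$ are placeholders, since their values are not reproduced here:
\begin{equation}
D = w_E E + w_N \log N + w_W \log W + w_C \log C + w_S S + \dots
\end{equation}
Because the size terms enter logarithmically, a tenfold increase in Wikipedia size adds only the constant $w_W \log 10$ to $D$, which limits how far a single inflated component can move the index.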
\paragraph{\citepos{gibson2016assessing} extension}
\citet{gibson2016assessing} extends Kornai by adding two separate statuses for languages: Emergent and Latent. Emergent languages are those where there is data, but it is privately hidden in messaging applications or cellphone usage, and unlikely to be accessible by the crawlers and corpora agglomeration tools used in \citet{kornai2013digital}.\footnote{Whether scrapers used to gather corpora from private messaging platforms, such as in \citet{littauerfacebook}, would figure into this status is uncertain.} These would be identified by researchers in the field, and do not need to have locale or i18n setups before inception. Gibson cites Arabizi (as noted by \citet{darwish2013arabizi}), where Arabic is written in Latin characters and numerals stand in for sounds that have no equivalent in the Latin alphabet, as an example; another might be the use of a forward slash to denote accents in early Irish Gaelic forums, as noted by \citet{scannell2007crubadan}. Latent languages are languages which meet the following criteria: "stable intergenerational transmission of the language, an available model of writing the language, the availability of appropriate technology and infrastructure (internet, mobile phone coverage), fonts in which to write the language in the desired script, and communal desire to see the language used digitally." If all of these are met, then the language could ascend beyond Still into Vital. Such languages would be admittedly impossible to find by measurements, but this category would be helpful for linguists working in the field to determine how to best work with the language community to help bootstrap language development. Gibson also redefined Still, which \citet{kornai2013digital} had marked as languages which are `unable' to ascend, while here they are merely `unlikely'.
\paragraph{\citepos{soria2017digital} metric}
A more recent metric was also introduced in a draft by \citet{soria2017digital}, for the purposes of helping digital language planning for the EU, as part of the Digital Language Diversity Project.\footnote{\href{http://www.dldp.eu/content/reports-digital-language-diversity-europe}{http://www.dldp.eu/content/reports-digital-language-diversity-europe}. \last{May~2}} Their scale has the following states: Pre-digital, Dormant, Emergent, Developing, Vital, and Thriving. Like Gibson, they exclude \citepos{kornai2013digital} Heritage status (noting incorrectly that Gibson had included it, when he had in fact excluded it on the same grounds), without sufficient explanation as to why dead languages are not relevant when there are communities based around them, some of which are communities with thousands of L2 speakers. Dormant would be equivalent to Latent, while Pre-digital would apply to languages without internet or cell connectivity for the speaking population. Emergent through Thriving are largely matters of scale. While Kornai used proxies for the five factors he mentioned, Soria et al. note that such factors are difficult to quantify; they remedy this by focusing on three groups of indicators: "a group pertaining to a language digital {\it capacity} [sic], a group related to a language digital {\it presence and use}, and a group related to a language digital {\it performance}." \citep[5]{soria2017digital} An example of how these are used can be seen in Figure~\ref{fig:dldp}.
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{img/dldp.png}
\caption{Indicators of digital vitality \citep[6]{soria2017digital}}
\label{fig:dldp}
\end{figure}
\citet{soria2017digital} go into depth about each of these factors. As an example, for localised software, they propose the scale in Table~\ref{table:dldp-software}. They explain, for each scale, how to find information - for instance, they suggest asking local researchers and community members about the usage of "Windows, Mac OS X, Linux, Android, iOS, Microsoft Office, LibreOffice, Firefox, Chrome, Internet Explorer, Thunderbird, Adobe Creative Suite, Gimp" for judging localised software. However, they do not report ratings for any languages judged according to this scale, and they do not make it clear whether or not the different metrics ought to be summed to come up with a single number (an issue which \citet{lee2016assessing} raised with the UNESCO rating). In conclusion, while this is an interesting and in-depth metric, its wider applicability is not clear.
\begin{table}
\begin{center}
\begin{tabular}{|p{2cm}|p{1.5cm}|p{9.5cm}|} \hline
Label & Grade & Localised software \\ \hline
none & 2 & Neither operating system nor general purpose software localised in the language\\ \hline
limited & 3 & At least one operating system (either desktop or mobile, either open or commercial) localised in the language \\ \hline
medium & 4 & At least one desktop and one mobile operating system (either open or commercial) + some general purpose software (a word processor and a browser) localised in the language\\ \hline
strong & 5 & Most used operating systems and general purpose software localised in the language; some specific purpose application software localised.\\ \hline
advanced & 6 & Main operating systems and application software localised in the language. \\ \hline
\end{tabular}
\end{center}
\caption{Scale for Localised Software \citep[21]{soria2017digital}}
\label{table:dldp-software}
\end{table}
Each of these metrics suffers from growing pains. For instance, there is as yet no metric which ranks English in its own category, something which was seen as a large enough issue to cause the EGIDS authors to add another null ranking for supranational languages. In addition, there has not been an integrated approach that looks at quantitative and qualitative measurements together. The most substantial work on this has been done by Kornai's team, which has worked with funding from SIL International on a Digital Language Vitality database.\footnote{\href{https://hlt.bme.hu/en/projects/lingvit}{https://hlt.bme.hu/en/projects/lingvit}. \last{May~2}}
\subsection{Summary}
Above, I have covered many of the terms and metrics used to compare different languages. As was clear in the definitions section, language classification is not always an easy task. Different starting points influence how one views a language, and each language is a multifaceted, complex system. Even if one narrows the scope to minority, low resource, natural, living, endangered languages, there is still a lot of room for disagreement about exactly how a language is faring on the global stage and what steps linguists and language developers should take to help the language stay vibrant. Each popular metric is different, and some of them fundamentally disagree on how to measure a language's vitality. There are two main camps: the quantifiers (EGIDS, LEI), who want to assign a language a specific number to determine its level of endangerment, and the qualifiers (UNESCO), who are willing to let complexity be a part of the classification of language vitality.
The situation gets more complex for digital presence, that is, the state of a language on the internet (or simply in electronic form). I covered three different proposals for assessing how far a language has ascended into the digital sphere. As GIDS did for language vitality, \citet{kornai2013digital} set the stage for how to rank a language in the digital sphere; the metrics that followed are largely derivative. With that in mind, it is worth asking what sort of resources exist for languages. I cover this in the next section.
% Chapter needs some sort of conclusion, focusing in and summarizing the most important points