% !TEX root = thesis.tex
\section{Discussion}
\label{sec:discussion}
In this section, I ask whether digital language development is necessary; discuss the ethics of using open source software for LRLs; consider data and privacy as reasons why more code is not open source; and finally conclude with thoughts on whether open source can be adequately seen as a tool for saving languages.
\subsection{Is digital presence necessary?}
In Section~\ref{subsubsec:response}, I pointed out how metrics of language endangerment come from a perspective that languages may need to be saved, and that external onlookers should be allowed to judge the status of languages they do not themselves speak, for communities they may not be intimately involved in. This viewpoint may not reflect the views of the language speakers themselves. It is not a view I feel comfortable with; as a researcher, I find it unsettling that I was able to contribute to Naskapi or Gaelic research while writing this paper only through a literature review and my own suggestions for best practices.
One of the more interesting discoveries in this paper was that Gaelic was recorded by UNESCO as Definitely endangered, while Naskapi was marked as Vulnerable -- a safer rating. This is striking, as Gaelic has roughly sixty times the speaking population of Naskapi. The discrepancy reflects the strength of the Naskapi community: almost all of its members speak Naskapi, and it is used in all domains of life. Yet this is happening without technical tools like spell-checkers, Wikipedias, or STT systems.
My questions regarding open source to the linguists working on the dictionary seemed vaguely off the mark. Open source, as a methodology, is useful in particular circumstances -- namely, when there is a large suite of computational tools which can be used to improve the livelihood of speakers. This is clearly the case in large, Western, industrialised places such as Scotland, where there are speakers who are more proficient in Gaelic than in English. However, it is less clear whether open source -- or, indeed, further technological advancement -- would benefit the Naskapi community as much as the work currently ongoing there. Of course, it would be ideal if they were able to type in syllabics on a daily basis, and to read syllabics in all types of literature. But digital development may not be necessary to keep the language alive at this stage.
The Naskapi are currently going through a metamorphic stage, still transitioning from the nomadic life they led half a century ago into a technology-dependent society. A new fibre-optic cable internet connection is being laid this year from the coast; what lies ahead is uncertain.
\subsection{Ethics and open source}
\label{subsec:oss-ethics}
The quote from Richard Stallman in Section~\ref{subsec:defining-open-source} stated that ``free software is an ethical imperative.'' This is, to put it mildly, a loaded statement, and it comes from a philosophical viewpoint that not everyone shares. Open source, for all of its benefits, has serious drawbacks for the developers involved in it.
For one, the overwhelming majority of open source coders in online communities are male, young, and white \citep{ghosh2002free}. A survey of 100,000 users of StackOverflow,\footnote{\href{https://stackoverflow.com/}{https://stackoverflow.com/}. \last{May~2}} a large language-agnostic forum for support and technical questions, found that this has changed little in the past fifteen years, with 92.9\% of users being male and 75\% of them white.\footnote{\href{https://insights.stackoverflow.com/survey/2018/}{https://insights.stackoverflow.com/survey/2018/}. \last{May~2}} Open source is disproportionately skewed towards already advantaged groups.
The incentives around open source contributions are also variable: paid workers are more likely to contribute in the long run, while users who contribute because the code is valuable to them are less likely to stay in the community for long periods of time \citep{roberts2006understanding, shah2006motivation}. Ultimately, it is hobbyists who end up working on code the longest, after its initial value to them has worn off \citep{shah2006motivation}. This has implications for low resource languages: is open source the best vehicle for developing language software, which may require large allocations of time and funding? And is it ethical to implement a system with a high burnout rate among the developers it depends on, when it may make more sense to fund direct work by a small core of dedicated developers?
To make this issue clearer, consider the position of a linguist encouraging language activists to build a localised Wikipedia. Encouraging Wikipedia contributions amounts to encouraging users to invest time which they may not be in an economically advantaged position to spend, as speakers of low resource languages are overwhelmingly not Western, educated, industrialised, rich, or from democratic countries -- the WEIRD group described in \citet{henrich2010most}. Further, the system tends towards high initial attrition rates and diminishing returns for the main investors. Whether encouraging someone to enter such a process is an ethical choice is left up to the reader; I certainly do not have an answer.
These are a couple of small examples of why advocating open source is not a clearcut issue. This paper is not meant to provide a complete overview of all ethical issues; however, at least some of them are worth noting here as caveats. For low resource languages, open source coding presents a clear opportunity for communities to work together, cross-linguistically and between stakeholders, with a minimum of friction caused by proprietary licensing. It is my opinion that any extra work to save languages and help language communities which can be expedited or made redundant should be. Given that languages are dying globally at exorbitant rates, we cannot afford not to work as fast as we can.
\subsection{Data and privacy}
\label{subsec:data-and-privacy}
In Section~\ref{sec:lrl-code}, I endeavoured to show that the state of open source work for LRLs is difficult to determine. Neither curated resources, nor linked aggregators of all resources, nor mining the scientific literature can sufficiently answer the questions of how much code is out there, what the quality of that code is, and where language resource consumers can best find their tools. However, it is probable that researchers working on a given language could easily find references to code relevant to their language, if it exists, using one of these three methodologies.
Unfortunately, a large amount of both data and the tooling built on that data is still not permissively licensed or available. Historically, linguists have not permissively licensed or provided open access to their corpora; it is specifically to combat this that large frameworks like the LDC or META-SHARE were created. However, these organisations do not solve some of the underlying issues regarding sharing data.
One issue which is unresolved is that of aligning incentives for researchers to open their research. Research takes time and funding; opening up research to others can be seen as an act of na\"{i}ve altruism, especially in cases where the work could be easily used by competing labs or businesses. For corpora to be open, providers may need to feel that they will be properly remunerated for the work. For some, this matters less than citations and prestige. Citing linguistic data is not the same as citing research papers in journals or conferences, and only recently have there been movements towards citing data in itself. For instance, the Austin Principles for Data \citep{AustinPrinciples2017} were recently created to set guidelines for citing linguistic data. The principles emphasise that data is an important and legitimate part of the research cycle; that credit and attribution are needed where due; that data should be provided as evidence whenever a claim is made; that it should be referred to with persistent and unique DOIs; that it should be openly accessible; and that it should be verifiable, specific to the claims made, and interoperable and flexible in format. Each of these points could be expanded. Evidentiality, for instance, implies that in certain situations producers should open confidential information if they wish to make a claim academically: Google researchers publishing results from their MT systems would, under these principles, also need to make their corpora available.
These principles can be extended to software, which historically has not been cited academically (as in this paper, where a footnote to a website has for the most part sufficed). There is ongoing work in the sciences (if not in linguistics directly) on enforcing software citations \citep{DBLP:journals/corr/KatzCWHVHSJCCVL15, katz2016report}. The previously mentioned {\it Journal of Open Source Software} \citep{smith2018journal} is a good example of an effort to make code a citable object. To my knowledge, there has been no major effort to link linguistic corpora and their related tools under the same citable object. More research and collaboration here would be welcome.
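To make the idea of software as a citable object concrete, a repository can declare its preferred citation directly in BibTeX form. The following entry is a hypothetical sketch: the author, title, version, DOI, and URL are invented for illustration, and the {\tt @software} entry type requires BibLaTeX (classic BibTeX styles would fall back to {\tt @misc}). It shows the persistent, DOI-backed citation that the principles above call for:

```bibtex
% Hypothetical example; author, title, version, DOI, and URL
% are invented for illustration only.
@software{doe2018naskapitools,
  author  = {Doe, Jane},
  title   = {naskapi-tools: Syllabics conversion utilities},
  year    = {2018},
  version = {0.2.1},
  doi     = {10.5281/zenodo.0000000},
  url     = {https://github.com/example/naskapi-tools},
}
```

Archiving a tagged release with a DOI-minting archive, and then citing that DOI rather than a mutable repository URL, gives the citation the persistence and uniqueness the principles demand.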
Another facet of sharing data revolves around the sensitive nature of linguistic data itself, and the ethical issues this raises for researchers and corpus architects. Participants who initially provide linguistic data may require permanent access to that data, and may wish to restrict access by others -- for instance, when stories or data are viewed as part of their cultural heritage and as private to their culture. Linguists collecting data then need to document the wishes of the participants and convey them to data providers, to ensure that archivists respect the participants' and the linguists' wishes. Data which are gathered electronically {\it en masse} can also lead to difficulties, as not all participants' wishes can be easily taken into account (for instance, with large databases built by web crawlers). This mix of needs and obligations can lead to licensing and access complications, especially with regard to LRLs. For instance, Chiarcos raised a question on the Open Linguistics mailing list\footnote{\href{https://lists.okfn.org/mailman/listinfo/open-linguistics}{https://lists.okfn.org/mailman/listinfo/open-linguistics}. \last{April~27}} regarding the legality of sharing Bible translations under EU and US law, and whether reuse of this data would constitute copyright violation for researchers who use it.\footnote{\href{https://lists.okfn.org/pipermail/open-linguistics/2017-April/001359.html}{https://lists.okfn.org/pipermail/open-linguistics/2017-April/001359.html}. \last{April~27}} (There was no clear resolution in this case.) There is a host of active research and discussion around this topic; \citet{liberman2000legal, newman2007copyright, rice2006ethical, austin2010communities, o2010ethical, cushman2013wampum} are recommended for further reading.
Sometimes, privacy revolves less around the users or the language communities, and more around researchers not wishing to open source their code until they have finished developing their project, until a grant ends, or until they are confident that they will not be scooped by other researchers. Other factors include the brevity of some academic funding cycles, concerns about scope, and a lack of education about how open source works. However, the landscape is slowly changing. For instance, in a paper describing a tool for sharing interlinearised and lexical data in different formats, \citet[132]{kaufman2018kratylos} note that ``Kratylos will be made open-source and accessible to the public through a GitHub repository at the end of the current grant period. Kratylos is built entirely from open-source software itself and transcodes proprietary media formats into the open-source codecs Ogg Vorbis (for audio) and Ogg Theora (for video).''\footnote{To date, this has not been open sourced. \href{http://elalliance.org/programs/documentation/kratylos/}{\nolinkurl{http://elalliance.org/programs/documentation/kratylos/}}. \last{April~27}} This is particularly telling, as it shows that open source releases can arise out of initially closed-source development. Open source is not always a static state for code, and it is becoming more common to see open source code for LRL NLP as researchers become more familiar with current trends in software development.
\subsection{Open Source as a tool for saving languages}
So: how can the open source methodology for software development help low resource languages?
The most obvious advantage of open source is that any code developed is publicly available: anyone can access and use it. This frees communities to work on their own code, and allows language developers to improve their language's technology without searching for large amounts of funding, or depending on collaboration with universities or enterprises which may have different incentives and timelines. By contributing to the digital commons, it is possible to raise the quality of code for everyone: a rising tide lifts all boats.
As \citet{streiter2006implementing} recommends, open source can generate a shared community of researchers interested in maintaining a pool of resources. Open source can also force changes to be made in the open (at least, with a copyleft license), thus allowing community members to contribute to similar code. The social aspect of shared code should not be overlooked: it allows newcomers to learn how to work with the technology, and helps offload continued work from a few hardcore NLP hobbyists. The more coders available within an ecosystem, the more code in that ecosystem can be developed and ultimately used -- if it is open sourced.
As was clear from looking at Gaelic, open source code leads to wider accessibility and to language resource generation. The difficulty of finding resources does not mean that there are none at the governmental, military, or enterprise level. However, the resources that have been found have generally been open source; it is because Scannell and Bauer work largely with open source licensing that each has been able to build on the other's work and to develop tooling around Gaelic resources. Hopefully, this trend will continue.
On a broader level, open source can certainly help language development for other LRLs through educational materials. Currently, millions of software developers are learning how to code using open source tooling on GitHub. NLTK is one of the most popular projects on GitHub, and with almost a thousand citations on Google Scholar,\footnote{\href{https://scholar.google.com/scholar?q=NLTK}{https://scholar.google.com/scholar?q=NLTK}. \last{April~27}} it is popular with academics, too. Open source has allowed it to thrive. Students using it may go on to apply its tooling to their own languages; and, as more digital natives learn to code and more languages find their own language communities online, it is hoped that more languages will digitally ascend.
% I covered this enough. I would just be repeating myself.
% \subsection{Why is not more code open?}
% Finally, I will go into a little detail on the question of why more has not been open sourced, and how to find open source resources.
% - Longevity of linguistic scholarship and work
% No need for a subsection; that's the entire point of this chapter.
% \subsection{How does open source demonstrably help?}