% !TEX root = thesis.tex
\section{Best Practice Recommendations}
\label{sec:methods}
It is customary in a review to offer advice on best practices, both to make the next researcher's job easier and to help lift the quality of the field in general. In the same vein, here are some recommendations for utilising open source for LRL NLP. I recommend licenses for future work, suggest where to store code to maximise its utility and exposure, and describe how to share code without relying on a centralised service such as GitHub. This chapter is especially useful if you develop LRs or work with LRs.
\subsection{Choosing a license}
\label{choosing-a-license}
Legal advice on the internet is often preceded by the initialism IANAL, which stands for ``I am not a lawyer'', or sometimes ``I am not your lawyer.'' The following is not meant to constitute legal advice, and I am not liable for any advice given here.
That having been said, licensing software as open source is something to be encouraged. Section~\ref{subsec:licenses} lists many licenses which are considered open source; any of them should work for most purposes (although I would recommend against the Unlicense in favour of a CC0 license, following the Free Software Foundation's advice that CC0 is ``more thorough and mature''\footnote{\href{https://www.gnu.org/licenses/license-list.html\#Unlicense}{https://www.gnu.org/licenses/license-list.html\#Unlicense}. \last{May~3}}).
\citet{streiter2006implementing} recommend using the GPL for any software contributed to a software pool, their term for community-curated open source software. They also recommend the Lesser GPL where needed; however, they prefer the GPL because, in their account, it requires that all modifications to the software be brought back to the original moderator for acknowledgement, which allows the source code to be updated. A specific example they give is Scannell's Irish spell checker.
\begin{quote}
The case of Irish language spell checking is illustrative in this regard. Kevin Scannell developed an Irish spell checker and morphology engine in 2000, integrated it into the Ispell pool, and released everything under the GPL. Independent work at Microsoft Ireland and Trinity College Dublin led to a Microsoft-licensed Irish spell checker in 2002, but with no source code or word lists made freely available. Now, roughly five years later, the GPL tool has been updated a dozen times thanks to contributions from the community, and the data have been used directly in several advanced NLP tools, including a grammar checker and an MT system. The closed-source word list has not, to our knowledge, been updated at all since its initial release. Indeed, a version of the free word list, repackaged for use with Microsoft Word, has all but supplanted use of the Microsoft-licensed tool in the Irish-speaking community. \citep[282-283]{streiter2006implementing}
\end{quote}
While I agree that the GPL is a great license for some cases, I would recommend against it for another reason. Code is often maintained by a single author, and the GPL puts undue pressure on that author to maintain the code in the long term. Maintaining code is difficult: it involves work time that is often unpaid, and it requires the author to set expectations around the level of maintenance.
For this reason, I have always licensed my own code under the MIT license, which waives all liability and insists that the code therein is provided as-is. Explicitly stating that you, the maintainer or creator (or both), are not responsible for the long-term maintenance of your code makes whatever maintenance is done easier on the maintainers, as it removes undue pressure to keep the code updated. On the other hand, removing the self-induced pressure to update code can lead to abandonware - code which is released into the commons and then no longer updated, such as TileMill, which \citet{gawne2016mapmaking} used in their paper. I think that stating that you are not interested in maintaining your code at the expense of your free time is a reasonable price to pay for preventing burnout, for you or for other maintainers. Burnout is a major factor in coders leaving open source; setting boundaries is an invaluable way to continue working well in the long term.
It is worth noting that work published without a license on a public site is not technically open source. When software is not licensed, it by default reverts (in the US legal jurisdiction, anyway) to copyright where {\it all rights are reserved}, which is by definition not FLOSS. For this reason, it is important to add a license to code if it is in your purview to do so, and if you wish to follow the open source methodology.
For linguists in academia, it is also worth remembering that linguistics is part of a wider community of researchers who work on similar tools. Sites and groups dedicated to open access and open research can apply to linguistic work as well. Guides from OpenAire,\footnote{\href{https://www.openaire.eu/}{https://www.openaire.eu/}. \last{May~3}} the Open Knowledge Foundation,\footnote{\href{https://okfn.org/}{https://okfn.org/}. \last{May~3}} ROpenSci,\footnote{\href{https://ropensci.org/}{https://ropensci.org/}. \last{May~3}} AltMetrics,\footnote{\href{https://www.altmetric.com/}{https://www.altmetric.com/}. \last{May~3}} the Open Science Framework,\footnote{\href{https://osf.io/}{https://osf.io/}. \last{May~3}} and others may be helpful for planning research methodologies from an open perspective.
\textbf{Recommendation}: When you release code, specify your license. If you want to set clear expectations around how much maintenance you are willing to do, I would use the MIT license. However, if this is not a concern because you are willing to maintain the code over the long haul, I would use the GPL.
Again: take these recommendations with a grain of salt, as I am not your lawyer.
\subsection{Choosing repositories}
\label{choosing-repositories}
Where to store your code is a question that must be answered if code is to be open sourced.
There are alternatives to using academic institutions as code providers: host your own, or use a larger institution that has space set aside for maintenance. Or, release the code publicly using whatever enterprise solution seems likely to last the longest. However, these do change - for instance, Sourceforge was very popular before GitHub rose to the top of the field, and now many projects are moving off of Sourceforge and onto GitHub \citep{finley2011github}, which takes time and effort. Why this happens is another matter, and may be related to network effects; for more work mining these networks, see \citet{thung2013network, kalliamvakou2014promises}.
Another idea would be to store your data on peer-to-peer or decentralised networks (as in Section~\ref{subsec:sharing-code-without-a-platform}), which lessen the risks of centralised storage facilities, but also require at least one peer to keep serving the files for longevity to be assured. Ultimately, the best bet is to build files and code which are actively used by the community; the long tail of disused projects is at the most risk, while more popular projects will find a way to survive.
All of the options mentioned so far - hosting it yourself, hosting it on an academic website, using a third-party hosting company - have their costs and benefits. If you have the resources to host the code yourself, I would suggest doing so. Unfortunately, this means that your site becomes the bottleneck for entry and discovery. Academic sites, on the other hand, may be more easily accessed by researchers in the field. However, public sites - like GitHub - are where most open source code lives, as was established in Section~\ref{subsec:where-is-open-source-code}.
For this reason, I explicitly recommend using GitHub as a storage space for open source code. Unfortunately, GitHub is a private company, and its long term goals may not align with scientists interested in century-long timelines. The Rosetta Project,\footnote{\href{https://rosettaproject.org/}{https://rosettaproject.org/}. \last{April~27}} run by the Long Now Foundation, aims to store human languages for millennia - and forward thinking at this scale, while not normally used by academic researchers, raises the question of how long code ought to be stored and whether or not short term solutions are adequate.\footnote{Anecdotally, the Long Now Foundation is also interested in low resource languages, as they reached out to me concerning my dictionaries of Na'vi \citep{navidictionary}, Dothraki \citep{dothrakidictionary}, and another constructed language I crafted called Ll\'arri\'esh \citep{littauerllarriesh}, all of which are now stored on their language archive. The Rosetta Project was excluded from \citet{kornai2013digital} as their archives are not reflective of digital language usage.}
I mentioned briefly in Section~\ref{sec:solutions} that I mirrored all of the Sourceforge repositories I found onto GitHub. Mirroring involves copying an entire code base - importantly, along with the license, so that there is no mistaking authorship - to another ecosystem or service, to maintain it in the long run. It is for this purpose that I set up the GitHub organisation @LowResourceLanguages\footnote{\href{https://github.com/lowresourcelanguages}{https://github.com/lowresourcelanguages}. \last{April~27}} (tangentially connected with the similarly named low-resource-languages repository). This organisation works as a shell to mirror code archives which might otherwise be lost.
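As a minimal sketch of what mirroring can look like in practice, the Python snippet below assembles the two Git commands that copy every branch and tag (and therefore the license file, if it is committed) from one host to another, using Git's {\tt clone} and {\tt push} commands in mirror mode. It assumes that the source is already a Git repository, which will not hold for every Sourceforge project; the URLs and the working directory are hypothetical placeholders, and this is not necessarily the procedure used for @LowResourceLanguages.

\begin{verbatim}
```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], None]


def run_checked(command: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


def mirror_commands(source_url: str, mirror_url: str,
                    workdir: str) -> List[List[str]]:
    """Return the git commands that copy all refs from source to mirror."""
    return [
        # Bare clone that includes every branch, tag, and other ref.
        ["git", "clone", "--mirror", source_url, workdir],
        # Push all of those refs to the new host.
        ["git", "-C", workdir, "push", "--mirror", mirror_url],
    ]


def mirror(source_url: str, mirror_url: str, workdir: str,
           run: Runner = run_checked) -> None:
    """Execute the mirror commands in order using the given runner."""
    for command in mirror_commands(source_url, mirror_url, workdir):
        run(command)


# Hypothetical URLs, for illustration only; nothing is executed here.
planned = mirror_commands(
    "https://example.org/old-host/project.git",
    "https://example.org/new-host/project.git",
    "project.git",
)
```
\end{verbatim}

Calling {\tt mirror} with the same arguments would actually run the commands, which requires Git to be installed and write access to the destination repository.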
I highly recommend mirroring all of the code that you open source, not only on GitHub, but on your personal server if you have one, and, if possible, within @LowResourceLanguages. This affords maximal accessibility, longevity, and indexing within the vibrant GitHub ecosystem. Of course, you should also index and reference your code in relevant research papers, and on any of the large aggregators.
\textbf{Recommendation}: Put your code on GitHub, and mirror it on a personal server and on your university website. Make it clear where people can email you about fixes or bugs, and where you accept patches. GitHub is your best bet for this, at the moment.
\subsection{Sharing code without a platform}
\label{subsec:sharing-code-without-a-platform}
% This section needs a more general introduction -- what is meant by "without a platform"?
% Be more explicit about the fact that you're now talking about an alternative to the kinds of solutions discussed in the previous subsection. And also, would you recommend doing both of these simultaneously? (perhaps something to go in the chapter summary/conclusion)
I have mentioned GitHub and personal or academic servers many times above, but they are not the only options. Each of these three options has a possible point of failure: GitHub can go down, just as your server, your provider, or your academic host can. Ideally, your code would also exist within large aggregators which could host it for you, but there is currently no centralised codebase for linguistic code resources. OLAC, META-SHARE, LRE Maps, LingHub, LinguistList, and the LLOD are all link aggregators, not hosting providers for code. As far as I am aware, @LowResourceLanguages on GitHub is the only code base for LRLs which explicitly hosts the code. But it also relies upon GitHub's presence, which may change in ten, twenty, or a hundred years.\footnote{Since I finished and edited the first draft of this thesis, GitHub has been sold to Microsoft. It is now under new management, and its future is not entirely clear. See \href{https://www.nytimes.com/2018/06/04/technology/microsoft-github-cloud-computing.html}{https://www.nytimes.com/2018/06/04/technology/microsoft-github-cloud-computing.html} for more. \last{June~6}}
Peer-to-peer (p2p) technology may provide a solution to the problem of hosts changing, and is an alternative to the options above. p2p networks work by using protocols to communicate between nodes (users, computers, or servers) in a network. Any node which holds a copy of a file can serve it, and any node which wants a copy can get it from any other node which has it. The more nodes hold a file, the easier and faster this transfer process becomes; and, if one node goes down, the other nodes can still transmit files. This allows for data permanence on a level which is unknown on the HTTP- and TCP-based web, which you normally access through your browser.
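The redundancy argument can be illustrated with a toy model. The Python sketch below is not a real p2p protocol: the {\tt Node} class, the {\tt fetch} function, and the peer names are all hypothetical, and there is no networking involved. It only shows that a file stays retrievable while at least one online node holds a copy, and is lost once every holder is offline.

\begin{verbatim}
```python
from typing import Dict, List, Optional


class Node:
    """A toy peer that stores files in memory and may be offline."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.files: Dict[str, bytes] = {}


def fetch(peers: List[Node], filename: str) -> Optional[bytes]:
    """Return the file from the first online peer holding it, else None."""
    for peer in peers:
        if peer.online and filename in peer.files:
            return peer.files[filename]
    return None


# Three hypothetical peers all hold a copy of the same file.
peers = [Node("a"), Node("b"), Node("c")]
for peer in peers:
    peer.files["data.json"] = b"{}"

# One node goes down; the file can still be served by another peer.
peers[0].online = False
still_available = fetch(peers, "data.json") is not None

# Every holder goes down; the file can no longer be retrieved.
for peer in peers:
    peer.online = False
lost = fetch(peers, "data.json") is None
```
\end{verbatim}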
IPFS, the InterPlanetary File System,\footnote{\href{https://ipfs.io/}{https://ipfs.io/}. \last{April~27}} is one such system which could be used to host data in the long term, and which has been shown to be an effective conduit for data even when malicious actors seek to take it down.\footnote{\href{https://cryptoinsider.com/content/ipfs-first-win-the-catalan-referendum/index.html}{https://cryptoinsider.com/content/ipfs-first-win-the-catalan-referendum/index.html}. \last{May~3}}\footnote{\href{https://observer.com/2017/05/turkey-wikipedia-ipfs/}{https://observer.com/2017/05/turkey-wikipedia-ipfs/}. \last{May~3}} Dat is another similar project,\footnote{\href{https://datproject.org/}{https://datproject.org/}. \last{April~27}} which has been used to save data which was deleted by the Trump administration from US governmental websites.\footnote{\href{https://medium.com/@maxogden/project-svalbard-a-metadata-vault-for-research-data-7088239177ab}{https://medium.com/@maxogden/project-svalbard-a-metadata-vault-for-research-data-7088239177ab}. \last{April~27}} Both of these systems use hashes to point to content, as opposed to locations. A hash is a deterministic identifier derived from the data itself, comparable to a DOI, and hashes also underlie the Git tool used by GitHub and by many researchers. Addressing content in this way allows for faster connections, offline usage among nodes that are connected to each other but not to the web itself, less link rot, greater specificity of content, and decentralisation.
Without going into too much detail, storing data on IPFS and then sharing it between nodes is relatively easy. For instance, the JSON data\footnote{\href{https://gist.github.com/RichardLitt/e60bcf9f399939b16181bf25ad6da8ba}{Available at https://gist.github.com/RichardLitt/e60bcf9f399939b16181bf25ad6da8ba}. \last{April~26}} used to analyse the {\tt low-resource-languages} repository in Section~\ref{sec:solutions} could be uploaded to IPFS. If one downloads the JSON data as a local file, and then installs IPFS,\footnote{\href{https://ipfs.io/}{https://ipfs.io/}. \last{April~27}} one can add the data to the IPFS p2p network. This would be done by running the command {\tt ipfs add data.json} in a terminal. This command returns a hash (again, basically a DOI) which points to the data. This hash looks like this: {\tt QmPztYpkC3aSs\-MYKDcod\-3wJtvoivbp\-NDfxNKQ6dwxnzA52}, and will never change for this exact JSON file, no matter how many times it is added.
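The property that identical content always yields the same hash can be sketched in a few lines of Python. This is an illustration under a loud assumption: it uses a plain SHA-256 hex digest as a stand-in, whereas IPFS chunks files and encodes its identifiers differently, so the output here will not match the hash printed by {\tt ipfs add}. The sample bytes are made up for the example.

\begin{verbatim}
```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a hex SHA-256 digest as a stand-in content identifier.

    Assumption: this is NOT the encoding IPFS uses, so the digest will
    differ from the hash that `ipfs add` reports for the same file.
    """
    return hashlib.sha256(data).hexdigest()


# Made-up sample content, standing in for the bytes of a data file.
original = b'{"name": "example"}'
identical_copy = bytes(original)
edited = original.replace(b"example", b"changed")

# Identical bytes always map to the identical address.
same_address = content_address(original) == content_address(identical_copy)
# Changing the content changes the address.
different_address = content_address(original) != content_address(edited)
```
\end{verbatim}

Because the address is computed from the bytes alone, two people who add the same file independently arrive at the same identifier, which is what lets a hash point to content rather than to a location.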
Anyone who runs IPFS can fetch and re-share the content at this hash, at which point they are also storing it on their own device. It can also be accessed through a gateway to IPFS, just like SPARQL can be used to mine an RDF database like Linghub by accessing the LOD. For instance, by going to \href{http://ipfs.io/ipfs/QmPztYpkC3aSsMYKDcod3wJtvoivbpNDfxNKQ6dwxnzA52}{\nolinkurl{http://ipfs.io/ipfs/QmPztYpkC3aSsMYKDcod3wJtvoivbpNDfxNKQ6dwxnzA52}},\footnote{\href{http://ipfs.io/ipfs/QmPztYpkC3aSsMYKDcod3wJtvoivbpNDfxNKQ6dwxnzA52}{http://ipfs.io/ipfs/QmPztYpkC3aSsMYKDcod3wJtvoivbpNDfxNKQ6dwxnzA52}. \last{May~3}} one can download the file directly. Uptime for this link depends upon the \href{https://ipfs.io}{https://ipfs.io} gateway, which should be available most of the time. Within the IPFS network itself, the file remains available at that hash regardless of whether the gateway is up, as long as at least one node continues to hold it. This is similar to RDF and a SPARQL gateway, except that the underpinning logic depends not upon XML specifications, but upon the data itself.
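Gateway access amounts to an ordinary HTTP request against a URL built from the hash. The Python sketch below assumes the public {\tt ipfs.io} gateway and the hash quoted in the text; the function names are hypothetical, and the download function needs network access and a live gateway, so it is defined but not called here.

\begin{verbatim}
```python
from urllib.request import urlopen

GATEWAY = "https://ipfs.io/ipfs/"  # public gateway named in the text


def gateway_url(content_hash: str, gateway: str = GATEWAY) -> str:
    """Build the HTTP URL at which a gateway serves the given hash."""
    return gateway.rstrip("/") + "/" + content_hash


def fetch_via_gateway(content_hash: str, timeout: float = 30.0) -> bytes:
    """Download the content over HTTP; requires a reachable gateway."""
    with urlopen(gateway_url(content_hash), timeout=timeout) as response:
        return response.read()


url = gateway_url("QmPztYpkC3aSsMYKDcod3wJtvoivbpNDfxNKQ6dwxnzA52")
```
\end{verbatim}

Passing a different {\tt gateway} argument would point the same hash at another gateway, which illustrates that the identifier is independent of any single host.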
There are more applications than just storing data, however. Some similar projects are already being used by non-central language communities. For instance, Guyanese communities are using p2p systems combined with GIS to map illegal logging on their land, all while being offline and not being connected to the main internet.\footnote{\href{https://www.digital-democracy.org/}{https://www.digital-democracy.org/}. \last{April~27}} \citet[90]{jancewicz2002applied} talked at length about how Naskapi development benefited from a linguist working hand-in-hand with local communities, versus long-distance arrangements as with Cree, which resulted in slower uptake of tooling and in adverse standardisation of syllabics and keymapping. A p2p network could help in these environments. It could also be used to share linguistic data within a language community, without depending upon an institutional archive in another country, a significant barrier to access and licensing control for language communities. There is almost certainly exciting work to be done with LRLs and p2p networks.
\textbf{Recommendation:} If you can, back up your data on a p2p network. This will enable other researchers to get your data more easily, will provide it with a DOI-like identifier you can reference, and will reduce the risk of link rot. You should do this \textit{in addition} to adding it to GitHub or your personal servers.
\subsection{Summary}
Figuring out how to store, license, and disseminate your code is not easy. Above, I have given a few small recommendations based on my knowledge of the open source ecosystem, and on exhaustively looking for open source LRL resources. Choosing a license is difficult; settling on popular ones like MIT or GPL removes a lot of the headache of understanding all of the intricacies of software law. Choosing where to put your code is also hard; however, following the crowd is the best bet here too, as putting the code where all of the coders are (on GitHub) heightens the chances that users who are looking for code like yours will find it. Putting it in an unlinked folder on a university website does not help many people, in comparison. Finally, using a p2p system is a way of ensuring longevity in the long run, although it does require more legwork, and discoverability is most likely not yet a feature of these networks. The best thing to do is to remember that there are stages of publication; do what you can with the time you have. Choosing a license and copying the text takes half a minute; installing IPFS takes a while. Publishing without a link to your code does not take any time at all, but publishing with one may save you hours and days of work down the road, if someone else finds it and helps you out by submitting a patch, emailing a thank you note, or running your code and citing your publications.
For language communities, publishing code publicly can also be massively beneficial, as it will allow others to use the tools that they want in the languages you speak. Selling a tool built for a low resource language severely limits the potential users. License, host, and give the tool away for the most beneficial result, both for your language communities, and for you.