paper update
Colin Wilson committed Jul 12, 2021
1 parent a2770c6 commit 8ca4bea
Showing 11 changed files with 202 additions and 130 deletions.
46 changes: 33 additions & 13 deletions sigmorphon2021_paper/__latexindent_temp.tex
@@ -133,9 +133,19 @@ \subsection{Completeness}
These results validate the rule learning algorithm proposed by \citet{albright-hayes-2002-modeling} and used in our implementation. Any minimal generalization of two arbitrary rules $R_1$ and $R_2$ (as allowed by the model) can also be derived from $R_1$ (or $R_2$) by recursive application of minimal generalization with one or more base rules.


\subsection{Relative generality}
\label{subsec:generality}

While not required for the minimal generalization operation itself, we define here a (partial) generality relation on rules. The definition uses the same notation as above and is employed in pruning rules after recursive minimal generalization has been applied (see \S\ref{subsec:pruning} below).

Relative generality is defined only for rules $R_1$ and $R_2$ that make the same change. It is sufficient to consider the right-hand contexts $D_1$ and $D_2$ and then apply the same definition to the reversed left-hand contexts. Conceptually, context $D_2$ is at least as general as context $D_1$, $D_1 \sqsubseteq D_2$, iff the set of strings represented by $D_1$ is a subset of that represented by $D_2$ when both contexts are read as regular expressions denoting subsets of $\Sigma_{\#}^*$. The formal definition is complicated somewhat by $X$, which can appear at the end of either context.

Replace each symbol $x \in \Sigma_{\#}$ in $D_1$ or $D_2$ with its feature set $\phi(x)$, treat $X$ as equivalent to $\emptyset$, and let $|D|$ be the length of context $D$; positions are compared by the natural classes they pick out, so that $\emptyset$, which specifies no features, is the most general. Then $D_1 \sqsubseteq D_2$ iff (i) $|D_1| \geq |D_2|$ and $D_1[k] \subseteq D_2[k]$ for all $1 \leq k \leq |D_1|$, except when $|D_1| = |D_2| + 1$ and the last element of $D_1$ but not $D_2$ is $X$, or (ii) $|D_1| = |D_2| - 1$, $D_1[k] \subseteq D_2[k]$ for all $1 \leq k \leq |D_1|$, and the last element of $D_2$ is $X$. Context $D_2$ is strictly more general than $D_1$, $D_1 \sqsubset D_2$, iff $D_1 \sqsubseteq D_2$ and $D_2 \not\sqsubseteq D_1$. Rule $R_2$ is at least as general as $R_1$, $R_1 \sqsubseteq R_2$, iff $C_1 \sqsubseteq C_2$ and $D_1 \sqsubseteq D_2$; it is a strictly more general rule iff either of the context relations is strict.
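
For concreteness, the check can be sketched as follows (a sketch only, not the actual implementation; it assumes contexts are lists of feature sets, encodes $X$ as the empty set, and compares positions by natural-class inclusion, \emph{i.e.}, reverse inclusion of feature sets):

\begin{verbatim}
X = frozenset()  # wildcard: no features, matches anything

def class_leq(p1, p2):
    # Natural class of p1 is contained in that of p2
    # (reverse inclusion of feature sets).
    return p2 <= p1

def context_leq(D1, D2):
    # True iff D1 is at most as general as D2.
    n1, n2 = len(D1), len(D2)
    if n1 >= n2:
        # Exception: D1 one longer and only D1 ends in X.
        if (n1 == n2 + 1 and D1[-1] == X
                and (n2 == 0 or D2[-1] != X)):
            return False
        return all(class_leq(a, b) for a, b in zip(D1, D2))
    if n1 == n2 - 1 and n2 > 0 and D2[-1] == X:
        return all(class_leq(a, b) for a, b in zip(D1, D2))
    return False

def rule_leq(R1, R2):
    # R1, R2 must make the same change; the left-hand
    # contexts are compared in reversed order.
    return (context_leq(R1.C[::-1], R2.C[::-1])
            and context_leq(R1.D, R2.D))
\end{verbatim}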


\section{System Description}

Our system for the shared task preprocessed the input wordforms, learned rules with recursive minimal generalization, scored the rules in two ways, pruned rules that have no effect on the model's predictions, and applied the remaining rules to wug forms in order to generate predicted ratings.

\subsection{Preprocessing}

@@ -145,35 +155,44 @@ \subsection{Preprocessing}

Checking that all wordform symbols appear in a phonological feature chart is also useful for data cleaning. It helped us to identify a few thousand Dutch wordforms containing `+' (indicating a Verb--Preposition combination), which we removed. It also caught an encoding error in which two distinct but visually similar Unicode symbols were used for /g/.

Two acknowledged limitations of the original version of the minimal generalization model, and our version, are relevant here. First, the model learns rules for individual morphological relations (\emph{e.g.}, mapping a bare stem to a past tense form), not for entire morphological systems jointly. Therefore, we retained from the preprocessed input data only the wordform pairs that instantiate the relations targeted by the wug tests: formation of past participles in German \citep{clahsen1999} and past tenses in English and Dutch \citep{booij2019}.

Second, the model cannot learn sensible rules for circumfixes (xxx). This could be remedied by allowing the model to form rules that simultaneously make changes at both edges of inputs, or by allowing it to apply multiple single-edge rules when mapping inputs to outputs. As a provisional solution, we removed the invariant prefix \textipa{/g@-/} whenever it occurred at the beginning of a German past participle (training or wug wordform).\footnote{xxx The prefix is constant in wug outputs but occurred both initially and finally in training outputs; it was removed only if absolutely initial.}
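
A minimal sketch of this prefix removal, assuming wordforms are segmented into lists of symbols:

\begin{verbatim}
def strip_ge(segs):
    # Remove invariant /g@-/ only if absolutely initial.
    return segs[2:] if segs[:2] == ["g", "@"] else segs
\end{verbatim}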

\subsection{Rules}

Given the preprocessed and filtered input data, a base rule was learned for each lexeme and then minimal generalization was applied recursively as in \S\ref{sec:mingen}. This resulted in tens of thousands of morphological rules for each of the three languages (xxx table reference).
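
Schematically, the learning loop can be sketched as follows (a sketch; \texttt{base\_rule}, \texttt{min\_gen}, and \texttt{same\_change} stand in for the operations of \S\ref{sec:mingen}, and generalizing against base rules suffices by the completeness result above):

\begin{verbatim}
base = {base_rule(x, y) for x, y in train_pairs}
rules, frontier = set(base), set(base)
while frontier:
    new = {min_gen(r, b)
           for r in frontier for b in base
           if same_change(r, b)} - rules
    rules |= new
    frontier = new
\end{verbatim}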

A major goal of Albright \& Hayes was to learn rules that can construct outputs from inputs (as opposed to merely rating or selecting outputs that are generated by some other source). Their model achieved this goal, and a substantial portion of its original implementation was dedicated to rule application. We instead delegated the application of rules to a general-purpose finite-state library (Pynini; \citealp{gorman-2016-pynini, gorman2021}).

Each component of a rule $A \to B / C \underline{\ \ \ } D$ was first converted to a regular expression over symbols in $\Sigma_{\#}$ by mapping any feature set $\phi \in \Phi$ to the $|$-disjunction of symbols that bear all of the specified features, and by deleting instances of $X$. Segments were then encoded as integers using a symbol table. Pynini provides a function \texttt{cdrewrite} that compiles rules in this format to finite-state transducers, a function \texttt{accep} for converting input strings to linear finite-state acceptors encoded with the same symbol table, a composition operator \texttt{@} that applies rules to inputs yielding output acceptors, and the means to decode the result back to string form.\footnote{The technique of mapping feature matrices to disjunctions (\emph{i.e.}, natural classes) of segments and beginning/end symbols, and ultimately to disjunctions of integer ids, was also used in the finite-state implementation of \citet{hayes2008}. $X$ was deleted because it occurs only at the beginning of left-hand contexts and at the end of right-hand contexts, both positions where Pynini's rule compiler implicitly adds $\Sigma_{\#}^*$.}
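
As a toy illustration (hypothetical alphabet and rule, not the system's actual code; byte-mode strings are used here for brevity instead of the symbol-table encoding described above), a single rule can be compiled and applied as follows:

\begin{verbatim}
import pynini

# Toy alphabet, including the boundary symbol '#'.
sigma_star = pynini.union("#", "a", "d", "n", "t").closure()

# t -> d / n __ ; a natural class would appear here as a
# |-disjunction of segments, e.g. pynini.union("n", "m").
rule = pynini.cdrewrite(
    pynini.cross("t", "d"), "n", "", sigma_star)

# Apply by composition and decode the output.
out = pynini.accep("#nta#") @ rule
print(out.project("output").optimize().string())  # #nda#
\end{verbatim}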

\subsection{Scoring}

The \emph{score} of a rule is a function of its accuracy on the training data. The simplest notion of score would be accuracy: the number of training outputs that are correctly predicted by the rule (\emph{hits}), divided by the number of training inputs that meet the structural description of the rule (\emph{scope}). Albright \& Hayes propose instead to discount the scores of rules with smaller scopes, using a formula previously applied to linguistic rules by \citet{mikheev-1997-automatic} (see xxx; one free parameter $\alpha$ set to $0.55$ as in A\&H 2003, p.127). Our implementation includes this way of scoring rules, which Albright \& Hayes call \textit{confidence}.

Because confidence imposes only a modest penalty on rules with small scopes, we also considered a score function of the form $score_{\beta} = hits / (scope + \beta)$, where $\beta$ is a non-negative discount factor (here, $\beta = 10$). A rule that is perfectly accurate and applies to just $5$ cases has high confidence $= .90$ but much lower score$_{10}$ $= .33$; one that applies perfectly to $1000$ cases has a near-maximal score ($> .99$) regardless of which function is used. Clearly, these are only two of a wide range of score functions that could be explored.
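
In code, the simpler score function is just (a sketch; the confidence statistic additionally applies Mikheev's lower-bound adjustment with $\alpha = 0.55$, omitted here):

\begin{verbatim}
def accuracy(hits, scope):
    return hits / scope

def score_beta(hits, scope, beta=10.0):
    # Discounts rules with small scope.
    return hits / (scope + beta)

score_beta(5, 5)        # ~0.33
score_beta(1000, 1000)  # ~0.99
\end{verbatim}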

\subsection{Pruning}
\label{subsec:pruning}

When applied to training data consisting of thousands of lexemes, recursive minimal generalization can produce tens of thousands of distinct rules. Albright \& Hayes mention but do not implement the possibility of pruning the rules on the basis of their generality and scores. We pursued this suggestion by first partitioning the set of all learned rules according to their change and imposing a partial order on each of the resulting subsets.

We ordered rules by generality (\S\ref{subsec:generality}), score, and length when expressed with features \citep{chomsky1968a}. Rule $R_2$ dominated rule $R_1$ in the order, $R_1 \prec R_2$, iff $R_2$ was at least as general as $R_1$ ($R_1 \sqsubseteq R_2$) and (i) $R_2$ had a higher score or (ii) the rules tied on score and $R_2$ was either strictly more general ($R_1 \sqsubset R_2$) or shorter. Dominated rules were pruned without affecting the predictions of the model, as we discuss next.
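
A sketch of the dominance test and pruning step (using \texttt{rule\_leq} from the sketch in \S\ref{subsec:generality}, a hypothetical strict variant \texttt{rule\_lt}, and a precomputed featural \texttt{length}):

\begin{verbatim}
def dominates(R2, R1):
    # R1 < R2: R2 at least as general, and better by
    # score, strict generality, or featural length.
    if not rule_leq(R1, R2):
        return False
    if R2.score > R1.score:
        return True
    return (R2.score == R1.score
            and (rule_lt(R1, R2) or R2.length < R1.length))

pruned = [r for r in rules
          if not any(dominates(s, r)
                     for s in rules if s is not r)]
\end{verbatim}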

\subsection{Prediction}

Once rules have been learned by minimal generalization and scored, they can be used for multiple purposes: to generate potential outputs for input wordforms (by finite-state composition), to determine possible inputs for a given output wordform (by composition with the inverted transducer), and to assign scores to input/output mappings. Following Albright \& Hayes, we assume that the score of a mapping is inherited from the highest-scoring rule(s) that could produce it. That is, rules neither `gang up' (multiple rules cannot jointly contribute to the score of a mapping) nor compete (rules that prefer different outputs for the same input do not detract from its score).
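
Schematically (a sketch; \texttt{applies\_to} stands in for the finite-state check that a rule maps $x$ to $y$):

\begin{verbatim}
def mapping_score(x, y, rules):
    # Inherited from the single best applicable rule:
    # no gang-up effects and no competition.
    return max((r.score for r in rules
                if applies_to(r, x, y)), default=0.0)
\end{verbatim}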

As for the scoring function itself, there is a range of other possibilities to consider. For example, rule scores could be normalized within or across changes, a type of competition that is inherent to probabilistic models. See \citet{albright2006} for a different kind of competition model in which rules learned by minimal generalization are weighted like conflicting constraints.


\section{Results}

% German (deu): verb stem tp past % 3417 train examples; 150 wug dev; 266 wug tst
% mingen0 learned 31,562 rules; 3,630 rules after pruning
% dev eval (AIC): -127.5508, tst eval (AIC): -135.0 *lower is better*
% Clahsen 1999 Appendix A


% English (eng): verb stem to past-tense form
@@ -185,7 +204,8 @@ \section{Results}
% 7823 train examples; 122 wug dev; 166 wug tst
% mingen0 learned 55,114 rules; 1,862 rules after pruning
% dev eval (AIC): -58.50812, tst eval (AIC): -76.5 *lower is better*
% Booij 2019


\section{Conclusions and Future Directions}

61 changes: 34 additions & 27 deletions sigmorphon2021_paper/acl_latex.aux
@@ -42,10 +42,15 @@
\citation{prasada1993}
\newlabel{sec:mingenop}{{2.3}{2}{Minimal Generalization}{subsection.2.3}{}}
\citation{albright-hayes-2002-modeling}
\citation{gorman-2016-pynini,gorman2021}
\citation{hayes2008}
\newlabel{subsec:generality}{{2.6}{4}{Relative generality}{subsection.2.6}{}}
\citation{clahsen1999}
\citation{booij2019}
\citation{gorman-2016-pynini,gorman2021}
\citation{hayes2008}
\citation{mikheev-1997-automatic}
\citation{chomsky1968a}
\newlabel{subsec:pruning}{{3.4}{5}{Pruning}{subsection.3.4}{}}
\citation{albright2006}
\citation{bybee1983}
\citation{tenenbaum1999}
\citation{plotkin1970}
@@ -64,34 +69,36 @@
\bibcite{albright-hayes-2002-modeling}{{3}{2002}{{Albright and Hayes}}{{}}}
\bibcite{albright2003}{{4}{2003}{{Albright and Hayes}}{{}}}
\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
\newlabel{tab:accents}{{1}{6}{Example commands for accented characters, to be used in, \emph {e.g.}, Bib\TeX {} entries.\relax }{table.caption.1}{}}
\newlabel{sec:bibtex}{{9}{6}{Bib\TeX {} Files}{section.9}{}}
\newlabel{tab:accents}{{1}{7}{Example commands for accented characters, to be used in, \emph {e.g.}, Bib\TeX {} entries.\relax }{table.caption.1}{}}
\newlabel{sec:bibtex}{{9}{7}{Bib\TeX {} Files}{section.9}{}}
\bibcite{albright2006}{{5}{2006}{{Albright and Hayes}}{{}}}
\bibcite{albright2009}{{6}{2009}{{Albright and Kang}}{{}}}
\bibcite{booij2019}{{7}{2019}{{Booij}}{{}}}
\bibcite{borschinger-johnson-2011-particle}{{8}{2011}{{B{\"o}rschinger and Johnson}}{{}}}
\bibcite{bybee1983}{{9}{1983}{{Bybee and Moder}}{{}}}
\bibcite{clahsen1999}{{10}{1999}{{Clahsen}}{{}}}
\bibcite{corkery-etal-2019-yet}{{11}{2019}{{Corkery et~al.}}{{Corkery, Matusevych, and Goldwater}}}
\bibcite{goodman-etal-2016-noise}{{12}{2016}{{Goodman et~al.}}{{Goodman, Vlachos, and Naradowsky}}}
\bibcite{gorman-2016-pynini}{{13}{2016}{{Gorman}}{{}}}
\bibcite{gorman2021}{{14}{2021}{{Gorman and Sproat}}{{}}}
\bibcite{harper-2014-learning}{{15}{2014}{{Harper}}{{}}}
\bibcite{hayes2008}{{16}{2008}{{Hayes and Wilson}}{{}}}
\bibcite{kapatsinski2010}{{17}{2010}{{Kapatsinski}}{{}}}
\bibcite{kuo2020}{{18}{2020}{{Kuo}}{{}}}
\bibcite{moran2014}{{19}{2014}{{Moran et~al.}}{{Moran, McCloy, and Wright}}}
\bibcite{mortensen-etal-2016-panphon}{{20}{2016}{{Mortensen et~al.}}{{Mortensen, Littell, Bharadwaj, Goyal, Dyer, and Levin}}}
\bibcite{nakisa2001}{{21}{2001}{{Nakisa et~al.}}{{Nakisa, Plunkett, and Hahn}}}
\bibcite{oseki-etal-2019-inverting}{{22}{2019}{{Oseki et~al.}}{{Oseki, Sudo, Sakai, and Marantz}}}
\bibcite{plotkin1970}{{23}{1970}{{Plotkin}}{{}}}
\bibcite{plunkett1999}{{24}{1999}{{Plunkett and Juola}}{{}}}
\bibcite{prasada1993}{{25}{1993}{{Prasada and Pinker}}{{}}}
\newlabel{citation-guide}{{2}{7}{Citation commands supported by the style file. The style is based on the natbib package and supports all natbib citation commands. It also supports commands defined in previous ACL style files for compatibility. \relax }{table.caption.2}{}}
\bibcite{racz2020}{{26}{2020}{{R{\'a}cz et~al.}}{{R{\'a}cz, Beckner, Hay, and Pierrehumbert}}}
\bibcite{racz-etal-2014-rules}{{27}{2014}{{R{\'a}cz et~al.}}{{R{\'a}cz, Beckner, Hay, and Pierrehumbert}}}
\bibcite{strik2014}{{28}{2014}{{Strik}}{{}}}
\bibcite{tenenbaum1999}{{29}{1999}{{Tenenbaum}}{{}}}
\bibcite{verissimo2014}{{30}{2014}{{Ver{\'i}ssimo and Clahsen}}{{}}}
\bibcite{chomsky1968a}{{10}{1968}{{Chomsky and Halle}}{{}}}
\bibcite{clahsen1999}{{11}{1999}{{Clahsen}}{{}}}
\bibcite{corkery-etal-2019-yet}{{12}{2019}{{Corkery et~al.}}{{Corkery, Matusevych, and Goldwater}}}
\bibcite{goodman-etal-2016-noise}{{13}{2016}{{Goodman et~al.}}{{Goodman, Vlachos, and Naradowsky}}}
\bibcite{gorman-2016-pynini}{{14}{2016}{{Gorman}}{{}}}
\bibcite{gorman2021}{{15}{2021}{{Gorman and Sproat}}{{}}}
\bibcite{harper-2014-learning}{{16}{2014}{{Harper}}{{}}}
\bibcite{hayes2008}{{17}{2008}{{Hayes and Wilson}}{{}}}
\bibcite{kapatsinski2010}{{18}{2010}{{Kapatsinski}}{{}}}
\bibcite{kuo2020}{{19}{2020}{{Kuo}}{{}}}
\bibcite{mikheev-1997-automatic}{{20}{1997}{{Mikheev}}{{}}}
\bibcite{moran2014}{{21}{2014}{{Moran et~al.}}{{Moran, McCloy, and Wright}}}
\bibcite{mortensen-etal-2016-panphon}{{22}{2016}{{Mortensen et~al.}}{{Mortensen, Littell, Bharadwaj, Goyal, Dyer, and Levin}}}
\bibcite{nakisa2001}{{23}{2001}{{Nakisa et~al.}}{{Nakisa, Plunkett, and Hahn}}}
\bibcite{oseki-etal-2019-inverting}{{24}{2019}{{Oseki et~al.}}{{Oseki, Sudo, Sakai, and Marantz}}}
\bibcite{plotkin1970}{{25}{1970}{{Plotkin}}{{}}}
\newlabel{citation-guide}{{2}{8}{Citation commands supported by the style file. The style is based on the natbib package and supports all natbib citation commands. It also supports commands defined in previous ACL style files for compatibility. \relax }{table.caption.2}{}}
\bibcite{plunkett1999}{{26}{1999}{{Plunkett and Juola}}{{}}}
\bibcite{prasada1993}{{27}{1993}{{Prasada and Pinker}}{{}}}
\bibcite{racz2020}{{28}{2020}{{R{\'a}cz et~al.}}{{R{\'a}cz, Beckner, Hay, and Pierrehumbert}}}
\bibcite{racz-etal-2014-rules}{{29}{2014}{{R{\'a}cz et~al.}}{{R{\'a}cz, Beckner, Hay, and Pierrehumbert}}}
\bibcite{strik2014}{{30}{2014}{{Strik}}{{}}}
\bibcite{tenenbaum1999}{{31}{1999}{{Tenenbaum}}{{}}}
\bibcite{verissimo2014}{{32}{2014}{{Ver{\'i}ssimo and Clahsen}}{{}}}
\bibstyle{acl_natbib}
\newlabel{sec:appendix}{{A}{8}{Example Appendix}{appendix.A}{}}
\newlabel{sec:appendix}{{A}{9}{Example Appendix}{appendix.A}{}}
13 changes: 12 additions & 1 deletion sigmorphon2021_paper/acl_latex.bbl
@@ -1,4 +1,4 @@
\begin{thebibliography}{30}
\begin{thebibliography}{32}
\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi

\bibitem[{Ahyad(2019)}]{ahyad2019}
@@ -60,6 +60,11 @@ Joan~L. Bybee and Carol~Lynn Moder. 1983.
natural categories}.
\newblock \emph{Language}, 59(2):251--270.

\bibitem[{Chomsky and Halle(1968)}]{chomsky1968a}
N.~Chomsky and M.~Halle. 1968.
\newblock \emph{The Sound Pattern of {{English}}}.
\newblock {MIT Press}, {Cambridge, MA}.

\bibitem[{Clahsen(1999)}]{clahsen1999}
Harald Clahsen. 1999.
\newblock \href {https://doi.org/10.1017/S0140525X99002228} {Lexical entries
@@ -129,6 +134,12 @@ Jennifer Kuo. 2020.
\newblock \emph{Evidence for Base-Driven Alternation in {{Tgdaya Seediq}}}.
\newblock Master's {{Thesis}}, UCLA.

\bibitem[{Mikheev(1997)}]{mikheev-1997-automatic}
Andrei Mikheev. 1997.
\newblock \href {https://aclanthology.org/J97-3003} {Automatic rule induction
for unknown-word guessing}.
\newblock \emph{Computational Linguistics}, 23(3):405--423.

\bibitem[{Moran et~al.(2014)Moran, McCloy, and Wright}]{moran2014}
Steven Moran, Daniel McCloy, and Richard Wright. 2014.
\newblock {{PHOIBLE Online}}.
60 changes: 30 additions & 30 deletions sigmorphon2021_paper/acl_latex.blg
@@ -15,45 +15,45 @@ Warning--I didn't find a database entry for "Ando2005"
Warning--I didn't find a database entry for "andrew2007scalable"
Warning--I didn't find a database entry for "rasooli-tetrault-2015"
Reallocated wiz_functions (elt_size=4) to 6000 items from 3000.
You've used 30 entries,
You've used 32 entries,
3571 wiz_defined-function locations,
926 strings with 12558 characters,
and the built_in function-call counts, 21410 in all, are:
= -- 1990
> -- 731
< -- 36
+ -- 237
- -- 195
* -- 1332
:= -- 3125
add.period$ -- 132
call.type$ -- 30
change.case$ -- 239
chr.to.int$ -- 30
cite$ -- 30
duplicate$ -- 1461
empty$ -- 1717
format.name$ -- 287
if$ -- 4822
939 strings with 12814 characters,
and the built_in function-call counts, 22584 in all, are:
= -- 2094
> -- 767
< -- 38
+ -- 250
- -- 204
* -- 1401
:= -- 3302
add.period$ -- 140
call.type$ -- 32
change.case$ -- 251
chr.to.int$ -- 32
cite$ -- 32
duplicate$ -- 1538
empty$ -- 1821
format.name$ -- 302
if$ -- 5087
int.to.chr$ -- 1
int.to.str$ -- 1
missing$ -- 285
newline$ -- 154
num.names$ -- 124
pop$ -- 570
missing$ -- 301
newline$ -- 164
num.names$ -- 132
pop$ -- 600
preamble$ -- 1
purify$ -- 183
purify$ -- 194
quote$ -- 0
skip$ -- 1126
skip$ -- 1193
stack$ -- 0
substring$ -- 936
swap$ -- 791
substring$ -- 985
swap$ -- 829
text.length$ -- 14
text.prefix$ -- 0
top$ -- 0
type$ -- 267
type$ -- 282
warning$ -- 0
while$ -- 150
while$ -- 158
width$ -- 0
write$ -- 413
write$ -- 438
(There were 5 warnings)