Linguini

Linguini is a tongue twister generator written in Scala. It started as a 24-hour hackathon project at HackPotsdam II, but we hope to keep improving the tongue twisters it generates.

Methodology

Many tongue twisters rely on repeating the same phoneme (Peter Piper picked ...) or rotating among phonemes that sound similar but require very different tongue positions (She sells sea shells ...).

In either case, we'd expect the phoneme distribution of such tongue twisters to be highly concentrated on a few phonemes. We could simply select a collection of words whose combined phoneme frequencies have low entropy, but this gives no guarantee of producing a coherent English sentence. A natural way to address this problem is the following procedure:

  1. Use a treebank to get several sequences of parts of speech to use as sentence templates
  2. Create a collection of words corresponding to each part of speech
  3. Use the pronunciation dictionary to get phoneme frequencies for each word
  4. For some sentence template, replace each part of speech with an appropriate word that decreases the entropy of the phoneme distribution of the sentence.
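
The procedure above can be sketched roughly as follows. This is an illustrative Python sketch, not Linguini's Scala code: the function names, the greedy left-to-right selection rule, the ban on reusing a word, and the tiny hand-written lexicon (with stress markers omitted) are all assumptions made for the example. A real run would draw templates from the treebank and pronunciations from the CMU pronouncing dictionary.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of the distribution given by phoneme counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


def fill_template(template, words_by_pos, pronunciations):
    """Greedily fill each part-of-speech slot with the word that yields the
    lowest phoneme entropy for the sentence built so far (assumed strategy)."""
    sentence = []
    counts = Counter()  # phoneme counts of the words chosen so far
    for pos in template:
        # Prefer words not used yet; fall back to all words if none remain.
        candidates = [w for w in words_by_pos[pos] if w not in sentence]
        candidates = candidates or words_by_pos[pos]
        best = min(
            candidates,
            key=lambda w: entropy(counts + Counter(pronunciations[w])),
        )
        sentence.append(best)
        counts.update(pronunciations[best])
    return sentence


# Hypothetical toy data: hand-written phoneme lists, not loaded from a dictionary.
pronunciations = {
    "sailor": ["S", "EY", "L", "ER"],
    "piper": ["P", "AY", "P", "ER"],
    "peppers": ["P", "EH", "P", "ER", "Z"],
    "chose": ["CH", "OW", "Z"],
    "picked": ["P", "IH", "K", "T"],
}
words_by_pos = {
    "NOUN": ["sailor", "piper", "peppers"],
    "VERB": ["chose", "picked"],
}
template = ["NOUN", "VERB", "NOUN"]

print(fill_template(template, words_by_pos, pronunciations))
# ['piper', 'picked', 'peppers']
```

Because each slot is filled without looking ahead, a sketch like this concentrates the sentence on a few phonemes but does nothing to enforce agreement or meaning beyond what the template provides.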

This procedure generates sentences with some structure, but not enough to pass as grammatically correct. Our first approach to sentence generation is fairly naive, so we expect to improve grammatical correctness significantly.

Cherry-picked examples of tongue twisters generated by Linguini

  • I stressed, tasked at that scarce text yet tasked at tests at that states street.
  • Or there wore wars warring wrong to all war, rare long walls
  • Our barbed blond brand, branch, and bands, I stand straight front, can next stanch strained at a draft stance on stark abstracts and contracts
  • Uh, us saw a sauce, so some slump spawned up’s

Data Sources

At the time of writing, Linguini makes use of:

  • the MASC-Propbank-Orig dataset, available here
  • the CMU pronouncing dictionary, available here