This project generates text based on a GroupMe chat I've had with my best friends for a few years. Text can be generated using training text from an individual group member or from the full chat history, and it can be generated on a word-by-word or character-by-character basis. You can also pick the underlying n-gram model used to generate text.
Text is generated using an n-gram language model, with the n-gram probabilities estimated using the maximum likelihood estimate (MLE).
To explain further, the n in n-gram is the number of consecutive tokens treated as a unit, so each token's probability is conditioned only on the (n - 1) tokens before it. As an example, using a trigram model, where n = 3, we can make the following simplification to a sentence probability (where <s> is a special padding token placed before the sentence):
Original probability:
P("the dog is very cute") = P("the") x P("dog"|"the") x P("is"|"the dog") x P("very"|"the dog is") x P("cute"|"the dog is very")
Trigram probability:
P("the dog is very cute") = P("the"|"<s> <s>") x P("dog"|"<s> the") x P("is"|"the dog") x P("very"|"dog is") x P("cute"|"is very")
More information about trigram and other n-gram language models can be found here.
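For concreteness, here's a minimal sketch (not this project's actual code) of padding a sentence with <s> tokens plus a hypothetical ending tag and splitting it into trigrams, assuming simple whitespace tokenization:

```python
# Hypothetical helper: pad a sentence and break it into trigrams.
def trigrams(sentence):
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

print(trigrams("the dog is very cute"))
# [('<s>', '<s>', 'the'), ('<s>', 'the', 'dog'), ('the', 'dog', 'is'),
#  ('dog', 'is', 'very'), ('is', 'very', 'cute'), ('very', 'cute', '</s>')]
```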
The simplest way to estimate a specific n-gram probability is with MLE, which comes down to a ratio of counts. An example using the trigram "the dog is" is shown below:
P("is"|"the dog") = C("the dog is") / C("the dog")
where C stands for the count of its argument. Essentially, each n-gram probability is derived by counting the number of times the n-gram appears and then dividing by the number of times its first (n - 1) tokens appear. For unigrams, where n = 1, you just divide the unigram count by the total number of tokens in the corpus.
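To illustrate the counting, here's a rough sketch of MLE trigram estimation (it reuses the hypothetical trigrams helper above; the names are for illustration and are not the project's actual implementation):

```python
from collections import Counter

def mle_probabilities(sentences):
    trigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        for w1, w2, w3 in trigrams(sentence):
            trigram_counts[(w1, w2, w3)] += 1
            context_counts[(w1, w2)] += 1
    # P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)
    return {tri: count / context_counts[tri[:2]]
            for tri, count in trigram_counts.items()}
```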
Once all n-gram probabilities have been calculated for a corpus, text generation can begin. Sentences start with a token randomly chosen from the set of all tokens which started sentences in the training corpus. To generate the next token, we first narrow down the candidates by selecting the n-grams whose first (n - 1) tokens match the last (n - 1) tokens of the preceding n-gram (e.g., "that is cool" could be followed by "is cool that"). One n-gram from that set of possible continuations is then chosen at random, weighted by each n-gram's probability, using NumPy's random.choice function.
This process continues until an n-gram is chosen which ends in an ending tag, specifying that it was used to end a sentence in the training corpus.
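A simplified sketch of that generation loop, reusing the hypothetical probs dictionary from the previous sketch and NumPy's random.choice, might look like this:

```python
import numpy as np

def generate(probs, max_tokens=50):
    context = ("<s>", "<s>")
    output = []
    for _ in range(max_tokens):
        # Keep only trigrams whose first two tokens match the current context.
        candidates = [tri for tri in probs if tri[:2] == context]
        if not candidates:
            break
        weights = np.array([probs[tri] for tri in candidates])
        weights = weights / weights.sum()  # renormalize over the candidate set
        idx = np.random.choice(len(candidates), p=weights)
        next_token = candidates[idx][2]
        if next_token == "</s>":  # ending tag: the sentence is done
            break
        output.append(next_token)
        context = (context[1], next_token)
    return " ".join(output)
```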