- Fundamental Concepts
- Topic Detection
- Sentiment Analysis
- Tagging
- Word Networks
- Correction & Prediction
- Stemming
- Vectorization
- Chatbots
- Speech
For each library that requires installation, the session number in parentheses indicates the sessions that employ the package.
a. Python — Create a new Colab in Python
gutenbergpy (S01)
nltk (S01)
wordcloud (S01)
matplotlib (S01)
numpy (S01)
pandas (S01)
speechrecognition (S10)
pyaudio (S10)
pyttsx3 (S10)
scipy (S10)
ffmpeg-python (S10)
b. R — Create a new Colab in R
gutenbergr (S01)
tidytext (S01)
ggplot2 (S01)
quanteda (S01)
quanteda.textplot (S01)
tm (S02)
reshape (S02)
reshape2 (S02)
topicmodels (S02)
wordcloud (S02)
RColorBrewer (S02)
textdata (S03)
reshape2 (S03)
igraph (S04)
stopwords (S08)
plot.matrix (S08)
proxy (S08)
word2vec (S08)
plot3D (S08)
NbClust (S08)
factoextra (S08)
You can either run things in an online environment like Google Colab or install both of these open-source tools on your own computer. Note that some installable packages come pre-installed in the Colab Python environment (like pandas and numpy) but need to be installed with pip if you set up your own environment.
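As a sketch, the Python packages listed above can be installed from a Colab code cell (or a local shell) with pip; the PyPI names are assumed to match the list, and pyaudio typically needs the PortAudio development headers first.

```python
# In a Colab code cell, a leading "!" runs a shell command.
!pip install gutenbergpy nltk wordcloud speechrecognition pyttsx3 ffmpeg-python
# numpy, pandas, matplotlib and scipy come pre-installed in Colab.
# pyaudio usually needs the PortAudio headers (assuming a Debian/Ubuntu image):
!apt-get install -y portaudio19-dev
!pip install pyaudio
```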
- Text Mining with R
- NLP with Python
In addition to the above books, which are available through the library, we use the freely available online versions of the following textbooks:
- Speech and Language Processing, referred to as SLP
- Supervised Machine Learning for Text Analysis in R, referred to as SMLTAR
- Project Gutenberg: the raw text of numerous books
- Open Multilingual Wordnet: WordNet data in other languages
The contents of each assignment are detailed on myCourses and also provided as a single file on Overleaf for easier access, so you can prepare your responses offline.
- token = a meaningful unit (of text)
- tokenization = the process of extracting tokens from text
- string = a data representation for a sequence of characters
- metadata = tags or other types of data associated with a string or a token, describing its origin, meaning, or some other characteristic thereof
- corpus = a collection of textual data that contains strings, possibly with associated metadata
- stopword = a word whose presence is deemed meaningless in a given context
- term-document matrix = a matrix in which each row represents a term and each column represents a document (the transpose is a document-term matrix), with the cells indicating the frequency of occurrence of each term in each document
- tf-idf = term frequency times inverse document frequency, a weighting that assigns higher weight to terms that are not frequent across all of the documents; the idf is the natural logarithm of the total number of documents divided by the number of documents that contain the term (see the tf-idf sketch after this list)
- LDA = Latent Dirichlet Allocation, a topic-modeling algorithm that represents each document as a mixture of topics and each topic as a mixture of words
- lexicon = a set of words, a vocabulary
- unigram = a unit of language that is a single word
- part of speech (POS) = lexical category = word class = the "grammar classes" of words such as nouns, adverbs, verbs, adjectives, etc.
- bigram = a two-word sequence
- n-gram = a sequence of n words
- WordNet = a graph-format thesaurus of relationships between English words (see the WordNet sketch after this list)
- hyponym = a word with a more specific meaning than a given word (e.g., "poodle" is a hyponym of "dog")
- hypernym = a word with a more general meaning than a given word (e.g., "animal" is a hypernym of "dog")
- meronym = a word denoting a component of a concept
- holonym = a word denoting the container (whole) of a concept
- antonym = a word with the opposite meaning (the contrary version: vertical/horizontal, positive/negative)
- edit distance = the total cost of the alterations needed to convert one string into another (see the edit-distance sketch after this list)
- stemming = removal of affixes (suffixes, mostly; sometimes prefixes) to cut down all variants of a word to their "common core"
- normalization = a process of regularizing text in some way, such as making all of it lowercase
- lemmatization = reducing each (conjugated, plural, capitalized, ...) word to the form in which it would appear in a dictionary
- cosine similarity = the dot product of two numerical vectors divided by the product of their norms
- PMI = pointwise mutual information, a measure quantifying how much more often two words appear together than one would expect if they co-occurred at random (see the PMI sketch after this list)
- lemma = the "dictionary form" of a word
- wordform = the specific variant of a word, such as the conjugated form, that may not be a lemma as such
- polysemous = having more than one meaning
- (word) embedding = a vectorization of a text that attempts to capture semantics based on word context, such as word2vec
- skipgram window = a subsequence of tokens of a fixed length
- skipgram probability = the probability (relative frequency) of two tokens appearing together in a skipgram window
- reflection = a word pair in which one serves as a response to the other in the sense that if the point of view of the speaker is reversed, the substitution maintains consistency ("this is my dog", "your dog is cute")
- rule-based chatbot = one that picks a response to an incoming message based on a set of rules, often expressed in terms of regular expressions (see the chatbot sketch after this list)
- self-learning chatbot = one that uses machine learning to determine responses
- bag of words = representing a piece of text (sentence, document, etc.) by the multiset of words it contains, disregarding word order (or a binary representation indicating which words are present)
- text to speech = have the computer read a given text out loud (see the speech sketch after this list)
- speech to text = have the computer create a string from a recording (live or file) of spoken language
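The sketches below illustrate a few of these concepts in Python. They are minimal, self-contained examples with made-up toy data, not the code used in the sessions.

The tf-idf sketch builds a term-document matrix by hand on a three-document toy corpus, weights it with the natural-logarithm idf defined above, and compares two documents with cosine similarity.

```python
from collections import Counter

import numpy as np

# Toy corpus: three tiny "documents" (made up for the example).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Tokenization: here simply lowercasing and splitting on whitespace.
tokens = [doc.lower().split() for doc in docs]
vocab = sorted(set(word for doc in tokens for word in doc))

# Term-document matrix: rows are terms, columns are documents,
# cells hold raw occurrence counts.
counters = [Counter(doc) for doc in tokens]
counts = np.array([[c[term] for c in counters] for term in vocab], dtype=float)

# idf = ln(total number of documents / number of documents containing the term)
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=1)
idf = np.log(n_docs / doc_freq)

# tf-idf: each raw count weighted by the idf of its term.
tfidf = counts * idf[:, np.newaxis]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(tfidf[:, 0], tfidf[:, 1]))  # documents 1 and 2 share some terms
print(cosine(tfidf[:, 0], tfidf[:, 2]))  # documents 1 and 3 share none
```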
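The WordNet sketch walks the relations defined above (hypernym, hyponym, meronym, holonym, antonym) through nltk's WordNet interface; it assumes the WordNet data can be downloaded and that the first listed sense of each example word is the intended one.

```python
import nltk
nltk.download("wordnet")  # one-time download of the WordNet data
# Some nltk versions may also need: nltk.download("omw-1.4")

from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]          # first sense of "dog"
print(dog.hypernyms())              # more general concepts, e.g. canine
print(dog.hyponyms()[:5])           # more specific concepts, e.g. puppy

tree = wn.synsets("tree")[0]
print(tree.part_meronyms())         # parts of a tree, e.g. trunk

trunk = wn.synsets("trunk")[0]
print(trunk.part_holonyms())        # wholes a trunk is part of, e.g. tree

good = wn.synsets("good", pos=wn.ADJ)[0]
print(good.lemmas()[0].antonyms())  # the opposite lemma, e.g. bad
```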
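The edit-distance sketch also covers stemming and lemmatization, all with nltk; the example words are arbitrary.

```python
import nltk
nltk.download("wordnet")  # the WordNet lemmatizer needs the WordNet data

from nltk.metrics.distance import edit_distance
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Edit distance: the number of insertions, deletions and substitutions
# needed to turn one string into the other (unit cost per alteration).
print(edit_distance("kitten", "sitting"))        # 3

# Stemming: chop off affixes to reach a common core (not always a real word).
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "runs", "runner"]])

# Lemmatization: map each wordform to its dictionary form (lemma).
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("geese"))             # noun by default -> "goose"
print(lemmatizer.lemmatize("running", pos="v"))  # verb -> "run"
```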
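The PMI sketch counts skipgram windows over a toy token sequence and computes a simplified pointwise mutual information; the window length of 3 and the use of window counts as co-occurrence probabilities are choices made for the example.

```python
import math
from collections import Counter
from itertools import combinations

# Toy token sequence (made up for the example).
tokens = "the cat sat on the mat the cat ate".split()
window = 3  # skipgram window length

# Count individual tokens and co-occurrences of distinct token pairs per window.
unigram_counts = Counter(tokens)
pair_counts = Counter()
n_windows = len(tokens) - window + 1
for i in range(n_windows):
    window_tokens = set(tokens[i : i + window])
    for a, b in combinations(sorted(window_tokens), 2):
        pair_counts[(a, b)] += 1

def pmi(a, b):
    """Log of the observed co-occurrence probability divided by the product of
    the individual probabilities (what independence would predict).
    Assumes the pair (a, b) was observed at least once."""
    p_a = unigram_counts[a] / len(tokens)
    p_b = unigram_counts[b] / len(tokens)
    p_ab = pair_counts[tuple(sorted((a, b)))] / n_windows
    return math.log(p_ab / (p_a * p_b))

print(pmi("cat", "sat"))  # co-occur in 2 of 7 windows: higher PMI
print(pmi("cat", "mat"))  # co-occur in only 1 window: lower PMI
```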
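The chatbot sketch expresses a rule-based chatbot as an ordered list of regular-expression rules; the patterns and canned responses are invented for the example.

```python
import re

# Each rule pairs a regular expression with a canned response template.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE), "Hello! How can I help you?"),
    (re.compile(r"\bmy name is (\w+)", re.IGNORECASE), "Nice to meet you, {0}!"),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "Goodbye!"),
]

def respond(message):
    """Return the response of the first rule whose pattern matches the message."""
    for pattern, response in RULES:
        match = pattern.search(message)
        if match:
            # Captured groups can be reused inside the response template.
            return response.format(*match.groups())
    return "Sorry, I did not understand that."

print(respond("Hi there"))               # greeting rule
print(respond("my name is Ada"))         # rule with a captured group
print(respond("what is the weather?"))   # no rule matches: fallback answer
```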
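The speech sketch pairs text to speech (pyttsx3) with speech to text (the speechrecognition package); "sample.wav" is a placeholder file name, pyttsx3 needs a local audio device (so it may stay silent in Colab), and recognize_google sends the audio to a web API, which requires network access.

```python
# Text to speech: have the computer read a string out loud.
import pyttsx3

engine = pyttsx3.init()
engine.say("Hello, this is a text to speech test.")
engine.runAndWait()

# Speech to text: transcribe a recording from a WAV file.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:   # placeholder file name
    audio = recognizer.record(source)
print(recognizer.recognize_google(audio))    # free Google web speech API
```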