Natural Language Processing Fundamentals


  1. Fundamental Concepts
  2. Topic Detection
  3. Sentiment Analysis
  4. Tagging
  5. Word Networks
  6. Correction & Prediction
  7. Stemming
  8. Vectorization
  9. Chatbots
  10. Speech

Tools and libraries

For each library that requires installation, the parentheses indicate the sessions that use the package.

You can either run things in an online environment such as Google Colab or install both of these open-source tools on your own computer. Note that some packages (such as pandas and numpy) come pre-installed in the Colab Python environment but must be installed with pip if you set up your own environment.


R: Text Mining with R
Python: NLP with Python

In addition to the above books, which are available through the library, we use the freely available online versions of the following textbooks:

Data sets


The contents of each assignment are detailed on myCourses and are also available as a single file on Overleaf, so you can prepare your responses offline.


Session 1: Fundamental Concepts

  • token = a meaningful unit (of text)
  • tokenization = the process of extracting tokens from text
  • string = a data representation for a sequence of characters
  • metadata = tags or other data associated with a string or a token that describe its origin, meaning, or some other characteristic
  • corpus = a collection of textual data that contains strings, possibly with associated metadata
  • stopword = a word whose presence is deemed uninformative in a given context
  • term-document matrix = a matrix in which each row represents a document and each column represents a term, with the cells indicating the frequency of occurrence of each term in each document
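
The definitions above can be tied together in a minimal sketch that tokenizes a corpus, drops stopwords, and builds a term-document matrix. The two-sentence corpus, the stopword set, and the helper name `tokenize` are assumptions made up for illustration; real sessions would normally rely on a library tokenizer and stopword list.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
corpus = ["The cat sat on the mat.", "The dog chased the cat."]
stopwords = {"the", "on"}

def tokenize(text):
    """Lowercase the text and extract runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

# One Counter of term frequencies per document, with stopwords removed.
counts = [Counter(t for t in tokenize(doc) if t not in stopwords) for doc in corpus]

# Columns are the sorted vocabulary; each row is one document.
vocabulary = sorted(set().union(*counts))
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` corresponds to one document and each column to one term of `vocabulary`, matching the orientation given in the definition above.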

Session 2: Topic Detection

  • tf-idf = term frequency times inverse document frequency, a weighting that assigns lower weight to terms that are frequent across all of the documents; the idf of a term is the natural logarithm of the total number of documents divided by the number of documents that contain the term
  • LDA = Latent Dirichlet Allocation, a topic-modeling algorithm that represents a document as a mixture of topics and a topic as a mixture of words
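
As a rough illustration of the tf-idf weighting, the sketch below uses the raw term count as the term frequency and the natural-log idf described above. The toy documents and the function name `tf_idf` are assumptions; library implementations often apply additional smoothing or normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents, for illustration only.
docs = [["cat", "sat", "mat"], ["dog", "chased", "cat"], ["dog", "barked"]]

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({term: tf[term] * math.log(n_docs / df[term]) for term in tf})
    return weights

weights = tf_idf(docs)
print(weights[0])
```

In the first document, "cat" receives a lower weight than "sat" because "cat" also occurs in another document.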

Session 3: Sentiment Analysis

  • lexicon = a set of words, a vocabulary
  • unigram = a unit of language that is a single word
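
One simple way to use a lexicon for sentiment analysis is to sum per-word scores over the unigrams of a text. The miniature lexicon and the function name `sentiment_score` below are hypothetical and exist only for this sketch; established sentiment lexicons are much larger and may use different scales.

```python
# Hypothetical miniature sentiment lexicon: positive words score above 0,
# negative words below 0.
lexicon = {"good": 1, "great": 2, "bad": -1, "awful": -2}

def sentiment_score(text):
    """Sum the lexicon scores of the unigrams in the text; unknown words count as 0."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())

print(sentiment_score("a great movie with a bad ending"))
```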

Session 4: Tagging

  • part of speech (POS) = lexical category = word class = the "grammar classes" of words such as nouns, adverbs, verbs, adjectives, etc.
  • bigram = a two-word sequence
  • n-gram = a sequence of n words
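
Extracting n-grams from a token list can be sketched with a sliding window. The function name `ngrams` is chosen here for illustration, and the input is assumed to be already tokenized.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["to", "be", "or", "not"]
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

If the token list is shorter than n, the result is an empty list.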

Session 5: Word Networks

  • WordNet = a graph-format thesaurus of relationships between English words
  • hyponym = a more specific term for a word ("dog" is a hyponym of "animal")
  • hypernym = a more general term for a word ("animal" is a hypernym of "dog")
  • meronym = component of a concept
  • holonym = container of a concept
  • antonym = the counterpart of a word (the contrary version: vertical/horizontal, positive/negative)

Session 6: Correction & Prediction

  • edit distance = the total cost of alterations that need to be made on a string to convert it into another one
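
A common instance of edit distance is the Levenshtein distance, in which every insertion, deletion, and substitution costs 1. The sketch below assumes those unit costs; other cost schemes are possible, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs, computed row by row."""
    # previous[j] is the cost of turning the current prefix of source into target[:j];
    # it starts as the cost of building target[:j] from the empty string.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))
```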

Session 7: Stemming

  • stemming = removal of affixes (suffixes, mostly; sometimes prefixes) to cut down all variants of a word to their "common core"
  • normalization = a process of regularizing text in some way, such as making all of it lowercase
  • lemmatization = reducing each (conjugated, plural, capitalized, ...) word to the form in which it would appear in a dictionary
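
The difference between the three operations can be sketched with deliberately crude stand-ins. The suffix list, the lookup table `LEMMAS`, and all function names are hypothetical; this is not how established stemmers or lemmatizers work internally, only an illustration of what each step is meant to achieve.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lemma lookup table; real lemmatizers use full dictionaries.
LEMMAS = {"ran": "run", "mice": "mouse", "better": "good"}

def normalize(text):
    """Normalization: here, simply lowercase everything."""
    return text.lower()

def naive_stem(word):
    """Stemming: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def lemmatize(word):
    """Lemmatization: look up the dictionary form, defaulting to the word itself."""
    return LEMMAS.get(word, word)

print(naive_stem(normalize("Walking")))
print(naive_stem("running"))
print(lemmatize("mice"))
```

Note that the naive stemmer turns "running" into "runn", which is not a word: a stem is a common core rather than a dictionary form, which is what separates stemming from lemmatization.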

Session 8: Vectorization

  • cosine similarity = the dot product of two numerical vectors divided by the product of their norms
  • PMI = pointwise mutual information, a measure of how much more often two words appear together than one would expect if they occurred independently of each other
  • lemma = the "dictionary form" of a word
  • wordform = the specific variant of a word, such as the conjugated form, that may not be a lemma as such
  • polysemous = having more than one meaning
  • (word) embedding = a vectorization of a text that attempts to capture semantics based on word context, such as word2vec
  • skipgram window = a token subsequence of a determined length
  • skipgram probability = the probability (relative frequency) of two tokens appearing together in a skipgram window
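
Cosine similarity and PMI can both be written in a few lines. The sketch below assumes plain Python lists as vectors and takes the probabilities for PMI as given inputs (in practice they would be estimated from corpus counts); the base-2 logarithm is a common convention and an assumption here, and the function names are chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms.

    Undefined (division by zero) if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pmi(p_xy, p_x, p_y):
    """Log (base 2) of the ratio between the joint probability and the
    product of the individual probabilities."""
    return math.log2(p_xy / (p_x * p_y))

print(cosine_similarity([1, 2], [2, 4]))
print(cosine_similarity([1, 0], [0, 1]))
print(pmi(0.25, 0.5, 0.5))
```

A PMI of 0 means the two words co-occur exactly as often as independence would predict; positive values indicate that they co-occur more often than that.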

Session 9: Chatbots

  • reflection = a word pair in which one word replaces the other when the point of view of the speaker is reversed, so that the response stays consistent with the incoming message ("this is my dog", "your dog is cute")
  • rule-based chatbot = one that picks a response to an incoming message based on a set of rules, often expressed in terms of regular expressions
  • self-learning chatbot = one that uses machine learning to determine responses
  • bag of words = representing a part of text (sentence, document, etc.) as the set of words it contains (or a binary representation thereof)
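
A rule-based chatbot with reflections can be sketched using regular expressions from the standard library. The rules, response templates, reflection table, and function names below are all made up for illustration and do not reproduce any particular chatbot.

```python
import re

# Hypothetical reflection table: swaps the speaker's point of view.
REFLECTIONS = {"my": "your", "your": "my", "i": "you", "me": "you", "am": "are"}

# Hypothetical rules: (pattern, response template). The first matching rule wins.
RULES = [
    (re.compile(r"i feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"this is (.*)", re.IGNORECASE), "Tell me more about {0}."),
]

def reflect(phrase):
    """Replace each word that has a reflection; leave all other words unchanged."""
    return " ".join(REFLECTIONS.get(word, word) for word in phrase.lower().split())

def respond(message):
    """Return the response of the first rule that matches, or a default reply."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."

print(respond("This is my dog"))
print(respond("I feel sad about my job"))
print(respond("Hello"))
```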

Session 10: Speech

  • text to speech = having the computer read aloud a text given as input
  • speech to text = having the computer create a string from a recording (live or file) of spoken language

Other NLP Courses at McGill


