Now we are going to transform that string into a text that we can perform NLTK functions on. Since we already imported nltk at the beginning of our program, we don't need to import it again; we can just use its functions by specifying nltk before the function. The first step is to tokenize the words, transforming the giant string into a list of words. A simple way to do this would be to split on spaces, and that would probably be fine, but we are going to use the NLTK tokenizer to make sure edge cases are captured (e.g., "don't" is split into two tokens: "do" and "n't"). In the next cell, type:
don_tokens = nltk.word_tokenize(don)
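To see why we bother with the NLTK tokenizer rather than a plain split, you can compare the two on a short sentence of your own (the sample sentence below is just an illustration):

sample = "Don't judge a book by its cover."
print(sample.split())
# ["Don't", 'judge', 'a', 'book', 'by', 'its', 'cover.']
print(nltk.word_tokenize(sample))
# ['Do', "n't", 'judge', 'a', 'book', 'by', 'its', 'cover', '.']

Notice that the tokenizer separates the trailing period and breaks the contraction apart, which a split on spaces does not.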
You can check the type of don_tokens using the type() function to make sure it worked; it should be a list. Let's see how many words there are in our novel:
len(don_tokens)
Since this is a list, we can look at any slice of it that we want. Let's inspect the first ten words:
don_tokens[:10]
That looks like metadata, not the text we want to analyze, so we will strip it off before proceeding. If you were doing this to many texts, you would want to use regular expressions, which are an extremely powerful way to match text in a document (there is a short sketch of that approach below). Since we are only working with this one text, though, we can either guess or paste the text into a text editor and find the position of the first word of the actual novel. That is the route we are going to take. We found that the content begins at word 320, so let's make a slice of the text from word position 320 to the end.
dq_text = don_tokens[320:]
Now print the first 30 words to see if it worked:
print(dq_text[:30])
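For reference, here is roughly how you might locate the start of the content with a regular expression instead of counting words by hand. This is a minimal sketch that assumes the raw string in don is a Project Gutenberg file containing the standard "*** START OF ..." header line; adjust the pattern to whatever marker your texts actually use:

import re

# find the header line and tokenize only what comes after it
match = re.search(r"\*\*\* START OF .* \*\*\*", don)
if match:
    body = don[match.end():]
    dq_text_alt = nltk.word_tokenize(body)
    print(dq_text_alt[:30])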
Finally, if we want to use NLTK-specific functions such as concordance, similar, dispersion_plot, or others from the NLTK book, we have to make a specific NLTK Text object:
dq_nltk_text = nltk.Text(dq_text)
And we could check that it worked by running:
type(dq_nltk_text)
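Once we have a Text object, those NLTK methods become available on it. For example (the search words here are just illustrations, and dispersion_plot needs matplotlib installed):

dq_nltk_text.concordance('Dulcinea')
dq_nltk_text.similar('Sancho')
dq_nltk_text.dispersion_plot(['Sancho', 'Dulcinea'])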
But if we only need the built-in Python functions, we can just stick with our list of words in dq_text.
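Plain Python already gets us quite far with a list of words. For instance (the word counted below is just an example):

print(len(dq_text))              # total number of tokens
print(dq_text.count('Sancho'))   # how many times a single word appears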
Using the dq_text variable:
- Remove the stop words
- Remove punctuation
- Remove capitalization
- Lemmatize the words
If you want to spice up the challenge, do the first three operations inside a single loop using nested if statements. Google "python nested if statements" for examples.
- Lowercase, remove punctuation and stop words:

from nltk.corpus import stopwords
stops = stopwords.words('english')   # skip this line if stops is already defined from earlier in the workshop

dq_clean = []
for word in dq_text:
    if word.isalpha():
        if word.lower() not in stops:
            dq_clean.append(word.lower())
print(dq_clean[:50])
- Lemmatize:

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
dq_lemmatized = []
for t in dq_clean:
    dq_lemmatized.append(wordnet_lemmatizer.lemmatize(t))
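One thing to keep in mind: by default the WordNet lemmatizer treats every token as a noun, so verb forms pass through unchanged unless you tell it the part of speech. A quick illustration:

print(wordnet_lemmatizer.lemmatize('running'))            # 'running' (treated as a noun)
print(wordnet_lemmatizer.lemmatize('running', pos='v'))   # 'run'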
Check all sentences below that are correct:

- urlopen can save the contents of a webpage into a variable.*
- To use NLTK functions on a string, we can transform it into an NLTK Text object.*
- NLTK lets you tokenize (split) a giant string into a list of substrings, handling punctuation and edge cases like "don't".*
Do you remember the glossary terms from this section?