---
title: "Part 2: Annotating the Audio"
number-offset: [0, 2, 0]
---
<br>
There are two sets of data going into these vowel plots:
1. Vowels pulled from the Lingthusiasm episode recordings, which were located in [Part 1](1_find_words.qmd)
2. Vowels from Gretchen & Lauren recording the Wells lexical set for me
The next steps are to trim the words out of the episode audio files for #1, then annotate the vowels for both #1 and #2.
### Setup
```{python}
#| label: imports
"""Part 2 of Lingthusiasm Vowel Plots: Trimming Audio and Getting Vowel Formants."""
import glob # <1>
import os # <1>
import pandas as pd # <2>
from pytube import Playlist, YouTube # <3>
from pydub import AudioSegment # <4>
import parselmouth # <5>
```
1. File utilities.
2. Dataframes.
3. Getting captions and audio data from YouTube.
4. Working with audio files.
5. Interface with Praat.
Get video info from Lingthusiasm's all episodes [playlist](https://www.youtube.com/watch?v=xHNgepsuZ8c&list=PLcqOJ708UoXQ2wSZelLwkkHFwg424u8tG):
```{python}
#| label: video-list
video_list = Playlist('https://www.youtube.com/watch?v=xHNgepsuZ8c&' +
                      'list=PLcqOJ708UoXQ2wSZelLwkkHFwg424u8tG')
```
Go through each video and download audio (if not already downloaded):
```{python}
#| label: download-audio
def get_audio(videos):
    """Download episode audio from Youtube."""
    for url in videos:
        video = YouTube(url)  # <1>
        video.bypass_age_gate()  # <2>
        title = video.title  # <3>
        episode = int(title[:2])  # <3>
        audio_file_name = os.path.join(  # <4>
            'audio', 'episodes', f'{episode}.mp4')
        if not os.path.isfile(audio_file_name):  # <5>
            audio_stream = video.streams.get_audio_only()  # <5>
            print(f'downloading {episode}')  # <5>
            audio_stream.download(filename=audio_file_name)  # <5>


get_audio(video_list)
```
1. Go through the list of video URLs and open each one as a `YouTube` object.
2. Need to include this to download data.
3. The video title is an attribute of the `YouTube` object, and the episode number is read from the first two characters of the title.
4. Create file name for episode audio.
5. If file is not already downloaded, select and download the highest-quality audio-only stream.
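
The slice `title[:2]` assumes every title begins with a two-character episode number. As a hedged alternative, here is a minimal sketch that reads however many leading digits the title has. The helper name `episode_number` and the example title are made up for illustration and are not part of the original pipeline.

```python
import re


def episode_number(title):
    """Return the leading episode number of a title, or None if there is none."""
    match = re.match(r'\s*(\d+)', title)  # leading digits, ignoring whitespace
    return int(match.group(1)) if match else None


# Hypothetical title, for illustration only
print(episode_number('07: An example episode title'))
```

Returning `None` rather than raising lets the caller decide whether to skip videos (such as trailers) whose titles do not start with a number.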
### Trim Audio from Episodes
Open the `timestamps` data from Part 1:
```{python}
#| label: open-timestamps
timestamps = pd.read_csv(
    'data/timestamps_annotate.csv',
    usecols=[
        'Vowel', 'Word', 'Speaker', 'Number',  # <1>
        'Episode', 'Start', 'End'  # <2>
    ],
    dtype={  # <3>
        'Vowel': 'category', 'Word': 'category', 'Speaker': 'category',  # <3>
        'Number': 'category', 'Episode': 'category',  # <3>
        'Start': 'int', 'End': 'int'  # <3>
    }  # <3>
)
timestamps['Speaker'] = timestamps['Speaker'] \
    .str.replace('retchen', '').str.replace('auren', '') \
    .astype('category')  # <4>
```
1. Keep columns specifying word variables.
2. And keep columns specifying where audio is.
3. Make all columns categorical variables, except the `Start` and `End` times (integers).
4. Convert values in `Speaker` column from names to initials.
Trim audio for the duration of the caption (with 250 ms before and after). This results in ~240 audio files, each 2-10 sec long and each containing a target word.
```{python}
#| label: trim-audio
def trim_audio(df):
    """Use caption timestamps to trim audio."""
    for i in df.index:  # <1>
        episode = df.loc[i, 'Episode']  # <1>
        word = df.loc[i, 'Word']  # <1>
        speaker = df.loc[i, 'Speaker']  # <1>
        count = df.loc[i, 'Number']  # <1>
        out_file = os.path.join(  # <2>
            'audio', 'words', f'episode_{word}_{speaker}_{count}.wav')
        if not os.path.isfile(out_file):  # <2>
            in_file = os.path.join('audio', 'episodes', f'{episode}.mp4')  # <3>
            audio = AudioSegment.from_file(in_file, format='mp4')  # <3>
            start = max(df.loc[i, 'Start'] - 250, 0)  # <4>
            end = min(len(audio), df.loc[i, 'End'] + 250)  # <4>
            clip = audio[start:end]  # <4>
            clip.export(out_f=out_file, format='wav')  # <4>


trim_audio(timestamps)
```
1. Go through dataframe that has the example words to annotate and their timestamps.
2. Make file name for current word, and if it does not already exist...
3. Open the audio file for the whole episode.
4. Trim the episode audio to start 250 ms before the caption's start timestamp and end 250 ms after its end timestamp (clamped to the bounds of the episode), then save the clip.
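
To make the padding and clamping concrete, here is a small standalone sketch of the same arithmetic. The helper name `clip_window` and the numbers are illustrative only. All times are in milliseconds, which is the unit `pydub` uses for slicing and for `len(audio)`.

```python
def clip_window(start_ms, end_ms, audio_len_ms, pad_ms=250):
    """Pad a caption window on both sides, clamped to the audio bounds."""
    start = max(start_ms - pad_ms, 0)         # never before the start of the file
    end = min(end_ms + pad_ms, audio_len_ms)  # never past the end of the file
    return start, end


# A caption in the first 250 ms of a (made-up) one-minute file
print(clip_window(100, 2_000, 60_000))      # (0, 2250)
# A caption that ends in the last 250 ms
print(clip_window(58_000, 59_900, 60_000))  # (57750, 60000)
```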
### Wells Lexical Set
The Wells lexical set is a set of examples for each vowel/diphthong, chosen to be maximally distinguishable and consistent. You can read more about it on [Wikipedia](https://en.wikipedia.org/wiki/Lexical_set#Standard_lexical_sets_for_English) and [John Wells' blog](https://phonetic-blog.blogspot.com/2010/02/lexical-sets.html). These recordings are going to be more controlled than the vowels pulled from the episode recordings and easier to annotate because they're spoken more slowly and carefully. This set contains some fairly low-frequency words, which is why there's not a lot of overlap with the words pulled from the episodes.
[![The Wells lexical set](resources/wells_lexical_set.jpg)](http://2.bp.blogspot.com/_RSOXNV65lN0/S2a13vcLBAI/AAAAAAAAAYg/RQo2sbM7cqM/s1600-h/sets.jpg)
```{python}
#| label: list-wells-lexical-set
wells_lexical_set = {  # <1>
    '\u0069': 'fleece',  # i
    '\u026A': ['kit', 'near'],  # ɪ
    '\u025B': ['dress', 'square'],  # ɛ
    '\u00E6': ['trap', 'bath'],  # æ
    '\u006F': ['force', 'goat'],  # o
    '\u0075': 'goose',  # u
    '\u028A': ['cure', 'foot'],  # ʊ
    '\u0254': ['cloth', 'north', 'thought'],  # ɔ
    '\u0251': ['lot', 'palm', 'start'],  # ɑ
    '\u028C': 'strut'  # ʌ
}
wells_lexical_set = pd.DataFrame.from_dict(wells_lexical_set, orient='index') \
    .rename(columns={0: 'Word'}) \
    .explode('Word') \
    .reset_index(names='Vowel')  # <2>
```
1. Dictionary where the keys are the IPA vowels (as Unicode escapes) and the values are the example word(s).
2. Convert to dataframe with columns for `Vowel` and `Word`.
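
The `explode` step turns each vowel with several words into one row per word. As a rough illustration of that reshaping without pandas, the sketch below flattens a small excerpt of the dictionary into `(vowel, word)` pairs. The helper name `flatten_lexical_set` is hypothetical and is not used elsewhere in this document.

```python
def flatten_lexical_set(lexical_set):
    """Return one (vowel, word) pair per word, like DataFrame.explode."""
    rows = []
    for vowel, words in lexical_set.items():
        if isinstance(words, str):  # a single word is stored as a bare string
            words = [words]
        for word in words:
            rows.append((vowel, word))
    return rows


# Excerpt of the dictionary above
excerpt = {'\u0069': 'fleece', '\u026A': ['kit', 'near']}
print(flatten_lexical_set(excerpt))
```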
Here's the full set of words for each vowel:
```{python}
#| label: all-words-list
word_list = pd.concat([  # <1>
    pd.DataFrame({  # <2>
        'List': 'lexicalset',  # <2>
        'Vowel': wells_lexical_set['Vowel'],  # <2>
        'Word': wells_lexical_set['Word']  # <2>
    }),  # <2>
    timestamps[['Vowel', 'Word']].drop_duplicates()  # <3>
])
word_list = word_list.fillna('episode')  # <4>
word_list = word_list.sort_values(by=['Vowel', 'Word'])  # <5>
word_list = word_list.reset_index(drop=True)  # <5>
word_list.style.hide()  # <6>
```
1. Combine word lists from episodes and Wells lexical set.
2. Dataframe for Wells lexical set with columns for `List`, `Vowel`, and `Word`.
3. Subset dataframe for episode word list, with columns for `Vowel` and `Word`.
4. Fill the `NA` values of `List` with `episode`.
5. Sort and reset index.
6. Print, not including index.
### An Interlude in Praat
Now, we have about 400 audio clips of Gretchen and Lauren saying words that are good examples of each vowel. The vowel data that will actually be going into the plots is the lowest two formants: F1 and F2 (see the bonus episode about vowel plots and the main episode about vowels for an explanation of what formants are). The easiest way to calculate the vowel formants is using Praat, software designed for doing phonetic analysis.
Here's an example of what that looked like:
![Praat screenshot of Lauren saying "pit."](resources/praat_screenshot.png)
The vowel [ɪ] is highlighted in pink, and you can see it's darker on the spectrogram than the consonants [k] and [t] before and after it. I placed an annotation (the blue lines, which Praat calls a "point tier") right in the middle of the vowel sound---this one is pretty easy, because Lauren was speaking slowly and without anyone overlapping, so the vowel sound is long and clear.
The formants are the lines of red dots, and the popup window is the formant values at the vowel annotation time. We'll be using F1 (the bottom one) and F2 (the second from the bottom).
You can download Praat and see the documentation [here](https://www.fon.hum.uva.nl/praat/). It's fairly old, so a con is that the interface isn't necessarily intuitive if you're used to modern programs, but a pro is that there are a ton of resources available for learning how to use it. [Here's a tutorial](https://aletheiacui.github.io/tutorials/segmentation_with_praat.html) about getting started with Praat, and [here's one](https://home.cc.umanitoba.ca/~krussll/phonetics/practice/praat.html) for recording your own audio and calculating the formants in it.
After going through all of the audio clips, I had a .TextGrid file (Praat's annotation file format) for each audio clip that includes the timestamp for the middle(ish) of the vowel. You can copy formant values manually out of Praat, or you can use Praat scripting to export them to a CSV file (see [this tutorial](https://joeystanley.com/blog/a-tutorial-on-extracting-formants-in-praat/), for example). But I prefer to go back to Python instead of wrangling Praat scripting code.
### Read Annotation Data
There are packages that read Praat TextGrid files, but I kept getting errors. Luckily, the TextGrids for these annotations are simple text files, and we only need to extract one variable (the time of the point in the point tier). These two functions do that:
```{python}
#| label: functions-textgrids-1
def get_tier(text):
    """Get annotation info from Praat TextGrid."""
    tg = text.split('class = "TextTier"')[1]  # <1>
    tg = tg.splitlines()[1:]  # <2>
    tg = pd.Series(tg)  # <2>
    tg = tg.str.partition('=')  # <3>
    tg.drop(columns=1, inplace=True)  # <3>
    tg.rename(columns={0: 'Name', 2: 'Value'}, inplace=True)  # <3>
    tg['Name'] = tg['Name'].str.strip()  # <4>
    tg['Value'] = tg['Value'].str.strip()  # <4>
    tg.set_index('Name', inplace=True)  # <5>
    return tg['Value'].to_dict()  # <5>


def get_point_tier_time(t):
    """Get time from TextGrid PointTier."""
    tg = get_tier(t)  # <6>
    time = tg['number']  # <7>
    time = float(time)  # <8>
    return round(time, 4)  # <8>
```
1. The section we need in each TextGrid file starts with this string.
2. Split string by line breaks and convert to pandas series.
3. Split into columns by `=` character, where the first column is the variable name, the second column is `=` (and gets dropped), and the third column is the variable value.
4. Remove extra whitespace.
5. Make `Name` into index, so the dataframe can be converted to a dictionary.
6. Parse the TextGrid text using the function defined immediately above.
7. The variable we want (the timestamp for the PointTier annotation) is called `number`.
8. Convert the time from character to numeric and round to 4 digits.
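
To show what these functions are picking apart, here is a hedged, pandas-free sketch run against a hand-written TextGrid. The sample text is a simplified illustration of Praat's long text format, not one of the actual annotation files, and `point_time` is a hypothetical stand-in for the two functions above. It assumes a single point in the tier; if there were several, the last `number` would win.

```python
# Hand-written, simplified TextGrid with one point tier (illustrative only)
SAMPLE_TEXTGRID = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 2.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "TextTier"
        name = "vowel"
        xmin = 0
        xmax = 2.5
        points: size = 1
        points [1]:
            number = 1.23456789
            mark = ""
'''


def point_time(text):
    """Return the point time that follows the TextTier header, rounded to 4 digits."""
    tier = text.split('class = "TextTier"')[1]  # everything after the tier header
    fields = {}
    for line in tier.splitlines()[1:]:
        name, _, value = line.partition('=')    # 'number = 1.23' -> name, '=', value
        fields[name.strip()] = value.strip()
    return round(float(fields['number']), 4)    # KeyError if the tier has no point


print(point_time(SAMPLE_TEXTGRID))  # 1.2346
```

A tier with no annotation has no `number` line, so the lookup raises `KeyError`, which is the case the `try`/`except` in the next code chunk handles.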
These functions cycle through the list of TextGrid files, extract the point tier times, and put them into a dataframe with the rest of the information for each word:
```{python}
#| label: functions-textgrids-2
def read_textgrid_times(file_list, word_list):
    """Read textgrid files into dataframe."""
    tg_times = []
    for file_name in file_list:
        with open(file_name, encoding='utf-8') as t_file:
            t = t_file.read()
        try:
            tg = get_point_tier_time(t)  # <1>
        except KeyError:
            tg = None
        tg_times.append(tg)
    df = pd.DataFrame({'File': file_list, 'Vowel_Time': tg_times})  # <2>
    # os.path.basename handles the path separator of whichever OS this runs on
    df['File'] = df['File'].map(os.path.basename)  # <3>
    df['File'] = df['File'].str.removesuffix('.TextGrid')  # <3>
    return textgrid_vars(df, word_list)  # <4>


def textgrid_vars(df, word_list):
    """Format df of vowel timestamps."""
    df['List'] = df['File'].str.split('_', expand=True)[0]  # <5>
    df['Word'] = df['File'].str.split('_', expand=True)[1]  # <5>
    df['Speaker'] = df['File'].str.split('_', expand=True)[2]  # <5>
    df['Count'] = df['File'].str.split('_', expand=True)[3]  # <5>
    df = pd.merge(df, word_list, how='left', on=['Word', 'List'])  # <6>
    return df[['List', 'Vowel', 'Word', 'Speaker', 'Count', 'Vowel_Time']]  # <7>
```
1. Try to get timestamp from PointTier annotation for each TextGrid file, using function defined in previous code chunk.
2. Put results into a dataframe.
3. Remove the path prefix and type suffix from the file names, leaving just the `list_word_speaker_number` format.
4. Get other variables from `File`, using function defined immediately below.
5. When `File` is split by `_`, `List` (`episode` or `lexicalset`) is the first item, `Word` is the second item, `Speaker` (`G` or `L`) is the third item, and `Count` is the fourth item in the resulting list.
6. Merge with the word list dataframe to add a column for `Vowel` by matching on `Word` and `List`.
7. Organize.
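
As a standalone illustration of the file-name convention, the sketch below splits one path into its four parts. The helper name `parse_clip_name` and the example paths are hypothetical. It uses `PureWindowsPath` because that class accepts both `\` and `/` as separators, and it assumes no word contains an underscore.

```python
from pathlib import PureWindowsPath


def parse_clip_name(path):
    """Split 'list_word_speaker_count.TextGrid' into its four named parts."""
    stem = PureWindowsPath(path).stem  # drop the directories and the extension
    list_name, word, speaker, count = stem.split('_')
    return {'List': list_name, 'Word': word, 'Speaker': speaker, 'Count': count}


# Illustrative paths with Windows-style and POSIX-style separators
print(parse_clip_name('audio\\words\\episode_kit_L_1.TextGrid'))
print(parse_clip_name('audio/words/lexicalset_fleece_G_2.TextGrid'))
```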
Make list of TextGrid files and read PointTier times:
```{python}
#| label: read-textgrid-times
tg_list = glob.glob(os.path.join('audio', 'words', '*.TextGrid')) # <1>
formants = read_textgrid_times(tg_list, word_list) # <2>
pd.concat([formants.head(), formants.tail()]).style.hide() # <3>
```
1. Get a list of all `.TextGrid` files in the `audio/words/` directory.
2. Get the data from each TextGrid file, using functions defined in previous two code chunks.
3. Get the first and last 5 rows, then display as a table, not including row numbers.
### Calculate Formants
It's possible to export the formants from Praat, but I think it's easier to use the [parselmouth package](https://parselmouth.readthedocs.io/en/stable/index.html) here, which runs Praat from Python.
```{python}
#| label: function-calculate-formants
def get_formants(df):
    """Get F1 and F2 at specified time."""
    for i in df.index:
        file = os.path.join(  # <1>
            'audio', 'words',  # <1>
            (df.loc[i, 'List'] + '_' + df.loc[i, 'Word'] + '_' +  # <1>
             df.loc[i, 'Speaker'] + '_' + df.loc[i, 'Count'] + '.wav')  # <1>
        )
        audio = parselmouth.Sound(file)  # <2>
        formants = audio.to_formant_burg(  # <3>
            time_step=0.01, max_number_of_formants=5  # <3>
        )  # <3>
        if 'fleece_L_2' in file:  # <4>
            formants = audio.to_formant_burg(  # <4>
                time_step=0.01, max_number_of_formants=4  # <4>
            )  # <4>
        vowel_time = df.loc[i, 'Vowel_Time']  # <5>
        df.loc[i, 'F1'] = formants.get_value_at_time(1, vowel_time)  # <6>
        df.loc[i, 'F2'] = formants.get_value_at_time(2, vowel_time)  # <6>
    return df
```
1. Reconstruct the current audio file name from the `List` + `Word` + `Speaker` + `Count` variables.
2. Use the parselmouth package to open the audio file.
3. Call Praat via parselmouth and calculate the formants. These are the default settings: every 0.010 seconds, up to 5 formants.
4. Because of artefacts in the recording, this file needs a limit of 4 formants to identify them correctly.
5. Get the timestamp of the vowel for the current audio file.
6. Get F1 and F2 at the specified time from the parselmouth formant object.
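
If more clips turned out to need their own formant limit, one possible way to keep the exceptions in one place is a small lookup table. This is only a sketch: `FORMANT_OVERRIDES` and `max_formants_for` are hypothetical names, the clip names in the usage lines are illustrative, and the single entry mirrors the `fleece_L_2` special case above. It matches on the end of the clip name, so a clip numbered 20 would not be caught by accident.

```python
# Hypothetical override table: clip-name suffix -> maximum number of formants
FORMANT_OVERRIDES = {'fleece_L_2': 4}


def max_formants_for(clip_name, default=5):
    """Return the formant limit for a clip name (given without its extension)."""
    for suffix, limit in FORMANT_OVERRIDES.items():
        if clip_name.endswith(suffix):
            return limit
    return default


print(max_formants_for('lexicalset_fleece_L_2'))  # 4
print(max_formants_for('episode_kit_G_1'))        # 5
```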
Calculate the formants and summarize the results:
```{python}
#| label: calculate-formants
formants = get_formants(formants) # <1>
pd.concat([formants.head(), formants.tail()]).style.hide() # <2>
```
1. Calculate vowel formants for each word, using function defined in previous code chunk.
2. Get the first and last 5 rows, then display as a table, not including row numbers.
```{python}
#| label: formant-summary
formants.groupby(['Vowel', 'Speaker']) \
    .agg({'F1': ['min', 'mean', 'max'], 'F2': ['min', 'mean', 'max']}) \
    .style \
    .format(precision=0)  # <1>
```
1. Group the data by `Vowel` and `Speaker`, then calculate the min, mean, and max of `F1` and `F2` for each `Vowel` + `Speaker` combination. Print the results as a table, with values rounded to whole numbers.
Save results as `data/formants.csv`:
```{python}
#| label: save-formant-results
formants.to_csv('data/formants.csv', index=False)
```