---
title: "Part 1: Finding Vowels to Plot"
number-offset: [0, 1, 0]
---
<br>
There are going to be two sources of data, one more naturalistic and one more controlled:
1. Vowels pulled from the Lingthusiasm episode recordings
2. Vowels from Gretchen & Lauren recording the [Wells lexical set](https://en.wikipedia.org/wiki/Lexical_set)
The first step is to find the words for #1; we'll come back to #2 later.
### Setup
```{python}
#| label: imports
"""Part 1 of Lingthusiasm Vowel Plots: Finding Vowels to Annotate."""
import re # <1>
import pandas as pd # <2>
import requests # <3>
from bs4 import BeautifulSoup # <3>
from thefuzz import process # <4>
from pytube import Playlist, YouTube # <5>
```
1. Regex functions.
2. Dataframe functions.
3. Scraping data from webpages.
4. Fuzzy string matching.
5. Getting captions and audio from YouTube.
(Note: All of this could also be done in R, but I've done it in Python here primarily because there are Python packages that can get data from YouTube without having to set up anything on the YouTube API side. Here, all you need to do is install the package.)
### Get Transcript Text & Speakers
#### List Episodes
The Lingthusiasm website has a [page](https://lingthusiasm.com/transcripts) listing all of the available transcripts. Step 1 is to load that page and get the list of URLs to the transcripts. (They have similar but not identical structures, so it's easiest to read the list from the website instead of trying to construct them here.)
This function uses the [BeautifulSoup package](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to return an HTML object and a text string for a URL:
```{python}
#| label: function-get-html-text
def get_html_text(url):
"""Use BeautifulSoup to get the webpage text from the URL."""
resp = requests.get(url, timeout=1000) # <1>
html = BeautifulSoup(resp.text, 'html.parser') # <2>
return {'html': html, 'text': html.get_text()} # <3>
```
1. Connect to webpage.
2. Load the HTML data from the webpage.
3. Return the HTML data object and the text from the HTML data object.
This function uses BeautifulSoup to filter just the transcript URLs from the HTML data:
```{python}
#| label: function-get-transcript-links
def get_transcript_links(url):
"""Get URLs to episode transcripts from HTML page."""
html = get_html_text(url)['html'] # <1>
url_objs = html.find_all('a') # <2>
urls = [l.get('href') for l in url_objs] # <3>
urls = pd.Series(data=urls, name='URL', dtype='string') # <4>
urls = urls[urls.str.contains('transcript')] # <5>
urls = urls[::-1] # <6>
return urls.reset_index(drop=True) # <7>
```
1. Get HTML data from webpage using function defined above.
2. Filter using the `a` tag to get all of the link items.
3. Get the `href` item from each `a` tag item, which is the text of the URL.
4. Convert from list to pandas series.
5. Filter to only include URLs including the word `transcript`.
6. Sort to have earliest episodes first.
7. Return with index (row numbers) reset.
There are 84 episodes available (as of early October 2023):
```{python}
#| label: get-links
transcript_urls = get_transcript_links('https://lingthusiasm.com/transcripts') # <1>
transcript_urls.head().to_frame().style # <2>
```
1. Get the URLs for the episode transcripts from the table of contents page, using the functions defined above.
2. Show the first few rows (`.head()` defaults to 5) of the resulting list, printed as a nice table.
Only keep episodes 1-84, for consistency when replicating this code later:
```{python}
#| label: subset-links
transcript_urls = transcript_urls[:84]
```
#### Scrape Transcript Text
Now we can download transcripts, which are split into turns and labelled by speaker.
Splitting some of the sub-tasks into separate functions: this one gets the episode number (the integer immediately following `episode-`) from the URL:
```{python}
#| label: function-transcript-number-from-url
def transcript_number_from_url(url):
"""Find transcript number in URL."""
index_episode = url.find('episode-') # <1>
cur_number = url[index_episode + 8:] # <2>
index_end = cur_number.find('-') # <3>
if index_end != -1:
cur_number = cur_number[:index_end]
return int(cur_number) # <4>
```
1. Find location of `episode-`, since episode number is immediately following it.
2. Subset the URL starting 8 characters (the length of `episode-`) past where `episode-` starts, which is right where the number begins.
3. Most of the transcript URLs have more text after the number. If so, trim to just keep the number.
4. Return episode number converted from string to integer.
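As a quick sanity check, here's the function applied to a made-up URL with the same shape as the real transcript links:
```{python}
#| label: demo-episode-number
transcript_number_from_url('https://lingthusiasm.com/transcripts/episode-12-example-title')  # returns 12
```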
The text returned from the transcript pages has information at the top and bottom that we don't need. `"This is a transcript for"` marks the start of the transcript, and `"This work is licensed under"` marks the end, so trim the text to only include the section between them:
```{python}
#| label: function-trim-transcript
def trim_transcript(text):
    """Find start and end of transcript text."""
    start_index = text.find('This is a transcript for')
    end_index = text.find('This work is licensed under')
    return text[start_index:end_index]
```
This function cleans up the text column so it plays a bit nicer with Excel CSV export:
```{python}
#| label: function-clean-text-col
def clean_text(l):
"""Clean text column so it opens correctly as Excel CSV."""
l = l.strip() # <1>
l = l.replace('\u00A0', ' ').replace('–', '--').replace('…', '...') # <2>
l = re.sub('“|”|‘|’', "'", l) # <3>
return ' '.join(l.split()) # <4>
```
1. Remove leading and trailing whitespaces.
2. Replace non-breaking spaces, en dashes, and ellipses.
3. Replace slanted quotes.
4. Collapse any runs of extra whitespace between words into single spaces (see the example below).
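For example, on a made-up string with curly quotes, a non-breaking space, an en dash, and an ellipsis:
```{python}
#| label: demo-clean-text
clean_text('“Vowels”  are\u00A0fun – really…')  # -> "'Vowels' are fun -- really..."
```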
After a bit of trial and error, it's easiest to split on the speaker names, since the paragraph formatting isn't identical across all the pages:
```{python}
#| label: speaker-names
speaker_names = [
'Gretchen', 'Lauren', 'Ake', 'Bona', 'Ev', 'Fei Ting', 'Gabrielle',
'Jade', 'Hannah', 'Hilaria', 'Janelle', 'Kat', 'Kirby', 'Lina', 'Nicole',
'Pedro', 'Randall', 'Shivonne', 'Suzy'
]
speaker_regex = '(' + ':)|('.join(speaker_names) + ':)' # <1>
```
1. Regex looks like `(Name1:)|(Name2:)` etc. Surrounding names in parentheses makes it a capture group, so the names are included in the list of split items instead of dropped.
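To see what that split produces, here it is on a made-up snippet (not from a real transcript). After dropping the `None` entries and the text before the first label, speaker labels alternate with turn text, which is what the parsing function below relies on:
```{python}
#| label: demo-speaker-split
demo = ('This is a transcript for a made-up episode. '
        'Gretchen: Welcome to the show. Lauren: Thanks for having me.')
parts = [p for p in re.split(speaker_regex, demo) if p is not None]  # drop unmatched capture groups
parts = parts[1:]  # drop the text before the first speaker label
print(parts[::2])   # speaker labels
print(parts[1::2])  # turn text
```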
This function puts it all together to read the transcripts into one dataframe:
```{python}
#| label: function-parse-transcripts
def get_transcripts(urls):
"""Get transcript text from URLs and format into dataframe."""
df = [] # <1>
for l in urls:
cur_text = get_html_text(l)['text'] # <2>
cur_text = trim_transcript(cur_text) # <3>
cur_lines = re.split(speaker_regex, cur_text) # <4>
cur_lines = [l for l in cur_lines if l is not None] # <5>
cur_lines = cur_lines[1:] # <6>
speakers = cur_lines[::2] # <7>
turns = cur_lines[1::2] # <7>
cur_df = pd.DataFrame({ # <8>
'Episode': transcript_number_from_url(l),
'Speaker': speakers,
'Text': [clean_text(line) for line in turns],
})
cur_df['Speaker'] = cur_df['Speaker'].str.removesuffix(':') # <8>
cur_df['Turn'] = cur_df.index + 1 # <9>
cur_df = cur_df.set_index(['Episode', 'Turn'], drop=True) # <10>
df.append(cur_df) # <11>
df = pd.concat(df) # <12>
df['Speaker'] = df['Speaker'].astype('category') # <13>
df['Text'] = df['Text'].astype('string')
return df
```
1. Make list to store results.
2. Read text string (vs HTML object) from URL, using function defined above.
3. Trim to only include transcript section, using function defined above.
4. Split string into list with one item for each speaker turn, using regex defined above.
5. Drop items in list that are `None` (which are included because of capture group syntax).
6. Drop the initial "This is a transcript for..." line.
7. Now the list alternates between speaker labels and turn text: `[::2]` (every 2nd item starting at index 0) gives the speaker labels, and `[1::2]` (every 2nd item starting at index 1) gives the turn text.
8. Clean up strings and put everything into a dataframe.
9. Make column for turn number, which is index + 1.
10. Set `Episode` and `Turn` columns as indices.
11. Add to list of parsed episodes.
12. Combine list of dataframes into one dataframe. This works because the `Episode` + `Turn` index combination is unique.
13. Set datatypes explicitly, to avoid warnings later.
This takes a minute or so to run:
```{python}
#| label: get-transcript-text
transcripts = get_transcripts(transcript_urls) # <1>
pd.concat([transcripts.head(), transcripts.tail()]).style # <2>
```
1. Run `get_transcripts()` on the lists of links to each transcript, defined above.
2. Show the first 5 and last 5 rows of the resulting dataframe. `.style` prints it as a nicer-looking table.
Save results to `data/transcripts.csv`:
```{python}
#| label: save-transcripts
transcripts.to_csv('data/transcripts.csv', index=True)
```
### Find Words For Each Vowel
#### Which Words To Use?
These words aren't all as controlled as you'd probably want for an experiment, but they're frequent enough that there are multiple tokens of each one in the episodes.
```{python}
#| label: word-list
word_list = { # <1>
'\u0069': { # i # <2>
'beat': r'\bbeat\b', # <3>
'believe': r'\bbelieve\b',
'people': r'\bpeople\b'},
'\u026A': { # ɪ
'bit': r'\bbit\b',
'finish': r'\bfinish', # <4>
'pin': r'\bpin\b'},
'\u025B': { # ɛ
'bet': r'\bbet\b',
'guest': r'\bguest', # <5>
'says': r'\bsays\b'},
'\u00E6': { # æ
'bang': r'\bbang\b',
'hand': r'\bhand\b',
'laugh': r'\blaugh\b'},
'\u0075': { # u
'blue': r'\bblue\b',
'through': r'\bthrough\b',
'who': r'\bwho\b'},
'\u028A': { # ʊ
'could': r'\bcould\b',
'foot': r'\bfoot\b',
'put': r'\bput\b'},
'\u0254': { # ɔ
'bought': r'\bbought\b',
'core': r'\bcore\b',
'wrong': r'\bwrong\b'},
'\u0251': { # ɑ
'ball': r'ball\b|\bballs\b',
'father': r'\bfather\b',
'honorific': r'\bhonorific'}, # <6>
'\u028C': { # ʌ
'another': r'\banother\b',
'but': r'\bbut\b',
'fun': r'\bfun\b'},
'\u0259': { # ə
'among': r'\bamong\b',
'famous': r'\bfamous\b',
'support': r'\bsupport\b'}
}
```
1. Set up a nested dictionary, where the outer keys are the IPA vowels, the inner keys are the words, and the values are the regex strings for the words.
2. `\uXXXX` is a Unicode escape, so each outer key is the IPA vowel character itself.
3. `\b` matches at a word boundary. Most patterns have `\b` on both sides, so only the exact word matches (e.g., only `bit`, not `bits`); a few leave the end open to also catch suffixed forms (e.g., `finish`, `finishes`, `finishing`). See the quick check after this list.
4. Also get `finishes`/`finishing`.
5. Also get `guests`.
6. Also get `honorifics`. Using `honorific` instead of `honor` because there's an episode about honorifics.
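A quick check of the boundary behaviour on made-up snippets, using the patterns from `word_list` above:
```{python}
#| label: demo-word-boundary
bit = word_list['\u026A']['bit']        # r'\bbit\b'
finish = word_list['\u026A']['finish']  # r'\bfinish'
print(bool(re.search(bit, 'quite a bit of detail', re.IGNORECASE)))     # True: exact word
print(bool(re.search(bit, 'a little bite of cake', re.IGNORECASE)))     # False: 'bite' is not 'bit'
print(bool(re.search(finish, 'finishing the episode', re.IGNORECASE)))  # True: open-ended pattern
```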
#### Find Words In Transcript
This function goes through the transcript dataframe and finds turns by `Gretchen` or `Lauren` containing the target words:
```{python}
#| label: function-filter-words
def filter_for_words(words, df_all):
"""Find turns by Gretchen/Lauren that contain the target words."""
df_gl = df_all[ # <1>
(df_all['Speaker'] == "Gretchen") |
(df_all['Speaker'] == "Lauren")
]
df_words = pd.DataFrame() # <2>
for vowel, examples in words.items(): # <3>
for word, reg in examples.items(): # <4>
has_word = df_gl[ # <5>
df_gl['Text'].str.contains(pat=reg, flags=re.IGNORECASE)
]
has_word.insert(0, 'Vowel', vowel) # <6>
has_word.insert(1, 'Word', word) # <6>
has_word.set_index(['Word', 'Vowel'], inplace=True, append=True) # <6>
has_word.index = has_word.index.reorder_levels( # <6>
['Vowel', 'Word', 'Episode', 'Turn'])
df_words = pd.concat([df_words, has_word]) # <7>
return df_words.sort_index() # <8>
```
1. Filter to only include rows where `Speaker` is `Gretchen` or `Lauren`, not guests.
2. Create dataframe to store results.
3. Loop through outer layer of dictionary, where keys are vowels and values are lists of example words.
4. Loop through inner dictionaries, where keys are the words and values are the regexes.
5. Filter to only include rows where `Text` column matches word regex. `re.IGNORECASE` means the search is not case-sensitive.
6. Make columns for the current `Vowel` and `Word`, then add them to the index. The results dataframe is now indexed by unique combinations of `Vowel`, `Word`, `Episode`, and `Turn`.
7. Add results for current word to dataframe of all results.
8. Sorting the dataframe indices makes search faster later, and pandas will throw warnings if you search this large-ish dataframe before sorting.
Next, this function trims the conversation turn around the target word, so it can be matched to the caption timestamps later:
```{python}
#| label: function-trim-turns
def trim_turn(df):
"""Find location of target word in conversation turn and subset -/+ 25 characters around it."""
df['Text_Subset'] = pd.Series(dtype='string') # <1>
for vowel, word, episode, turn in df.index: # <2>
text_full = df.loc[vowel, word, episode, turn]['Text'] # <3>
word_loc = re.search(re.compile(word, re.IGNORECASE), text_full) # <4>
word_loc = word_loc.span()[0] # <4>
sub_start = max(word_loc - 25, 0) # <5>
sub_end = min(word_loc + 25, len(text_full)) # <6>
text_sub = text_full[sub_start:sub_end] # <7>
df.loc[(vowel, word, episode, turn), 'Text_Subset'] = str(text_sub) # <7>
return df
```
1. Make a new column for the subset text, so it can be specified as a string. If you insert values later without initializing the column as a string, pandas will throw warnings.
2. Go through each row of the dataframe with all the transcript turns by Gretchen or Lauren that contain target words.
3. Use index values to get the value of `Text`.
4. Search for the current row's `Word` in the current row's `Text` (again case-insensitive). This returns a match object, and `.span()[0]` is the index of the first letter of `Word` within `Text` (illustrated after this list).
5. Start index of `Text_Subset` is 25 characters before the start of the `Word`, or the beginning of the string, whichever is larger.
6. End index of `Text_Subset` is 25 characters after the start of the `Word`, or the end of the string, whichever is smaller.
7. Insert `Text_Subset` into dataframe.
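Here's the same windowing logic on a plain made-up string, outside the dataframe:
```{python}
#| label: demo-trim-window
text_full = 'and that is why people keep asking about the vowels in this episode'
word_loc = re.search(re.compile('people', re.IGNORECASE), text_full).span()[0]
text_full[max(word_loc - 25, 0):min(word_loc + 25, len(text_full))]  # up to 25 characters either side
```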
Subset the transcripts dataframe to only include turns by Gretchen/Lauren that contain a target word:
```{python}
#| label: filter-transcript
transcripts_subset = filter_for_words(word_list, transcripts) # <1>
transcripts_subset = trim_turn(transcripts_subset) # <1>
pd.concat([transcripts_subset.head(), transcripts_subset.tail()]).style # <2>
```
1. Run the two functions just defined on the dataframe of all the transcripts, to filter to only include turns by Gretchen or Lauren that contain the target words, then trim the text of each turn around the target word.
2. Show the first 5 and last 5 rows of the resulting dataframe. `.style` prints it as a nicer-looking table.
Here's how many tokens we have for each speaker and word:
```{python}
#| label: word-counts
transcripts_subset \
.value_counts(['Vowel', 'Word', 'Speaker'], sort=False) \
.to_frame('Tokens') \
.reset_index(drop=False) \
.style \
.hide() # <1>
```
1. Count the number of items in each combination of `Vowel`, `Word`, and `Speaker`. Convert those results to a dataframe and call the column of counts `Tokens`. Change `Vowel`, `Word`, and `Speaker` from indices to columns. Print the results as a table, not including row numbers.
Gretchen and Lauren each say all of the words more than once, and most of the words have 5+ tokens to pick from.
### Get Timestamps
The transcripts on the Lingthusiasm website have all the speakers labelled, but no timestamps, and the captions on YouTube have timestamps, but few speaker labels. To find where in the episodes the token words are, we'll have to combine them.
We're going to be getting the caption data and audio from Lingthusiasm's "all episodes" [playlist](https://www.youtube.com/watch?v=xHNgepsuZ8c&list=PLcqOJ708UoXQ2wSZelLwkkHFwg424u8tG), using the [pytube package](https://pytube.io/en/latest/).
```{python}
#| label: video-list
video_list = Playlist('https://www.youtube.com/watch?v=xHNgepsuZ8c&' +
'list=PLcqOJ708UoXQ2wSZelLwkkHFwg424u8tG')
```
This function uses the YouTube playlist link to download captions for each video:
```{python}
#| label: function-get-captions
def get_captions(videos):
"""Get and format captions from YouTube URLs."""
df = pd.DataFrame() # <1>
for url in videos: # <2>
video = YouTube(url) # <2>
video.bypass_age_gate() # <3>
title = video.title # <4>
number = int(title[:2]) # <4>
caps = caption_version(video.captions) # <5>
try:
caps = parse_captions(caps.xml_captions) # <6>
except AttributeError:
caps = pd.DataFrame()
caps.insert(0, 'Episode', number) # <7>
df = pd.concat([df, caps]) # <8>
df = df[['Episode', 'Text', 'Start', 'End']] # <9>
df[['Start', 'End']] = df[['Start', 'End']].astype('float32') # <9>
return df.sort_values(['Episode', 'Start']) # <9>
```
1. Start dataframe for results.
2. Go through the list of video URLs and open each one as a `YouTube` object.
3. Need to include this to access captions.
4. The video title is an attribute of the `YouTube` object, and the episode number is read from the first two characters of the title.
5. Load captions, which returns a dictionary-like object with the language code as the key and the caption track as the value. Select which English version to use (function defined below).
6. Convert XML data to dataframe (function defined below). If this doesn't work, make an empty placeholder dataframe.
7. Add `Episode` column (integer).
8. Add results from current video to dataframe of all results.
9. Return dataframe of all captions, after organizing columns and sorting rows by episode number then time.
Now, some helper functions to select which caption version to download and to reformat the data from XML into a dataframe.
Caption data most commonly comes as XML or SRT; both are text formats, and neither converts straightforwardly into a dataframe.
```{python}
#| label: function-caption-utils
def caption_version(cur_captions): # <1>
"""Select which version of the captions to use (formatting varies)."""
if 'en' in cur_captions.keys():
return cur_captions['en']
elif 'en-GB' in cur_captions.keys():
return cur_captions['en-GB']
elif 'a.en' in cur_captions.keys():
return cur_captions['a.en']
def caption_times(xml):
"""Find timestamps in XML data."""
lines = pd.Series(xml.split('<p')[1:], dtype='string') # <2>
df = lines.str.extract(r'(?P<Start>t="[0-9]*")') # <3>
df['Start'] = df['Start'].str.removeprefix('t="').str.removesuffix('"') # <4>
df['Start'] = df['Start'].astype('int') # <4>
return df
def caption_text(xml):
"""Find text and filter out extra tags in XML data."""
lines = pd.Series(xml.split('<p')[1:], dtype='string') # <5>
tags = r'"[0-9]*"|t=|d=|w=|a=|ac=|</p>|</s>|<s\s|>\s|>|</body|</timedtext' # <6>
texts = lines.str.replace(tags, '', regex=True) # <7>
    texts = texts.str.replace('&#39;', '\'') \
        .str.replace('&quot;', "'") \
        .str.replace('&lt;', '[') \
        .str.replace('&gt;', ']') \
        .str.replace('&amp;', ' and ') # <8>
texts = [clean_text(line) for line in texts.to_list()] # <8>
return pd.DataFrame({'Text': texts}, dtype='string') # <9>
def parse_captions(xml):
"""Combine timestamp rows and text rows."""
times = caption_times(xml) # <10>
texts = caption_text(xml) # <10>
df = times.join(texts) # <10>
df = df[df['Text'].str.contains(r'[a-z]')] # <11>
df = df.reset_index(drop=True) # <11>
df['End'] = df['Start'].shift(-1) - 1 # <12>
return df
```
1. Prefer `en` captions; if those aren't available, try `en-GB`, then `a.en`. The main difference is in the formatting and, I think, whether they're user-uploaded or auto-generated.
2. The XML data is one giant string. Split into a list using the `<p` tag, which results in one string per caption item (timestamps, text, and various other tags).
3. Each caption item has a start time preceded by `t=`. This regex gets `t="[numbers]"` and puts each result into a column called `Start`.
4. Remove `t="` from the beginning of the `Start` column and `"` from the end, leaving a number that can be converted from a string to an integer.
5. The XML data is one giant string. Split into a list using the `<p` tag, which results in one string per caption item (timestamps, text, and various other tags).
6. This is a list of the other variables that come with the caption text, including `t=` for start time and `d=` for duration. The formatting varies somewhat based on which version of captions you get, and the reason this selects the `en` captions before the `en-GB` and `a.en` captions is that this formatting is simpler.
7. Take all the tags/variables out, leaving only the actual caption text.
8. Replace HTML character entities left in the caption text, then remove special characters and extra spaces with `clean_text()`.
9. Return as dataframe, with `Text` specified as a string column. If you leave it as the default object type, pandas will throw warnings later.
10. Get dataframes with caption times and caption texts, using two functions just defined. Join the results into one dataframe (this works because both are indexed by row number).
11. Remove rows where caption text is blank, then reset index (row number).
12. Make a column for the `End` time, which is the `Start` time of the next row minus 1 (see the small example below).
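To make the XML handling concrete, here's `parse_captions()` run on a tiny synthetic snippet with the same shape (real caption XML has more attributes and nested tags, but the idea is the same):
```{python}
#| label: demo-parse-captions
fake_xml = ('<timedtext><body>'
            '<p t="1000" d="2100">Gretchen: Welcome to the show</p>'
            '<p t="3200" d="1800">where we talk about vowels</p>'
            '</body></timedtext>')
parse_captions(fake_xml)  # Start/End are in milliseconds; the last row has no End time
```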
Running this takes a minute or so.
```{python}
#| label: get-captions
captions = get_captions(video_list) # <1>
captions = captions[captions['Episode'] <= 84] # <2>
pd.concat([captions.head(), captions.tail()]).style.format(precision=0) # <3>
```
1. Load and format captions for each video in the "all episodes" playlist, using functions defined above.
2. Only analyze episodes 1-84.
3. Show the first 5 and last 5 rows of the resulting dataframe. `.style` prints it as a nicer-looking table.
Save results to `data/captions.csv`:
```{python}
#| label: save-captions
captions.to_csv('data/captions.csv', index=False)
```
### Match Timestamps to Transcripts
Now match the text from the transcript (target word start -/+ 25 characters) to the caption timestamps, using the [thefuzz package](https://github.com/seatgeek/thefuzz) to find the closest match.
```{python}
#| label: function-match-transcript-times
def match_transcript_times(df_trans, df_cap):
"""Use fuzzy matching to match transcript turn to caption timestamp."""
df_times = df_trans.reset_index(drop=False) # <1>
to_cat = ['Vowel', 'Word', 'Episode'] # <1>
df_times[to_cat] = df_times[to_cat].astype('category') # <1>
df_times['Text_Caption'] = pd.Series(dtype='string') # <2>
for i in df_times.index:
episode = df_times.loc[i, 'Episode'] # <3>
captions_subset = df_cap[df_cap['Episode'] == episode] # <3>
text = df_times.loc[i, 'Text_Subset'] # <4>
text = match_exceptions(text, search_exceptions) # <5>
if text is not None: # <6>
match = process.extractOne(text, captions_subset['Text']) # <6>
if match is not None: # <7>
df_times.loc[i, 'Text_Caption'] = match[0] # <7>
df_times.loc[i, 'Match_Rate'] = match[1] # <7>
df_times.loc[i, 'Start'] = captions_subset.loc[match[2], 'Start'] # <8>
                df_times.loc[i, 'End'] = captions_subset.loc[match[2], 'End'] # <8>
    return df_times # <9>
```
1. Make a copy of the transcripts-containing-target-words dataframe, convert `Vowel`, `Word`, and `Episode` from indices to columns, and make those columns categorical.
2. Make column for results and specify it is a string type. Again, this will work without this line, but will result in warnings from pandas.
3. Subset the captions dataframe to only include the episode of the current row, so we have less to search through.
4. Text from transcript (target word -/+ 25 characters) to search for in captions.
5. There are a few cases where the search text needs to be tweaked or subset to get the correct match, and a few cases where a token needs to be skipped because the audio quality wasn't good enough. These cases are defined in the next code chunk.
6. Find row in caption dataframe (subset to just include the current episode) that best matches the search text (target word -/+ 25 characters).
7. If a match is identified, a tuple is returned where the first item is the matching text, the second item is the certainty (where 100 is an exact match), and the third item is the row index (in the `Text` column of the captions dataframe, subset to just include the current episode).
8. Use the row number from the match to get the corresponding `Start` and `End` times.
9. Results include all of the target words found in the transcript, with columns added for `Text_Caption`, `Start`, and `End`.
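As a quick illustration of what `process.extractOne()` returns here, run against two made-up caption strings: a tuple of the best-scoring text, its score, and its index, which is what lets us look the timestamp back up.
```{python}
#| label: demo-fuzzy-match
fake_captions = pd.Series([
    'G: and emphasising the beat gestures in',
    'L: a completely unrelated sentence here',
])
process.extractOne('emphasizing the beat gestures in a very', fake_captions)
```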
Fuzzy string matching accounts for a lot of the differences between the transcripts (which are proofread/edited) and the captions (not all proofread/edited, and with some speaker names and initials mixed in). But it isn't always able to account for differences in how the text is split, since the captions aren't split by speaker turns. So, there are some tokens where matching the right timestamp requires a transcript text that's subset further (the `trim` list below) or tweaked (the `replace` mapping below). This isn't the most elegant solution, but it works (and it was still faster than opening up 84 30-minute audio files to find and extract out 200 3-second clips).
Then, after going through all the clips and annotating the vowels (see [Part 2](2_annotate_audio.qmd)), there are some that get skipped (the `skip` list below) because of the audio quality (e.g., overlapping speech).
```{python}
#| label: function-match-exceptions
search_exceptions = {
'replace': { # <1>
"all to a sing": "all to a single place",
"and you have a whole room full":
"speakers who have a way of communicating with each other that",
"'bang' or something": "bang or something",
"emphasising the beat gestures in a very": "B gestures",
"by the foot and they really want":
"Yeah they rented the books by the foot and they really want",
"father's brother's wife": "father's brother's wife",
"foot/feet": "foot feet",
"is known as a": "repetition is known as a",
"an elastic thing with a ball on the end.":
"paddle thing that has an elastic thing with a ball on the end",
"pin things": "trying to use language names as a way to pin things",
"red, green, yellow, blue, brown, pink, purple":
"L: Black, white, red, green, yellow, blue, brown",
"see; red, green, yellow, blue, brown, purple, pink":
"green, yellow, blue, brown, purple, pink,",
"tied up with": "complicated or really difficult histories",
"to cite": "believe I got to",
"your father's brother, and you": "your father's brother",
"your father's brothers or your":
"your father's brothers or your mother's brothers"
},
'trim': [ # <2>
'a lot of very core', 'a more honorific one if you',
'an agreement among', 'audiobook', 'auditorily', 'ball of rocks',
'between those little balls', 'big yellow', 'bonus topics',
'bought a cot', 'bought! And it was true', 'bought some cheese',
'but let me finish', 'core idea', 'core metaphor',
        'core set of imperatives', 'core thing', 'finished with their turn',
'finished yours', 'guest, Gabrielle', 'guest, Pedro',
'honorific register', 'how could this plan',
'imperative or this honorific', 'letter from her father',
'like if I bought', 'on which country you bought', 'pangrams',
'pater -- father', 'returning special guest',
'see among languages are based', 'substitute',
'that did it with the Big Bang', 'the idea is that honorifics',
'wanna hear more from our guests', 'with your elbow or your foot',
'You bet', 'you tap your foot by the treat'
],
'skip': [ # <3>
'believe you left this', 'bit of a rise', 'By the foot!',
'can pin down', 'existing thanks', 'featuring', 'fun and delightful',
'fun to flip', 'fun with adults', 'group of things', 'his data?',
'in November', 'local differences', 'people notice', 'She says no',
'situation where', 'things wrong', 'what letters', 'who are around you'
]
}
def match_exceptions(text, exceptions):
"""Check if text to match is an exception and gets replaced or subset."""
for k, v in exceptions['replace'].items(): # <4>
if k in text:
return v
for s in exceptions['trim']: # <5>
if s in text:
return s
    return None if any(s in text for s in exceptions['skip']) else text # <6>
```
1. Cases where search text needs to be replaced to match correctly.
2. Cases where search text needs to be subset to match correctly.
3. Cases where search text needs to be skipped because of audio quality.
4. If a `replace` key is a substring of the search text, return that key's value.
5. If a `trim` string is a substring of the search text, return that string.
6. If a `skip` string is a substring of the search text, return `None`. If none of the exceptions apply, return the original search text.
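A quick check of the three cases, with made-up search strings that each contain one of the snippets defined above:
```{python}
#| label: demo-match-exceptions
print(match_exceptions("the famous 'foot/feet' alternation in", search_exceptions))    # replaced
print(match_exceptions('that was the core idea behind the whole', search_exceptions))  # trimmed
print(match_exceptions('rented the books By the foot! And then', search_exceptions))   # skipped: None
```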
This takes a few minutes to run.
```{python}
#| label: match-transcript-to-timestamps-1
timestamps = match_transcript_times(transcripts_subset, captions) # <1>
timestamps = timestamps[timestamps['Text_Caption'].notna()] # <2>
timestamps.to_csv('data/timestamps_all.csv', index=False) # <3>
```
1. Get timestamps for selected turns (function defined in two code chunks above).
2. Most rows identify a match; filter out the rows that don't.
3. Save all results.
Now select a subset to annotate. Pick the 5 best matches (of transcript text to caption timestamps) for each word + speaker combination, and if there are multiple equally good matches, pick the ones from later episodes, since those have higher sound quality.
```{python}
#| label: match-transcript-to-timestamps-2
timestamps = timestamps \
.sort_values(['Match_Rate', 'Episode'], ascending=False) \
.groupby(['Vowel', 'Word', 'Speaker'], observed=True) \
.head(5) \
.sort_values(['Vowel', 'Word', 'Speaker']) \
.reset_index(drop=True) # <1>
timestamps['Number'] = timestamps \
.groupby(['Word', 'Speaker'], observed=True) \
.cumcount() + 1 # <2>
timestamps = timestamps[[
'Vowel', 'Word', 'Speaker', 'Number', 'Episode', 'Turn',
'Text', 'Text_Subset', 'Text_Caption', 'Start', 'End'
]] # <3>
```
1. Sort with the highest match rates and later episodes at the top, then group by `Vowel`, `Word`, and `Speaker`. Keeping the first 5 rows within each `Vowel` + `Word` + `Speaker` group selects the closest matches, and when there are several equally good matches, picks the ones from later episodes (higher audio quality). Then re-sort by `Vowel`, `Word`, and `Speaker` and reset the row index. (A toy example of this pattern follows this list.)
2. Add new column for count within each `Word` + `Speaker` combination, which we'll use later to name files.
3. Sort columns.
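Here's the same pattern on a toy dataframe with made-up rows, just to show what the `groupby().head()` and `cumcount()` steps are doing:
```{python}
#| label: demo-groupby-head
toy = pd.DataFrame({
    'Word': ['bit', 'bit', 'bit', 'fun'],
    'Speaker': ['Gretchen', 'Gretchen', 'Lauren', 'Lauren'],
    'Match_Rate': [98, 95, 90, 88],
})
toy = toy.sort_values('Match_Rate', ascending=False) \
    .groupby(['Word', 'Speaker']).head(2)  # keep the 2 best rows per group
toy['Number'] = toy.groupby(['Word', 'Speaker']).cumcount() + 1  # running count within each group
toy
```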
We have 5 tokens of most words (except for the words that had fewer than 5 tokens in the transcripts):
```{python}
#| label: count-tokens
timestamps.value_counts(['Vowel', 'Word', 'Speaker'], sort=False) \
.reset_index() \
.style \
.hide() # <1>
```
1. Count number of rows for each `Vowel` + `Word` + `Speaker` combination. Convert those from indices to columns, so this is a dataframe instead of a series, and thus can be printed as a table by making it a `style` object. `.hide()` drops the row numbers.
Save results to `data/timestamps_annotate.csv`:
```{python}
#| label: save-timestamps
timestamps.to_csv('data/timestamps_annotate.csv', index=False)
```