2_annotate_audio.qmd

---
title: "Part 2: Annotating the Audio"
number-offset: [0, 2, 0]
---

<br>

There are two sets of data going into these vowel plots:

1. Vowels pulled from the Lingthusiasm episode recordings, which were located in [Part 1](1_find_words.qmd)
2. Vowels from Gretchen & Lauren recording the Wells lexical set for me

The next steps are to trim the words out of the episode audio files for #1, then annotate the vowels for both #1 and #2.

### Setup

```{python}
#| label: imports

"""Part 2 of Lingthusiasm Vowel Plots: Trimming Audio and Getting Vowel Formants."""

import glob  # <1>
import os  # <1>
import pandas as pd  # <2>
from pytube import Playlist, YouTube  # <3>
from pydub import AudioSegment  # <4>
import parselmouth  # <5>
```

1. File utilities.
2. Dataframes.
3. Getting captions and audio data from YouTube.
4. Working with audio files.
5. Interface with Praat.

Get video info from Lingthusiasm's all episodes [playlist](https://www.youtube.com/watch?v=xHNgepsuZ8c&list=PLcqOJ708UoXQ2wSZelLwkkHFwg424u8tG):

```{python}
#| label: video-list

video_list = Playlist('https://www.youtube.com/watch?v=xHNgepsuZ8c&' +
                      'list=PLcqOJ708UoXQ2wSZelLwkkHFwg424u8tG')
```

Go through each video and download audio (if not already downloaded):

```{python}
#| label: download-audio

def get_audio(videos):
    """Download episode audio from Youtube."""
    for url in videos:
        video = YouTube(url)  # <1>
        video.bypass_age_gate()  # <2>

        title = video.title  # <3>
        episode = int(title[:2])  # <3>

        audio_file_name = os.path.join(  # <4>
            'audio', 'episodes', f'{episode}.mp4')
        if not os.path.isfile(audio_file_name):  # <5>
            audio_stream = video.streams.get_audio_only()  # <5>
            print(f'downloading {episode}')  # <5>
            audio_stream.download(filename=audio_file_name)  # <5>


get_audio(video_list)
```

1. Go through the list of video URLs and open each one as a `YouTube` object.
2. Need to include this to download data.
3. The video title is an attribute of the `YouTube` object, and the episode number is the first word of the title.
4. Create file name for episode audio.
5. If file is not already downloaded, select and download the highest-quality audio-only stream.

### Trim Audio from Episodes

Open the `timestamps` data from Part 1:

```{python}
#| label: open-timestamps

timestamps = pd.read_csv( 
    'data/timestamps_annotate.csv',
    usecols=[
        'Vowel', 'Word', 'Speaker', 'Number',  # <1>
        'Episode', 'Start', 'End'  # <2>
    ],
    dtype={  # <3>
        'Vowel': 'category', 'Word': 'category', 'Speaker': 'category',  # <3>
        'Number': 'category', 'Episode': 'category',  # <3>
        'Start': 'int', 'End': 'int'  # <3>
    }  # <3>
)
timestamps['Speaker'] = timestamps['Speaker'] \
    .str.replace('retchen', '').str.replace('auren', '') \
    .astype('category')  # <4>
```

1. Keep columns specifying word variables.
2. And keep columns specifying where audio is.
3. Make all columns categorical variables, except the `Start` and `End` times (integers).
4. Convert values in `Speaker` column from names to initials.

Trim audio for the duration of the caption (with 250ms before and after). This results ~240 audio files each 2-10sec long, each containing a target word.

```{python}
#| label: trim-audio

def trim_audio(df):
    """Use caption timestamps to trim audio."""

    for i in df.index:  # <1>
        episode = df.loc[i, 'Episode']  # <1>
        word = df.loc[i, 'Word']  # <1>
        speaker = df.loc[i, 'Speaker']  # <1>
        count = df.loc[i, 'Number']  # <1>

        out_file =  os.path.join(  # <2>
            'audio', 'words', f'episode_{word}_{speaker}_{count}.wav')
        if not os.path.isfile(out_file):  # <2>
            in_file = os.path.join('audio', 'episodes', f'{episode}.mp4')  # <3>
            audio = AudioSegment.from_file(in_file, format='mp4')  # <3>
            start = max(df.loc[i, 'Start'] - 250, 0)  # <4>
            end = min(len(audio), df.loc[i, 'End'] + 250)  # <4>
            clip = audio[start:end]  # <4>
            clip.export(out_f=out_file, format='wav')  # <4>


trim_audio(timestamps)
```

1. Go through dataframe that has the example words to annotate and their timestamps.
2. Make file name for current word, and if it does not already exist...
3. Open the audio file for the whole episode.
4. Trim the episode audio to start 250 ms after the caption timestamp and end 250 after the caption timestamp; save it.

### Wells Lexical Set

The Wells lexical set is a set of examples for each vowel/diphthong, chosen to be maximally distinguishable and consistent. You can read more about it on [Wikipedia](https://en.wikipedia.org/wiki/Lexical_set#Standard_lexical_sets_for_English) and [John Wells' blog](https://phonetic-blog.blogspot.com/2010/02/lexical-sets.html). These recordings are going to be more controlled than the vowels pulled from the episode recordings and easier to annotate because they're spoken more slowly and carefully. This set contains some fairly low-frequency words, which is why there's not a lot of overlap with the words pulled from the episodes.

[![The Wells lexical set](resources/wells_lexical_set.jpg)](http://2.bp.blogspot.com/_RSOXNV65lN0/S2a13vcLBAI/AAAAAAAAAYg/RQo2sbM7cqM/s1600-h/sets.jpg)

```{python}
#| label: list-wells-lexical-set

wells_lexical_set = {  # <1>
    '\u0069': 'fleece',   # i
    '\u026A': ['kit', 'near'],  # ɪ
    '\u025B': ['dress', 'square'],  # ɛ
    '\u00E6': ['trap', 'bath'],  # æ
    '\u006F': ['force', 'goat'], # o 
    '\u0075': 'goose',  # u
    '\u028A': ['cure', 'foot'],  # ʊ
    '\u0254': ['cloth', 'north', 'thought'],  # ɔ
    '\u0251': ['lot', 'palm', 'start'],  # ɑ
    '\u028C': 'strut'  # ʌ
}

wells_lexical_set = pd.DataFrame.from_dict(wells_lexical_set, orient='index') \
    .rename(columns={0: 'Word'}) \
    .explode('Word') \
    .reset_index(names='Vowel')  # <2>
```

1. Dictionary where keys are the IPA vowel unicode and values is word(s).
2. Convert to dataframe with columns for `Vowel` and `Word`.

Here's the full set of words for each vowel:

```{python}
#| label: all-words-list

word_list = pd.concat([  # <1>
    pd.DataFrame({  # <2>
        'List': 'lexicalset',  # <2>
        'Vowel': wells_lexical_set['Vowel'],  # <2> 
        'Word': wells_lexical_set['Word']  # <2>
    }),  # <2>
    timestamps[['Vowel', 'Word']].drop_duplicates()  # <3>
  ])
word_list = word_list.fillna('episode')  # <4>
word_list = word_list.sort_values(by = ['Vowel', 'Word'])  # <5>
word_list = word_list.reset_index(drop = True)  # <5>

word_list.style.hide()  # <6>
```

1. Combine word lists from episodes and Wells lexical set.
2. Dataframe for Wells lexical set with columns for `List`, `Vowel`, and `Word`.
3. Subset dataframe for episode word list, with columns for `Vowel` and `List`.
4. Fill the `NA` values of `List` with `episode`.
5. Sort and reset index.
6. Print, not including index.

### An Interlude in Praat

Now, we have about 400 audio clips of Gretchen and Lauren saying words that are good examples of each vowel. The vowel data that will actually be going into the plots is the lowest two formants: F1 and F2 (see the bonus episode about vowel plots and the main episode about vowels for an explanation of what formants are). The easiest way to calculate the vowel formants is using Praat, a software designed for doing phonetic analysis.

Here's an example of what that looked like:

![Praat screenshot of Lauren saying "pit."](resources/praat_screenshot.png)

The vowel [ɪ] is highlighted in pink, and you can see it's darker on the spectrogram than the consonants [k] and [t] before and after it. I placed an annotation (the blue lines, which Praat calls a "point tier") right in the middle of the vowel sound---this one is pretty easy, because Lauren was speaking slowly and without anyone overlapping, so the vowel sound is long and clear.
The formants are the lines of red dots, and the popup window is the formant values at the vowel annotation time. We'll be using F1 (the bottom one) and F2 (the second from the bottom).

You can download Praat and see the documentation [here](https://www.fon.hum.uva.nl/praat/). It's fairly old, so a con is that the interface isn't necessarily intuitive if you're used to modern programs, but a pro is that there are a ton of resources available for learning how to use it. [Here's a tutorial](https://aletheiacui.github.io/tutorials/segmentation_with_praat.html) about getting started with Praat, and [here's one](https://home.cc.umanitoba.ca/~krussll/phonetics/practice/praat.html) for recording your own audio and calculating the formants in it.

After going through all of the audio clips, I had a .TextGrid file (Praat's annotation file format) for each audio clip that includes the timestamp for middle(ish) of the vowel. You can copy formant values manually out of Praat, or you can use Praat scripting to export them to a csv file (see [this  tutorial](https://joeystanley.com/blog/a-tutorial-on-extracting-formants-in-praat/), for example). But I prefer to go back to Python instead of wrangling Praat scripting code.

### Read Annotation Data

There are packages that read Praat TextGrid files, but I kept getting errors. Luckily, the textgrids for these annotations are simple text files, where we only need to extra one variable (the time of the point tier). These two functions do that:

```{python}
#| label: functions-textgrids-1

def get_tier(text):
    """Get annotation info from Praat TextGrid."""

    tg = text.split('class = "TextTier"')[1]  # <1>
    tg = tg.splitlines()[1:]  # <2>
    tg = pd.Series(tg)  # <2>
    tg = tg.str.partition('=')  # <3>
    tg.drop(columns=1, inplace=True)  # <3>
    tg.rename(columns={0: 'Name', 2: 'Value'}, inplace=True)  # <3>
    tg['Name'] = tg['Name'].str.strip()  # <4>
    tg['Value'] = tg['Value'].str.strip()  # <4>
    tg.set_index('Name', inplace=True)  # <5>

    return tg['Value'].to_dict()  # <5>


def get_point_tier_time(t):
    """Get time from TextGrid PointTier."""

    tg = get_tier(t)  # <6>
    time = tg['number']  # <7>
    time = float(time)  # <8>

    return round(time, 4)  # <8>
```

1. Section we need in TextGrid files start with this string.
2. Split string by line breaks and convert to pandas series.
3. Split into columns by `=` character, where the first column is the variable name, the second column is `=` (and gets dropped), and the third column is the variable value.
4. Remove extra whitespace.
5. Make `Name` into index, so the dataframe can be converted to a dictionary.
6. Read TextGrid file using function defined immediately above.
7. The variable we want (the timestamp for the PointTier annotation) is called `number`.
8. Convert the time from character to numeric and round to 4 digits.


These functions cycle through the list of TextGrid files, extract the point tier times, and put them into a dataframe with the rest of the information for each word:

```{python}
#| label: functions-textgrids-2

def read_textgrid_times(file_list, word_list):
    """Read textgrid files into dataframe."""

    tg_times = []
    for file_name in file_list:
        with open(file_name, encoding='utf-8') as t_file:
            t = t_file.read()
            try:
                tg = get_point_tier_time(t)  # <1>
            except KeyError:
                tg = None
            tg_times.append(tg)

    df = pd.DataFrame({'File': file_list, 'Vowel_Time': tg_times})  # <2>
    df['File'] = df['File'].str.rpartition('\\')[2]  # <3>
    df['File'] = df['File'].str.removesuffix('.TextGrid')  # <3>

    return textgrid_vars(df, word_list)  # <4>


def textgrid_vars(df, word_list):
    """Format df of vowel timestamps."""

    df['List'] = df['File'].str.split('_', expand=True)[0]  # <5>
    df['Word'] = df['File'].str.split('_', expand=True)[1]  # <5>
    df['Speaker'] = df['File'].str.split('_', expand=True)[2]  # <5>
    df['Count'] = df['File'].str.split('_', expand=True)[3]  # <5>

    df = pd.merge(df, word_list, how='left', on=['Word', 'List'])  # <6>

    return df[['List', 'Vowel', 'Word', 'Speaker', 'Count', 'Vowel_Time']]  # <7>
```

1. Try to get timestamp from PointTier annotation for each TextGrid file, using function defined in previous code chunk.
2. Put results into a dataframe.
3. Remove the path prefix and type suffix from the file names, leaving just the `word_speaker_number` format.
4. Get other variables from `File`, using function defined immediately below.
5. When `File` is split by `_`, `List` (`episode` or `lexicalset`) is the first item, `Word` is the second item, `Speaker` (`G` or `L`) is the third item, and `Count` is the fourth item in the resulting list.
6. Merge with the word list dataframe to add a column for `Vowel` by matching on `Word`.
7. Organize.

Make list of TextGrid files and read PointTier times:

```{python}
#| label: read-textgrid-times

tg_list = glob.glob(os.path.join('audio', 'words', '*.TextGrid'))  # <1>
formants = read_textgrid_times(tg_list, word_list)  # <2>

pd.concat([formants.head(), formants.tail()]).style.hide()  # <3>
```

1. Get a list of all `.TextGrid` files in the `audio/words/` directory.
2. Get the data from each TextGrid file, using functions defined in previous two code chunks.
3. Get the first and last 10 rows, then display as a table, not including row numbers.

### Calculate Formants

It's possible to export the formants from Praat, but I think it's easier to use the [parselmouth package](https://parselmouth.readthedocs.io/en/stable/index.html) here, which runs Praat from Python.

```{python}
#| label: function-calculate-formants

def get_formants(df):
    """Get F1 and F2 at specified time."""

    for i in df.index:
        file = os.path.join(  # <1>
            'audio', 'words',  # <1>
            (df.loc[i, 'List'] + '_' + df.loc[i, 'Word'] + '_' +  # <1>
            df.loc[i, 'Speaker'] + '_' + df.loc[i, 'Count'] + '.wav')  # <1>
        )
        audio = parselmouth.Sound(file)  # <2>
        formants = audio.to_formant_burg(  # <3>
            time_step=0.01, max_number_of_formants=5  # <3>
        )  # <3>
        if 'fleece_L_2' in file:  # <4>
            formants = audio.to_formant_burg(  # <4>
                time_step=0.01, max_number_of_formants=4  # <4>
            )  # <4>
        vowel_time = df.loc[i, 'Vowel_Time']  # <5>
        df.loc[i, 'F1'] = formants.get_value_at_time(1, vowel_time)  # <6>
        df.loc[i, 'F2'] = formants.get_value_at_time(2, vowel_time)  # <6>

    return df
```

1. Reconstruct the current audio file name from `Word` + `Speaker` + `Count` variables.
2. Use the parselmouth package to open the audio file.
3. Call Praat via parselmouth and calculate the formants. These are the default settings: every 0.010 seconds, up to 5 formants.
4. Because of artefacts in the recording, this file needs a limit of 4 formants to identify them correctly.
5. Get the timestamp of the vowel for the current audio file.
6. Get F1 and F2 at the specified time from the parselmouth formant object.

Calculate the formants and summarize the results:

```{python}
#| label: calculate-formants

formants = get_formants(formants)  # <1>

pd.concat([formants.head(), formants.tail()]).style.hide()  # <2>
```

1. Calculate vowel formants for each word, using function defined in previous code chunk.
2. Get the first and last 10 rows, then display as a table, not including row numbers.

```{python}
#| label: formant-summary

formants.groupby(['Vowel', 'Speaker']) \
    .agg({'F1': ['min', 'mean', 'max'], 'F2': ['min', 'mean', 'max']}) \
    .style \
    .format(precision=0)  # <1>
```

1. Group the data by `Vowel` and `Speaker`, then calculate the min, mean, and max of `F1` and `F2` for each `Vowel` + `Speaker` combination. Print the results as a table, with values rounded to whole numbers.

Save results as `data/formants.csv`:

```{python}
#| label: save-formant-results

formants.to_csv('data/formants.csv', index=False)
```