podcast-transcriber

A set-it-and-forget-it GCP Cloud Function for transcribing a podcast, built for the Arms Control Wonk Podcast Slack community with transcriptions by Deepgram.

Overview

main.py defines a Cloud Function that

Waits on an invocation from a Pub/Sub topic;
Fetches a podcast's RSS or Atom feed of episodes;
Selects up to three of the most recent episodes for which it hasn't already produced transcripts;
Submits those podcast episodes to Deepgram's automated speech recognition API for transcription;
Writes the Deepgram response and a processed transcript to Google Cloud Storage.
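The selection step above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the Episode record, the select_untranscribed name, and the sample URIs are all hypothetical, and it assumes each episode carries an ISO 8601 publication date.

```python
from typing import Iterable, List, NamedTuple, Set


class Episode(NamedTuple):
    # Hypothetical minimal record; real feed entries carry more fields.
    uri: str        # the episode's URI as given in the feed
    published: str  # ISO 8601 date string, so string order is date order


def select_untranscribed(
    episodes: Iterable[Episode],
    transcribed_uris: Set[str],
    limit: int = 3,
) -> List[Episode]:
    """Return up to `limit` of the newest episodes that have no transcript yet."""
    pending = [ep for ep in episodes if ep.uri not in transcribed_uris]
    pending.sort(key=lambda ep: ep.published, reverse=True)  # newest first
    return pending[:limit]


# Sample data (made up for illustration).
feed = [
    Episode("https://example.com/ep1.mp3", "2021-01-01"),
    Episode("https://example.com/ep2.mp3", "2021-01-08"),
    Episode("https://example.com/ep3.mp3", "2021-01-15"),
    Episode("https://example.com/ep4.mp3", "2021-01-22"),
    Episode("https://example.com/ep5.mp3", "2021-01-29"),
]
picked = select_untranscribed(feed, {"https://example.com/ep5.mp3"})
```

Because already-transcribed episodes are filtered out before the limit is applied, each run moves further back through the feed until nothing is pending.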

That Cloud Function is designed to be invoked on a regular schedule; the setup instructions below and Makefile provide cron-like invocations by using Cloud Scheduler to publish to the Pub/Sub topic.
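For orientation, a Pub/Sub-triggered Python background function receives the message as a dict and a context object. The sketch below shows that shape only; the handle_pubsub name is hypothetical and the real entry point in main.py may look different.

```python
import base64


def handle_pubsub(event, context):
    """Background-function entry point; `event` is the Pub/Sub message dict."""
    # Pub/Sub delivers the payload base64-encoded under "data". A scheduler
    # message may carry no payload, so fall back to an empty string.
    payload = base64.b64decode(event.get("data") or "").decode("utf-8")
    print(f"Triggered by Pub/Sub message: {payload!r}")
    return payload
```

The payload itself is not needed for scheduling purposes; the arrival of the message is what starts a run.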

It avoids retranscribing episodes by checking whether a transcription artifact matching the episode's feed URI exists in GCS.
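A check of this kind might look like the sketch below. The naming scheme (percent-encoding the URI and appending ".json") is an assumption made for illustration; the scheme actually used in main.py is not shown here. The bucket argument is expected to behave like a google.cloud.storage.Bucket, whose blob(name) method returns an object with an exists() method.

```python
from urllib.parse import quote


def artifact_name(episode_uri: str) -> str:
    """Map an episode URI to an object name (hypothetical naming scheme)."""
    # Percent-encode everything, including "/", so the name has no separators.
    return quote(episode_uri, safe="") + ".json"


def already_transcribed(bucket, episode_uri: str) -> bool:
    """Return True if an artifact for this episode exists in the bucket.

    `bucket` should behave like google.cloud.storage.Bucket:
    bucket.blob(name) returns an object whose exists() reports presence.
    """
    return bucket.blob(artifact_name(episode_uri)).exists()
```

With the google-cloud-storage library installed, a real bucket handle can be obtained with storage.Client().bucket("your-bucket-name") and passed in unchanged.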

Usage

Requirements

gsutil and gcloud
Python 3.7
Deepgram account

Setup

Recommended: start with an empty GCP project for the resources created here (a Cloud Storage bucket, Pub/Sub topic, Cloud Scheduler job, and Cloud Function). Run gcloud config set project your-project-name to point the Google Cloud SDK tools at that project.

Create a Google Cloud Storage bucket.

Run make bucket.
Create a .env.yaml file with the required environment variables.

TARGET_FEED_URL : The URL of a podcast's Atom/RSS feed, e.g. "https://armscontrolwonk.libsyn.com/rss".

TRANSCRIPTIONS_BUCKET_NAME : The name of the Google Cloud Storage bucket created in step 1.

DEEPGRAM_API_KEY : Your personal Deepgram API key. This is a secret!
Example .env.yaml file
```
# Configuration
TARGET_FEED_URL: "https://armscontrolwonk.libsyn.com/rss"
TRANSCRIPTIONS_BUCKET_NAME: "transcriptions"

# Secrets
DEEPGRAM_API_KEY: "your_deepgram_secret_here"
```
Initialize the scheduling infrastructure (Pub/Sub topic and Cloud Scheduler job; documentation).

Run make cron-job.
Deploy the Cloud Function.

Run make deploy.

To work through the backlog of episodes in the feed, repeatedly run the created job: gcloud scheduler jobs run WeeklyJob.
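Once deployed, the values from .env.yaml are available to the function as ordinary environment variables. A minimal sketch of reading and validating them, assuming a hypothetical load_config helper that is not part of main.py:

```python
import os

REQUIRED = ("TARGET_FEED_URL", "TRANSCRIPTIONS_BUCKET_NAME", "DEEPGRAM_API_KEY")


def load_config(environ=os.environ):
    """Return the required settings, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED}
```

Failing at startup makes a misconfigured deployment visible in the function logs on its first run rather than partway through a transcription.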

Customization

Scheduling

Set up a transcription schedule appropriate to your podcast. The default settings suit a podcast that publishes up to three episodes per week.

To check for updates more or less frequently, change the Cloud Scheduler job frequency. Default: once per week.

To transcribe more or fewer episodes per function execution, change the threshold in main.py#_main. Default: up to 3 episodes.

Deepgram and transcripts

The Deepgram request in main.py#_transcribe is tailored to the Arms Control Wonk podcast; if the speech in your podcast is faster or slower, you may want to decrease or increase the utt_split utterance threshold, respectively.
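To make the utt_split knob concrete, here is a sketch that builds the query string for Deepgram's pre-recorded transcription endpoint. The build_listen_url helper and the 0.8 value are illustrative assumptions; the options actually passed in main.py#_transcribe are not reproduced here, so check them against Deepgram's documentation before relying on this.

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_listen_url(utt_split: float = 0.8, **extra) -> str:
    """Build a request URL that asks for punctuated, utterance-split output."""
    params = {
        "punctuate": "true",
        "utterances": "true",
        # Seconds of silence between words that start a new utterance:
        # lower for fast speech, higher for slow speech.
        "utt_split": utt_split,
    }
    params.update(extra)  # any further options, passed through unchanged
    return DEEPGRAM_LISTEN_URL + "?" + urlencode(params)


fast_talkers_url = build_listen_url(utt_split=0.5)
```

The sketch only constructs the URL; sending the request also requires the API key in an Authorization header and the episode's audio URL in the body.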

See Deepgram's API documentation and Python SDK for details of the available options.

Want transcripts in a different format? Change how main.py#_process formats utterances.
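As a starting point, one possible format is a timestamped line per utterance. The sketch below assumes the Deepgram response holds a list under results.utterances whose items have start (in seconds) and transcript fields; the function names are hypothetical and this is not the formatting that main.py#_process currently applies.

```python
def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS, dropping fractions."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(response: dict) -> str:
    """Turn a Deepgram-style response into one timestamped line per utterance."""
    utterances = response.get("results", {}).get("utterances", [])
    lines = [f"[{format_timestamp(u['start'])}] {u['transcript']}" for u in utterances]
    return "\n".join(lines)


# Made-up sample response for illustration.
sample = {
    "results": {
        "utterances": [
            {"start": 0.5, "transcript": "Welcome to the show."},
            {"start": 3725.2, "transcript": "Thanks for listening."},
        ]
    }
}
text = format_transcript(sample)
```

Swapping the f-string in format_transcript is enough to produce other layouts, such as speaker labels or paragraph breaks.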
