mozilla-ai · daavoo · Dec 27, 2024 · Dec 27, 2024 · Dec 27, 2024 · Dec 27, 2024
diff --git a/README.md b/README.md
@@ -10,6 +10,8 @@
 This blueprint demonstrate how you can use open-source models & tools to convert input documents into a podcast featuring two speakers.
 It is designed to work on most local setups or with [GitHub Codespaces](https://github.com/codespaces/new?hide_repo_select=true&ref=main&repo=888426876&skip_quickstart=true&machine=standardLinux32gb), meaning no external API calls or GPU access is required. This makes it more accessible and privacy-friendly by keeping everything local.
 
+<p align="center"><a href="https://colab.research.google.com/github/mozilla-ai/document-to-podcast/blob/main/demo/notebook.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a></p>
+
 ### 👉 📖 For more detailed guidance on using this project, please visit our [Docs here](https://mozilla-ai.github.io/document-to-podcast/).
 
 ### Built with

diff --git a/demo/notebook.ipynb b/demo/notebook.ipynb
@@ -0,0 +1,372 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Document to Podcast"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Source code: https://github.com/mozilla-ai/document-to-podcast"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Docs: https://mozilla-ai.github.io/document-to-podcast/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebooks goes through the process of transforming documents into engaging podcast episodes involves an integration of pre-processing, LLM-powered transcript generation, and text-to-speech generation.\n",
+    "\n",
+    "For educational purposes, the \"low level\" API is used.\n",
+    "\n",
+    "You can check the [Command Line Interface](https://mozilla-ai.github.io/document-to-podcast/cli/) for a simpler usage."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Installing dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp310-cp310-linux_x86_64.whl\n",
+    "%pip install --quiet document-to-podcast"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Uploading data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from google.colab import files\n",
+    "\n",
+    "uploaded = files.upload()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading and cleaning data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Docs for this Step](https://mozilla-ai.github.io/document-to-podcast/step-by-step-guide/#step-1-document-pre-processing)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "from document_to_podcast.preprocessing import DATA_CLEANERS, DATA_LOADERS\n",
+    "\n",
+    "input_file = list(uploaded.keys())[0]\n",
+    "suffix = Path(input_file).suffix\n",
+    "\n",
+    "data_loader = DATA_LOADERS[suffix]\n",
+    "data_cleaner = DATA_CLEANERS[suffix]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "raw_text = data_loader(input_file)\n",
+    "print(f\"Number of characters before cleaning: {len(raw_text)}\")\n",
+    "print(raw_text[:200])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "clean_text = data_cleaner(raw_text)\n",
+    "print(f\"Number of characters after cleaning: {len(clean_text)}\")\n",
+    "print(clean_text[:200])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Downloading and loading models"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Docs for this Step](https://mozilla-ai.github.io/document-to-podcast/step-by-step-guide/#step-2-podcast-script-generation)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For this demo, we are using the following models:\n",
+    "  - [OLMoE-1B-7B-0924-Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct-GGUF)\n",
+    "  - [OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf](https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can check the [Customization Guide](https://mozilla-ai.github.io/document-to-podcast/customization/) for more information on how to use different models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from document_to_podcast.inference.model_loaders import (\n",
+    "    load_llama_cpp_model,\n",
+    "    load_outetts_model,\n",
+    ")\n",
+    "\n",
+    "text_model = load_llama_cpp_model(\n",
+    "    \"allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf\"\n",
+    ")\n",
+    "speech_model = load_outetts_model(\n",
+    "    \"OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "max_characters = text_model.n_ctx() * 4\n",
+    "if len(clean_text) > max_characters:\n",
+    "    print(\n",
+    "        f\"Input text is too big ({len(clean_text)}).\"\n",
+    "        f\" Using only a subset of it ({max_characters}).\"\n",
+    "    )\n",
+    "    clean_text = clean_text[:max_characters]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Podcast generation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Docs for this Step](https://mozilla-ai.github.io/document-to-podcast/step-by-step-guide/#step-3-audio-podcast-generation)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Speaker configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from document_to_podcast.config import Speaker\n",
+    "\n",
+    "speakers = [\n",
+    "    {\n",
+    "        \"id\": 1,\n",
+    "        \"name\": \"Laura\",\n",
+    "        \"description\": \"The main host. She explains topics clearly using anecdotes and analogies, teaching in an engaging and captivating way.\",\n",
+    "        \"voice_profile\": \"female_1\",\n",
+    "    },\n",
+    "    {\n",
+    "        \"id\": 2,\n",
+    "        \"name\": \"Jon\",\n",
+    "        \"description\": \"The co-host. He keeps the conversation on track, asks curious follow-up questions, and reacts with excitement or confusion, often using interjections like hmm or umm.\",\n",
+    "        \"voice_profile\": \"male_1\",\n",
+    "    },\n",
+    "]\n",
+    "\n",
+    "speakers_str = \"\\n\".join(\n",
+    "    str(Speaker.model_validate(speaker))\n",
+    "    for speaker in speakers\n",
+    "    if all(speaker.get(x, None) for x in [\"name\", \"description\", \"voice_profile\"])\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Prompt Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "PROMPT = \"\"\"\n",
+    "You are a podcast scriptwriter generating engaging and natural-sounding conversations in JSON format.\n",
+    "The script features the following speakers:\n",
+    "{SPEAKERS}\n",
+    "Instructions:\n",
+    "- Write dynamic, easy-to-follow dialogue.\n",
+    "- Include natural interruptions and interjections.\n",
+    "- Avoid repetitive phrasing between speakers.\n",
+    "- Format output as a JSON conversation.\n",
+    "Example:\n",
+    "{\n",
+    "  \"Speaker 1\": \"Welcome to our podcast! Today, we're exploring...\",\n",
+    "  \"Speaker 2\": \"Hi! I'm excited to hear about this. Can you explain...\",\n",
+    "  \"Speaker 1\": \"Sure! Imagine it like this...\",\n",
+    "  \"Speaker 2\": \"Oh, that's cool! But how does...\"\n",
+    "}\n",
+    "\"\"\"\n",
+    "system_prompt = PROMPT.replace(\"{SPEAKERS}\", speakers_str)\n",
+    "print(system_prompt)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Model inference"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "\n",
+    "from document_to_podcast.inference.text_to_speech import text_to_speech\n",
+    "from document_to_podcast.inference.text_to_text import text_to_text_stream\n",
+    "from IPython.display import display, Audio\n",
+    "\n",
+    "podcast_audio = []\n",
+    "podcast_script = \"\"\n",
+    "text = \"\"\n",
+    "for chunk in text_to_text_stream(\n",
+    "    clean_text, text_model, system_prompt=system_prompt.strip()\n",
+    "):\n",
+    "    text += chunk\n",
+    "    if text.endswith(\"\\n\") and \"Speaker\" in text:\n",
+    "        podcast_script += text\n",
+    "        print(text)\n",
+    "\n",
+    "        speaker_id = re.search(r\"Speaker (\\d+)\", text).group(1)\n",
+    "        voice_profile = next(\n",
+    "            speaker[\"voice_profile\"]\n",
+    "            for speaker in speakers\n",
+    "            if speaker[\"id\"] == int(speaker_id)\n",
+    "        )\n",
+    "        speech = text_to_speech(\n",
+    "            text.split(f'\"Speaker {speaker_id}\":')[-1],\n",
+    "            speech_model,\n",
+    "            voice_profile,\n",
+    "        )\n",
+    "        podcast_audio.append(speech)\n",
+    "        display(Audio(speech, rate=speech_model.audio_codec.sr))\n",
+    "        text = \"\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Save the results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can download the results from the file explorer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"podcast.txt\", \"w\") as f:\n",
+    "    f.write(podcast_script)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import soundfile as sf\n",
+    "\n",
+    "sf.write(\n",
+    "    \"podcast.wav\",\n",
+    "    np.concatenate(podcast_audio),\n",
+    "    samplerate=speech_model.audio_codec.sr,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}