Commit

Merge pull request #175 from souzatharsis/feat/longform
Feat/longform
souzatharsis authored Nov 13, 2024
2 parents 83edfd8 + b0a33cf commit 20085dd
Showing 17 changed files with 1,374 additions and 257 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,14 @@
# Changelog

## [0.3.6] - 2024-11-13

### Added
- Add longform podcast generation support
- Users can now generate longer podcasts (20-30+ minutes) using the `--longform` flag in CLI or `longform=True` in Python API
- Implements "Content Chunking with Contextual Linking" technique for coherent long-form content
- Configurable via `max_num_chunks` and `min_chunk_size` parameters in conversation config
- `word_count` parameter removed from conversation config as it's no longer used
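
The "Content Chunking with Contextual Linking" technique and the two new parameters above can be sketched in plain Python. This is an illustrative sketch only, not the package's actual implementation; `chunk_with_context` and `context_chars` are hypothetical names:

```python
# Illustrative sketch of "Content Chunking with Contextual Linking".
# NOT the package's real implementation: chunk_with_context and
# context_chars are hypothetical names used for illustration.
def chunk_with_context(text, max_num_chunks=7, min_chunk_size=600, context_chars=200):
    """Split text into up to max_num_chunks pieces of roughly
    min_chunk_size characters, carrying a tail of the preceding text
    as linking context so each per-chunk LLM call stays coherent."""
    chunks, start = [], 0
    while start < len(text) and len(chunks) < max_num_chunks:
        end = min(start + min_chunk_size, len(text))
        context = text[max(0, start - context_chars):start]
        chunks.append({"context": context, "content": text[start:end]})
        start = end
    return chunks

parts = chunk_with_context("x" * 2000)
assert len(parts) == 4  # 2000 chars at 600 per chunk -> 4 rounds, under the cap of 7
```

Each chunk would then be sent to the LLM together with its `context` tail, and the per-chunk transcripts concatenated into one long-form transcript.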

## [0.3.3] - 2024-11-08

### Breaking Changes
Expand Down
23 changes: 13 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,11 +24,9 @@ https://github.com/user-attachments/assets/f1559e70-9cf9-4576-b48b-87e7dad1dd0b
![GitHub Repo stars](https://img.shields.io/github/stars/souzatharsis/podcastfy)
</div>

Podcastfy is an open-source Python package that transforms multi-modal content (text, images) into engaging, multi-lingual audio conversations using GenAI. Input content includes websites, PDFs, images, YouTube videos, as well as user provided topics.


Podcastfy is an open-source Python package that transforms multi-modal content (text, images) into engaging, multi-lingual audio conversations using GenAI. Input content includes websites, PDFs, YouTube videos, as well as images.

Unlike UI-based tools focused primarily on note-taking or research synthesis (e.g. NotebookLM ❤️), Podcastfy focuses on the programmatic and bespoke generation of engaging, conversational transcripts and audio from a multitude of multi-modal sources, enabling customization and scale.
Unlike closed-source UI-based tools focused primarily on research synthesis (e.g. NotebookLM ❤️), Podcastfy focuses on open source, programmatic and bespoke generation of engaging, conversational content from a multitude of multi-modal sources, enabling customization and scale.

[![Star History Chart](https://api.star-history.com/svg?repos=souzatharsis/podcastfy&type=Date&theme=dark)](https://api.star-history.com/svg?repos=souzatharsis/podcastfy&type=Date&theme=dark)

Expand Down Expand Up @@ -61,16 +59,26 @@ This sample collection is also [available at audio.com](https://audio.com/thatup
## Features ✨

- Generate conversational content from multiple sources and formats (images, websites, YouTube, and PDFs).
- Customize transcript and audio generation (e.g., style, language, structure, length).
- Generate shorts (2-5 minutes) or longform (30+ minutes) podcasts.
- Customize transcript and audio generation (e.g., style, language, structure).
- Generate transcripts using 100+ LLM models (OpenAI, Anthropic, Google, etc.).
- Leverage local LLMs for transcript generation for increased privacy and control.
- Integrate with advanced text-to-speech models (OpenAI, Google, ElevenLabs, and Microsoft Edge).
- Provide multi-language support for global content creation.
- Integrate seamlessly with CLI and Python packages for automated workflows.
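
The shorts-vs-longform split above is governed by two conversation-config parameters this PR introduces: `max_num_chunks` (default 7) and `min_chunk_size` (default 600). A back-of-envelope sketch of how they bound the number of discussion rounds; `estimate_rounds` is a hypothetical helper, not part of the package:

```python
# Hypothetical back-of-envelope helper (not part of podcastfy): each
# round of discussion consumes at least min_chunk_size characters of
# input, and the total number of rounds is capped at max_num_chunks.
def estimate_rounds(content_chars, max_num_chunks=7, min_chunk_size=600):
    rounds = content_chars // min_chunk_size
    return max(1, min(rounds, max_num_chunks))

assert estimate_rounds(10_000) == 7  # long input: capped by max_num_chunks
assert estimate_rounds(1_500) == 2   # short input: limited by content length
```

Raising `max_num_chunks` or lowering `min_chunk_size` therefore lengthens the podcast, up to the limit the input content itself imposes.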

## Built with Podcastfy 🚀

- [OpenNotebook](https://www.open-notebook.ai/)
- [SurfSense](https://www.surfsense.net/)
- [Podcast-llm](https://github.com/evandempsey/podcast-llm)
- [Podcastfy-HuggingFace App](https://huggingface.co/spaces/thatupiso/Podcastfy.ai_demo)
- [Podcastfy-UI](https://github.com/giulioco/podcastfy-ui)

## Updates 🚀

### v0.3.0+ release
- Generate longform podcasts
- Generate podcasts from input topic using real-time internet search
- Integrate with 100+ LLM models (OpenAI, Anthropic, Google etc) for transcript generation
- Integrate with Google's Multispeaker TTS model for high-quality audio generation
Expand Down Expand Up @@ -121,11 +129,6 @@ Podcastfy offers a range of customization options to tailor your AI-generated po
- Choose to run [Local LLMs](usage/local_llm.md) (156+ HuggingFace models)
- Set [System Settings](usage/config_custom.md) (e.g. output directory settings)

## Built with Podcastfy 🛠️

- [OpenNotebook](www.open-notebook.ai)
- [Podcastfy-UI](https://github.com/giulioco/podcastfy-ui)
- [Podcastfy-Gradio App](https://huggingface.co/spaces/thatupiso/Podcastfy.ai_demo)

## License

Expand Down
1 change: 1 addition & 0 deletions TESTIMONIALS.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
- "Love that you casually built an open source version of the most popular product Google built in the last decade"
- "Your library was very straightforward to work with. You did Amazing work brother 🙏"
- "I think it's awesome that you were inspired/recognize how hard it is to beat NotebookLM's quality, but you did an *incredible* job with this! It sounds incredible, and it's open-source! Thank you for being amazing!"
65 changes: 62 additions & 3 deletions podcastfy.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,22 @@
"metadata": {},
"source": [
"# Podcastfy \n",
"Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI"
"Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Features\n",
"\n",
"- Support multiple input sources (text, images, websites, YouTube, and PDFs).\n",
"- Generate shorts (2-5 minutes) or longform (30+ minutes) podcasts.\n",
"- Customize transcript and audio generation (e.g., style, language, structure).\n",
"- Generate transcripts using 100+ LLM models (OpenAI, Anthropic, Google etc).\n",
"- Leverage local LLMs for transcript generation for increased privacy and control.\n",
"- Integrate with advanced text-to-speech models (OpenAI, Google, ElevenLabs, and Microsoft Edge).\n",
"- Provide multi-language support for global content creation."
]
},
{
Expand All @@ -19,6 +34,7 @@
"- Generate a podcast from text content\n",
" - Single URL\n",
" - Multiple URLs\n",
" - Generate longform podcasts\n",
" - Generate transcript only\n",
" - Generate audio from transcript\n",
" - Processing PDFs\n",
Expand Down Expand Up @@ -124,8 +140,7 @@
],
"source": [
"from podcastfy.client import generate_podcast\n",
"audio_file = generate_podcast(urls=[\"https://abcnews.go.com/US/water-frost-detected-mars-volcanoes-significant-discovery-study/story?id=110993572\"], \n",
" transcript_only=True)"
"audio_file = generate_podcast(urls=[\"https://abcnews.go.com/US/water-frost-detected-mars-volcanoes-significant-discovery-study/story?id=110993572\"])"
]
},
{
Expand Down Expand Up @@ -271,6 +286,50 @@
"However, this particular transcript did not pick up on Podcastfy's own content, focusing solely on the YouTube video. This can happen because the AI podcast hosts may latch onto a particular concept from one of the provided sources and develop the conversation around it. There is room for improvement in guiding the hosts to strike a better balance of content coverage across the provided input sources."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate longform podcasts\n",
"\n",
"\n",
"By default, Podcastfy generates shortform podcasts. However, users can generate longform podcasts by setting the `longform` parameter to `True`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"generate_podcast(urls=[\"<website>\"], longform=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"LLMs have a limited ability to produce long text responses; most models cap `max_output_tokens` at around 4096 to 8192 tokens, which makes long-form podcast transcript generation challenging. We have implemented a technique we call \"Content Chunking with Contextual Linking\" that enables long-form generation by breaking the input content into smaller chunks and generating a conversation for each chunk, while ensuring the combined transcript is coherent and linked to the original input.\n",
"\n",
"Shortform podcasts (the default configuration) generate about 2-5 minutes of audio, while longform podcasts may reach 20-30 minutes.\n",
"\n",
"### Adjusting longform podcast length\n",
"\n",
"Users may adjust longform podcast length by setting the following parameters in the customization params (see the later section \"Conversation Customization\"):\n",
"- `max_num_chunks` (default: 7): Sets maximum number of rounds of discussions.\n",
"- `min_chunk_size` (default: 600): Sets minimum number of characters to generate a round of discussion.\n",
"\n",
"A \"round of discussion\" is the output transcript obtained from a single LLM call. The higher the `max_num_chunks` and the lower the `min_chunk_size`, the longer the generated podcast will be.\n",
"Today, this technique allows users to generate long-form podcasts of arbitrary length, provided the input content is long enough. However, conversation quality may degrade, and output length may converge to a maximum, if `max_num_chunks` is set too high or `min_chunk_size` too low, particularly when input content is limited.\n",
"\n",
"Current implementation limitations:\n",
"- Images are not yet supported for longform podcast generation.\n",
"- The base LLM is fixed to Gemini.\n",
"\n",
"The above limitations are relatively straightforward to address; however, we chose to make updates in smaller, quicker iterations rather than all-in changes.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand Down
2 changes: 1 addition & 1 deletion podcastfy/__init__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# This file can be left empty for now
__version__ = "0.3.3" # or whatever version you're on
__version__ = "0.3.5" # or whatever version you're on
30 changes: 25 additions & 5 deletions podcastfy/client.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ def process_content(
model_name: Optional[str] = None,
api_key_label: Optional[str] = None,
topic: Optional[str] = None,
longform: bool = False
):
"""
Process URLs, a transcript file, image paths, or raw text to generate a podcast or transcript.
Expand All @@ -54,7 +55,6 @@ def process_content(
# Update with provided config if any
if conversation_config:
conv_config.configure(conversation_config)

# Get output directories from conversation config
tts_config = conv_config.get("text_to_speech", {})
output_directories = tts_config.get("output_directories", {})
Expand All @@ -64,21 +64,29 @@ def process_content(
with open(transcript_file, "r") as file:
qa_content = file.read()
else:
# Initialize content_extractor if needed
content_extractor = None
if urls or topic or (text and longform and len(text.strip()) < 100):
content_extractor = ContentExtractor()

content_generator = ContentGenerator(
api_key=config.GEMINI_API_KEY, conversation_config=conv_config.to_dict()
)

combined_content = ""
if urls or topic:
content_extractor = ContentExtractor()


if urls:
logger.info(f"Processing {len(urls)} links")
contents = [content_extractor.extract_content(link) for link in urls]
combined_content += "\n\n".join(contents)

if text:
combined_content += f"\n\n{text}"
if longform and len(text.strip()) < 100:
logger.info("Text too short for direct long-form generation. Extracting context...")
expanded_content = content_extractor.generate_topic_content(text)
combined_content += f"\n\n{expanded_content}"
else:
combined_content += f"\n\n{text}"

if topic:
topic_content = content_extractor.generate_topic_content(topic)
Expand All @@ -97,6 +105,7 @@ def process_content(
is_local=is_local,
model_name=model_name,
api_key_label=api_key_label,
longform=longform
)

if generate_audio:
Expand Down Expand Up @@ -171,6 +180,12 @@ def main(
topic: str = typer.Option(
None, "--topic", "-tp", help="Topic to generate podcast about"
),
longform: bool = typer.Option(
False,
"--longform",
"-lf",
help="Generate long-form content (only available for text input without images)"
),
):
"""
Generate a podcast or transcript from a list of URLs, a file containing URLs, a transcript file, image files, or raw text.
Expand Down Expand Up @@ -204,6 +219,7 @@ def main(
model_name=llm_model_name,
api_key_label=api_key_label,
topic=topic,
longform=longform
)
else:
urls_list = urls or []
Expand All @@ -227,6 +243,7 @@ def main(
model_name=llm_model_name,
api_key_label=api_key_label,
topic=topic,
longform=longform
)

if transcript_only:
Expand Down Expand Up @@ -259,6 +276,7 @@ def generate_podcast(
llm_model_name: Optional[str] = None,
api_key_label: Optional[str] = None,
topic: Optional[str] = None,
longform: bool = False,
) -> Optional[str]:
"""
Generate a podcast or transcript from a list of URLs, a file containing URLs, a transcript file, or image files.
Expand Down Expand Up @@ -324,6 +342,7 @@ def generate_podcast(
model_name=llm_model_name,
api_key_label=api_key_label,
topic=topic,
longform=longform
)
else:
urls_list = urls or []
Expand All @@ -349,6 +368,7 @@ def generate_podcast(
model_name=llm_model_name,
api_key_label=api_key_label,
topic=topic,
longform=longform
)

except Exception as e:
Expand Down
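
The short-text branch this diff adds to `process_content` can be distilled as follows. This is a simplified stub: `build_combined_content` and `expand_topic` are hypothetical stand-ins for the real code paths (`ContentExtractor.generate_topic_content`):

```python
# Simplified stub of the branching added to process_content above.
# build_combined_content and expand_topic are hypothetical stand-ins
# for the real code paths, shown to illustrate the control flow only.
def build_combined_content(text, longform, expand_topic):
    if longform and len(text.strip()) < 100:
        # Text too short for direct long-form generation: expand it first.
        return "\n\n" + expand_topic(text)
    return "\n\n" + text

out = build_combined_content("AI podcasts", True, lambda t: f"Background on {t}...")
assert out == "\n\nBackground on AI podcasts..."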
11 changes: 9 additions & 2 deletions podcastfy/config.yaml
Original file line number Diff line number Diff line change
@@ -1,8 +1,15 @@
content_generator:
gemini_model: "gemini-1.5-pro-latest"
llm_model: "gemini-1.5-pro-latest"
meta_llm_model: "gemini-1.5-flash"
max_output_tokens: 8192
prompt_template: "souzatharsis/podcastfy_multimodal_cleanmarkup"
prompt_commit: "6c74ab51"
prompt_commit: "b2365f11"
longform_prompt_template: "souzatharsis/podcastfy_longform"
longform_prompt_commit: "d6ac4601"
cleaner_prompt_template: "souzatharsis/podcastfy_longform_clean"
cleaner_prompt_commit: "8c110a0b"
rewriter_prompt_template: "souzatharsis/podcast_rewriter"
rewriter_prompt_commit: "6789eeca"
content_extractor:
youtube_url_patterns:
- "youtube.com"
Expand Down
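
The config changes above pin each prompt template to a specific commit. A sketch of how such keys might be resolved; the dict mirrors `config.yaml`, but the lookup helper `prompt_ref` is an assumption, not the package's real loader:

```python
# Hypothetical sketch: the dict mirrors the config.yaml keys above, but
# prompt_ref is an assumed helper, not podcastfy's actual loader.
config = {
    "content_generator": {
        "prompt_template": "souzatharsis/podcastfy_multimodal_cleanmarkup",
        "prompt_commit": "b2365f11",
        "longform_prompt_template": "souzatharsis/podcastfy_longform",
        "longform_prompt_commit": "d6ac4601",
    }
}

def prompt_ref(cfg, longform=False):
    """Return a 'template:commit' reference, pinning the prompt to a commit."""
    gen = cfg["content_generator"]
    key = "longform_prompt_template" if longform else "prompt_template"
    return f"{gen[key]}:{gen[key.replace('template', 'commit')]}"

assert prompt_ref(config, longform=True) == "souzatharsis/podcastfy_longform:d6ac4601"
```

Pinning to a commit hash keeps transcript generation reproducible even if the hosted prompt template is later revised.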