Merge pull request #180 from souzatharsis/feat/newTTS

v0.4.0 - add Google's TTS models
souzatharsis · Nov 16, 2024 · a5b707c
2 parents b4c5d4b + e6f2b5a
commit a5b707c
Showing 58 changed files with 1,727 additions and 448 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,13 @@
 # Changelog
 
+
+## [0.4.0] - 2024-11-16
+
+### Added
+- Add Google Singlespeaker (Journey) and Multispeaker TTS models 
+- Fixed limitations of Google Multispeaker TTS model: 5000-byte input limit and 500-byte per-turn limit.
+- Updated tests and docs accordingly
+
 ## [0.3.6] - 2024-11-13
 
 ### Added
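
The changelog entry above mentions two byte limits for Google's Multispeaker TTS model (5000 bytes per input, 500 bytes per turn). The following is a minimal sketch of how a long turn might be split to stay under a byte limit. It is not the code from this commit: the function name `split_by_bytes`, the UTF-8 assumption, and the whitespace-based splitting are all hypothetical choices made for illustration.

```python
from typing import List


def split_by_bytes(text: str, max_bytes: int = 500) -> List[str]:
    """Split text into chunks whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper: splits on whitespace only. A single word longer
    than max_bytes is kept whole rather than cut mid-word.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        # Accept the word if it fits, or if the chunk is still empty
        # (an oversized single word becomes its own chunk).
        if len(candidate.encode("utf-8")) <= max_bytes or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    turn = "word " * 300
    parts = split_by_bytes(turn, max_bytes=500)
    print(len(parts), [len(p.encode("utf-8")) for p in parts])
```

Measuring the encoded length rather than `len(text)` matters here because non-ASCII characters occupy more than one byte in UTF-8.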

diff --git a/README.md b/README.md
@@ -31,24 +31,23 @@ Unlike closed-source UI-based tools focused primarily on research synthesis (e.g
 [![Star History Chart](https://api.star-history.com/svg?repos=souzatharsis/podcastfy&type=Date&theme=dark)](https://api.star-history.com/svg?repos=souzatharsis/podcastfy&type=Date&theme=dark)
 
 ## Audio Examples 🔊
-This sample collection is also [available at audio.com](https://audio.com/thatupiso/collections/podcastfy).
+This sample collection was generated using this [Python Notebook](usage/examples.ipynb).
 
 ### Images
 
-| Image Set | Description | Audio |
+| Audio | Description | Image Set |
 |:--|:--|:--|
-| <img src="data/images/Senecio.jpeg" alt="Senecio, 1922 (Paul Klee)" width="20%" height="auto"> <img src="data/images/connection.jpg" alt="Connection of Civilizations (2017) by Gheorghe Virtosu " width="21.5%" height="auto"> | Senecio, 1922 (Paul Klee) and Connection of Civilizations (2017) by Gheorghe Virtosu  | [<span style="font-size: 25px;">🔊</span>](https://audio.com/thatupiso/audio/output-file-abstract-art) |
-| <img src="data/images/japan_1.jpg" alt="The Great Wave off Kanagawa, 1831 (Hokusai)" width="20%" height="auto"> <img src="data/images/japan2.jpg" alt="Takiyasha the Witch and the Skeleton Spectre, c. 1844 (Kuniyoshi)" width="21.5%" height="auto"> | The Great Wave off Kanagawa, 1831 (Hokusai) and Takiyasha the Witch and the Skeleton Spectre, c. 1844 (Kuniyoshi) | [<span style="font-size: 25px;">🔊</span>](https://audio.com/thatupiso/audio/output-file-japan) |
-| <img src="data/images/taylor.png" alt="Taylor Swift" width="28%" height="auto"> <img src="data/images/monalisa.jpeg" alt="Mona Lisa" width="10.5%" height="auto"> | Pop culture icon Taylor Swift and Mona Lisa, 1503 (Leonardo da Vinci) | [<span style="font-size: 25px;">🔊</span>](https://audio.com/thatupiso/audio/taylor-monalisa) |
+| <video src="usage/video/senecio.mp4"></video> | Senecio, 1922 (Paul Klee) and Connection of Civilizations (2017) by Gheorghe Virtosu | <img src="data/images/Senecio.jpeg" alt="Senecio, 1922 (Paul Klee)" width="20%" height="auto"> <img src="data/images/connection.jpg" alt="Connection of Civilizations (2017) by Gheorghe Virtosu " width="21.5%" height="auto"> |
+| <video src="usage/video/japan.mp4"></video> | The Great Wave off Kanagawa, 1831 (Hokusai) and Takiyasha the Witch and the Skeleton Spectre, c. 1844 (Kuniyoshi) | <img src="data/images/japan_1.jpg" alt="The Great Wave off Kanagawa, 1831 (Hokusai)" width="20%" height="auto"> <img src="data/images/japan2.jpg" alt="Takiyasha the Witch and the Skeleton Spectre, c. 1844 (Kuniyoshi)" width="21.5%" height="auto"> |
+| <video src="usage/video/taylor.mp4"></video> | Pop culture icon Taylor Swift and Mona Lisa, 1503 (Leonardo da Vinci) | <img src="data/images/taylor.png" alt="Taylor Swift" width="28%" height="auto"> <img src="data/images/monalisa.jpeg" alt="Mona Lisa" width="10.5%" height="auto"> |
+
 
 ### Text
-| Content Type | Description | Audio | Source |
-|--------------|-------------|-------|--------|
-| Youtube Video | YCombinator on LLMs | [Audio](https://audio.com/thatupiso/audio/ycombinator-llms) | [YouTube](https://www.youtube.com/watch?v=eBVi_sLaYsc) |
-| PDF | Book: Networks, Crowds, and Markets | [Audio](https://audio.com/thatupiso/audio/networks) | book pdf |
-| Research Paper | Climate Change in France | [Audio](https://audio.com/thatupiso/audio/agro-paper) | [PDF](./data/pdf/s41598-024-58826-w.pdf) |
-| Website | My Personal Website | [Audio](https://audio.com/thatupiso/audio/tharsis) | [Website](https://www.souzatharsis.com) |
-| Website + YouTube | My Personal Website + YouTube Video on AI | [Audio](https://audio.com/thatupiso/audio/tharsis-ai) | [Website](https://www.souzatharsis.com), [YouTube](https://www.youtube.com/watch?v=sJE1dE2dulg) |
+| Audio | Description | Content Type | Source |
+|-------|-------------|--------------|--------|
+| <video src="usage/video/taylor.mp4"></video>  | Personal Website | Website | [Website](https://www.souzatharsis.com) |
+| [Audio](https://soundcloud.com/high-lander123/amodei?in=high-lander123/sets/podcastfy-sample-audio-longform&si=b8dfaf4e3ddc4651835e277500384156) | Lex Fridman Podcast: Dario Amodei, Anthropic's CEO | YouTube | [YouTube](https://www.youtube.com/watch?v=ugvHCXCOmm4) |
+| [Audio](https://soundcloud.com/high-lander123/benjamin?in=high-lander123/sets/podcastfy-sample-audio-longform&si=dca7e2eec1c94252be18b8794499959a&utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing) | Benjamin Franklin's Autobiography | YouTube | [Book](https://www.youtube.com/watch?v=ugvHCXCOmm4) |
 
 ### Multi-Lingual Text
 | Language | Content Type | Description | Audio | Source |
@@ -58,7 +57,7 @@ This sample collection is also [available at audio.com](https://audio.com/thatup
 
 ## Features ✨
 
-- Generate conversational content from multiple sources and formats (images, websites, YouTube, and PDFs).
+- Generate conversational content from multiple sources and formats (images, text, websites, YouTube, and PDFs).
 - Generate shorts (2-5 minutes) or longform (30+ minutes) podcasts.
 - Customize transcript and audio generation (e.g., style, language, structure).
 - Generate transcripts using 100+ LLM models (OpenAI, Anthropic, Google etc).
@@ -75,13 +74,13 @@ This sample collection is also [available at audio.com](https://audio.com/thatup
 - [Podcastfy-HuggingFace App](https://huggingface.co/spaces/thatupiso/Podcastfy.ai_demo)
 - [Podcastfy-UI](https://github.com/giulioco/podcastfy-ui)
 
-## Updates 🚀
+## Updates 🚀🚀
 
-### v0.3.6+ release
-- Generate shorts or longform podcasts!
-- Generate podcasts from input topic using real-time internet search
+### v0.4.0+ release
+- Released new Multi-Speaker TTS model (is it the one NotebookLM uses?!?)
+- Generate short or longform podcasts
+- Generate podcasts from input topic using grounded real-time web search
 - Integrate with 100+ LLM models (OpenAI, Anthropic, Google etc) for transcript generation
-- Integrate with Google's Multispeaker TTS model for high-quality audio generation
 
 See [CHANGELOG](CHANGELOG.md) for more details.
 
@@ -112,13 +111,14 @@ python -m podcastfy.client --url <url1> --url <url2>
 
 - [Python Package Quickstart](podcastfy.ipynb)
 
+- [How to](usage/how-to.md)
+
 - [Python Package Reference Manual](https://podcastfy.readthedocs.io/en/latest/podcastfy.html)
 
 - [REST API Reference Manual](usage/api.md)
 
 - [CLI](usage/cli.md)
 
-- [How to](usage/how-to.md)
 
 Experience Podcastfy with our [HuggingFace](https://huggingface.co/spaces/thatupiso/Podcastfy.ai_demo) 🤗 Spaces app. (Note: This UI app is less extensively tested than the Python package.)
 
@@ -132,7 +132,7 @@ Podcastfy offers a range of customization options to tailor your AI-generated po
 
 ## License
 
-This software is licensed under [Apache 2.0](LICENSE). [Here](usage/license-guide.md) are a few instructions if you would like to use podcastfy in your software.
+This software is licensed under [Apache 2.0](LICENSE). See [instructions](usage/license-guide.md) if you would like to use podcastfy in your software.
 
 ## Contributing 🤝
 

diff --git a/data/transcripts/Tharsis.txt b/data/transcripts/Tharsis.txt
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -9,7 +9,7 @@
 project = 'podcastfy'
 copyright = '2024, Tharsis T. P. Souza'
 author = 'Tharsis T. P. Souza'
-release = 'v0.3.1'
+release = 'v0.4.0'
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

diff --git a/paper/paper.md b/paper/paper.md
@@ -43,7 +43,6 @@ The rapid expansion of digital content across various formats has intensified th
 
 See [audio samples](https://github.com/souzatharsis/podcastfy?tab=readme-ov-file#audio-examples-).
 
-<!--
 # Use Cases
 
 `Podcastfy` is designed to serve a wide range of applications, including:
@@ -55,7 +54,6 @@ See [audio samples](https://github.com/souzatharsis/podcastfy?tab=readme-ov-file
 - **Researchers** can convert research papers, visual data, and technical content into conversational audio. This makes it easier for a wider audience, including those with disabilities, to consume and understand complex scientific information. Researchers can also create audio summaries of their work to enhance accessibility.
 
 - **Accessibility Advocates** can use `Podcastfy` to promote digital accessibility by providing a tool that converts multimodal content into auditory formats. This helps individuals with visual impairments, dyslexia, or other disabilities that make it challenging to consume written or visual content.
--->
 
 
 # Implementation and Architecture
@@ -190,7 +188,6 @@ generate_podcast(
 The roles are set to "expert developer" and "learning developer" to create a natural teaching dynamic. The dialogue structure follows a logical progression from concept introduction through implementation and best practices. The engagement_techniques parameter ensures the content remains practical and applicable by incorporating code examples, real-world applications, and troubleshooting guidance. A moderate creativity setting (0.4) maintains technical accuracy while allowing for engaging explanations and examples.
 
 
-<!--
 ## Storytelling Adventure
 
 The following Python code demonstrates how to generate a storytelling podcast:
@@ -359,7 +356,6 @@ This example demonstrates how to use the `TextToSpeech` class to convert generat
   - May require additional processing for users with specific accessibility needs.
 
 These limitations highlight areas for future development and improvement of the framework. Users should carefully consider these constraints when implementing `Podcastfy` for their specific use cases and requirements.
--->
 
 # Limitations
 
@@ -372,11 +368,8 @@ These limitations highlight areas for future development and improvement of the
 
 `Podcastfy` contributes to multimodal content accessibility by enabling the programmatic transformation of digital content into conversational audio. The framework addresses accessibility needs through automated content summarization and natural-sounding speech synthesis. Its modular design and configurable options allow for flexible content processing and audio generation workflows that can be adapted for different use cases and requirements.
 
-<!--
-As an open-source project, `Podcastfy` benefits from continuous community-driven improvements and adaptations, helping support its long-term value and relevance in meeting evolving user requirements and accessibility standards.
-
 We invite contributions from the community to further enhance the capabilities of `Podcastfy`. Whether it's by adding support for new input modalities, improving the quality of conversation generation, or optimizing the TTS synthesis, we welcome collaboration to make `Podcastfy` more powerful and versatile.
--->
+
 
 # Acknowledgements
 

diff --git a/podcastfy.ipynb b/podcastfy.ipynb
diff --git a/podcastfy/client.py b/podcastfy/client.py
@@ -19,12 +19,24 @@
 from typing import List, Optional, Dict, Any
 import copy
 
+import logging
+
+# Example logging configuration (disabled): show all levels and write to both file and console
+# logging.basicConfig(
+#     level=logging.DEBUG,  # Show all levels of logs
+#     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+#     handlers=[
+#         logging.FileHandler('podcastfy.log'),  # Save to file
+#         logging.StreamHandler()  # Print to console
+#     ]
+# )
+
 
 logger = setup_logger(__name__)
 
 app = typer.Typer()
 
-os.environ["LANGCHAIN_TRACING_V2"] = "false"
+os.environ["LANGCHAIN_TRACING_V2"] = "False"
 
 
 def process_content(
@@ -70,7 +82,10 @@ def process_content(
                 content_extractor = ContentExtractor()
 
             content_generator = ContentGenerator(
-                api_key=config.GEMINI_API_KEY, conversation_config=conv_config.to_dict()
+                is_local=is_local,
+                model_name=model_name,
+                api_key_label=api_key_label,
+                conversation_config=conv_config.to_dict()
             )
 
             combined_content = ""
@@ -102,16 +117,13 @@ def process_content(
                 combined_content,
                 image_file_paths=image_paths or [],
                 output_filepath=transcript_filepath,
-                is_local=is_local,
-                model_name=model_name,
-                api_key_label=api_key_label,
                 longform=longform
             )
 
         if generate_audio:
             api_key = None
             if tts_model != "edge":
-                api_key = getattr(config, f"{tts_model.upper()}_API_KEY")
+                api_key = getattr(config, f"{tts_model.upper().replace('MULTI', '')}_API_KEY")
 
             text_to_speech = TextToSpeech(
                 model=tts_model,
@@ -300,6 +312,7 @@ def generate_podcast(
         Optional[str]: Path to the final podcast audio file, or None if only generating a transcript.
     """
     try:
+        print("Generating podcast...")
         # Load default config
         default_config = load_config()
 
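The `api_key` lookup changed in the hunk above strips `MULTI` from the upper-cased TTS model name, so a multi-speaker model name resolves to the same config attribute as its single-speaker counterpart. A small sketch of the attribute names that expression produces follows. The helper `api_key_attr` is hypothetical, and the model names `openai`, `gemini`, and `geminimulti` are assumptions used only for illustration.

```python
def api_key_attr(tts_model: str) -> str:
    """Mirror the expression in the diff: upper-case, drop 'MULTI', append '_API_KEY'."""
    return f"{tts_model.upper().replace('MULTI', '')}_API_KEY"


if __name__ == "__main__":
    # Model names below are assumed for illustration only.
    for name in ("openai", "gemini", "geminimulti"):
        print(name, "->", api_key_attr(name))
```

Note that `str.replace` removes every occurrence of `MULTI`, so any model name that happens to contain that substring elsewhere would also be altered.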

diff --git a/podcastfy/config.yaml b/podcastfy/config.yaml
@@ -1,15 +1,15 @@
 content_generator:
   llm_model: "gemini-1.5-pro-latest"
-  meta_llm_model: "gemini-1.5-flash"
+  meta_llm_model: "gemini-1.5-pro-latest"
   max_output_tokens: 8192
   prompt_template: "souzatharsis/podcastfy_multimodal_cleanmarkup"
   prompt_commit: "b2365f11"
   longform_prompt_template: "souzatharsis/podcastfy_longform"
-  longform_prompt_commit: "d6ac4601"
+  longform_prompt_commit: "acfdbc91" #"ff865019"
   cleaner_prompt_template: "souzatharsis/podcastfy_longform_clean"
   cleaner_prompt_commit: "8c110a0b"
   rewriter_prompt_template: "souzatharsis/podcast_rewriter"
-  rewriter_prompt_commit: "6789eeca"
+  rewriter_prompt_commit: "8ee296fb"
 content_extractor:
   youtube_url_patterns:
     - "youtube.com"