-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring enabling the creation of a lower level API with the Podcast Class #80
base: main
Are you sure you want to change the base?
Conversation
Lower level api final
Further, did you fetch from main before submitting the PR? |
I just merged and push again. |
Thank you ! I see only one failing, but can't really work on it on my computer because I always reach "podcastfy.client:client.py:334 An error occurred: 429 Resource has been exhausted". But maybe it's a side effect of the very same problem that make this test fail? I don't mind help on this one :). |
logger = logging.getLogger(__name__) | ||
|
||
|
||
class DefaultPodcastifyTranscriptEngine(LLMBackend): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DefaultPodcastifyTranscriptEngine is a class 'hardcoded' in a file named 'gemini_langchain.py'
What if we decide for another base llm model as default?
Further the logic implemented by this class has nothing to do with Gemini nor langchain even though it's in gemini_langchain.py
It does sound like this file is here to be backward compatible with the current version in main.py when instead we should move to a unified version such that LLM generic logic should reside under aiengines>llm and podcast content generation logic (post-llm) should live in content_generator.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the phone right now, but it seems that currently it's all about langchain and gemini here? Yes it's absolutely being backward compatible and not forcing other abstractions on the project. I do think you want an abstraction at an intermediate level to easily swap the llm engine but by keeping most of the business logic in this class. But is it something we can do post merge? The current naming and design of this class is not good for sure. The real question is maybe about if you accept or not the current lowest level api for the engines defined by the ABC. There will be another very interesting layer beneath for sure !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it does not make sense to merge a refactor that we already know will need to be refactored.
Let's merge into main small but frequent PRs that are complete.
|
||
logger = setup_logger(__name__) | ||
|
||
app = typer.Typer() | ||
|
||
def create_characters(config: Dict[str, Any]) -> List[Character]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Characters are not considering input conversation_config! This is moving us backwards from a functionality perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a function to make this character things backward compatible. The Character primitive is very promising for this project IMO.
But I don't think I really understand this comment, why does it remove current functionalities ?
communicate = edge_tts.Communicate(text, config.voice) | ||
await communicate.save(str(output_path)) | ||
|
||
# register |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you register OpenAI Async as well as EdgeTTS Sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you!
return [host, guest] | ||
|
||
|
||
def create_tts_backends(config: Config) -> List[TTSBackend]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we instantiate all available backends and then later filter out all but the one the user wants? Shouldn't we instead only instantiate what the user needs? I recommend a TTS Factory design pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the massive PR!
I've added a couple of inline comments. But here's the summary recommendation:
This is a large PR, I would say this is not a Merge but instead a full re-rewrite. Besides solving minor comments added now, I'd recommend the following steps:
-
Sync up with main: There were several critical bugs solved in main that are still present in dev. Further, several tests were added which will increase our confidence with this re-write. Let's sync up the branches first.
-
Let's break your changes down into smaller PRs so we can incrementally re-write the repo safely. Recommended components in order:
- First, core + related tests. Here we make sure the new core abstraction works. This update is safe because it won't break the original code.
- Second, LLMs + related tests + client.py
- Third, TTS + related tests + client.py
Feel free to recommend a different way to break up the PR since at this point you know better this new proposed version than myself.
With that approach we can more safely re-write the entire package.
What do you think?
Also, I'd say step 1, i.e. core/ objects are really the true value add here in terms of abstraction. If we get that done, it's already a major release. It would be important to write a short python notebook showcasing how developers can take advantage of that level of abstraction. It will be helpful even to myself to learn this new way. After that it will be straight-forward to accomplish step 2 and 3, next. |
raise ValueError(f"TTS backend '{tts_name}' not configured for this character") | ||
self.preferred_tts = tts_name | ||
|
||
def to_prompt(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be the opposite. Character shouldn't even be aware there exists LLMs around them. LLMs are just a technical solution. Instead, when we compose the prompt in the ai engine, character information should be passed to dynamically compose the prompt.
self.role = role | ||
self.tts_configs = tts_configs | ||
self.default_description_for_llm = default_description_for_llm | ||
self.preferred_tts = next(iter(tts_configs.keys()), None) # Set first TTS as default, can be None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why should a character be aware of TTS models?
A Character should describe attributes and behavior of a Podcast participant.
I'd argue perhaps only the Transcript generation should be aware of Character information.
|
||
|
||
class AudioManager: | ||
def __init__(self, tts_backends: Dict[str, TTSBackend], audio_format, n_jobs: int = 4, file_prefix: str = "", audio_temp_dir: str = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default values should live in config.yaml
n_jobs: int = 4
file_prefix: str = ""
audio_temp_dir: str = None
def _get_tts_backend(self, segment): | ||
tts_backend = self.tts_backends.get(segment.speaker.preferred_tts) | ||
if tts_backend is None: | ||
# Take the first available TTS backend |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why take the first and not the default or user defined?
""" | ||
self.content = content | ||
self.llm_backend = llm_backend | ||
self.characters: Dict[str, Character] = {char.name: char for char in (characters or [Character("Host", "Podcast host", {}), Character("Guest", "Expert guest", {})])} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should take from conversation_config (default or user provided) instead of static values
Hello @souzatharsis thanks a lot for the review! I do agree with most of your points, including the fact that it should have been done in multiple PRs. You can also pick and choose parts that that you think are interesting at your own pace. The last option would be to bite the bullet, merging this knowing that improvements are to be made, including some important changes in the abstractions suggested in this PR. I did good efforts to add tests and ensure that this branch is backward compatible and that tests are passing (including new ones). If we go this route I can spend (synchronously with you or not) a burst of energy to fix what you see high priority stuff and rebase again on merge. But I won't be able to sustain the 3 branchs thing, I am totally certain of that. I am not sure I can really divide the work in multiple branchs and focus with a single one with the fact that this Character things is used almost everywhere. But if you see a path that works for you, I could help. Whatever the option you pick, no ill feelings on my side, working on it was a blast. I hope you will understand my position. I am sorry about the added workload that my frenziness has created on your side. |
I have an idea. Let me create the mini PRs based on your code. And then you are my reviewer / approver. That should reduce the burden on you while taking advantage of the great work you did. What do you think? |
Yeah, I think that one of the most comfortable options for me right now, thank you! Edit: If you have made your mind on this, we could change the status of this one as |
Creation of the
Podcast
allowing a fine level of control in the podcast creation as a step machine.Works with tts and llm engines abiding to simple Abstract Based Classes to ease up future integrations.
tts engines can be both async or sync. If an async tts engine is used, then audio processing takes place in parallel (and in thread for sync engines). It's possible to control the number of jobs in both case.
Introduction to other classes such as
Character
,Transcript
,AudioManager
The behavior of process_content is unchanged and the tests pass.
Created unit tests for the Podcast class (with mocked tts and llm engines)
Added tests for transcript handling, saving, loading and cleaning.
Some code is made obsolete :
text_to_speech.py
for which the business logic has been put in different placesprocess_links
that would be replaced by the body ofprocess_links_v2
which uses thePodcast
class). I preferred to wait to the last moment to commit their deletion.Many things could be improved, but this PR is already too large and should be under discussion beforehand anyway.