duplicate content on long form? #202
Comments
Just verifying: I see a few changes. Is the main change to chunk size and max chunk? I'll test that. Why would that fix it?
Looking at the code for LongFormContentGenerator, there doesn't seem to be an LLM-based "reducer" step that combines all the generated chunk conversations into one coherent conversation; each chunk is processed independently, with the running output reinjected into the prompt. Dedup is only attempted via enhance_prompt_params's instruction: """ _clean_transcript_response_DEPRECATED used to make LLM calls that could be appropriated for dedup, but the new _clean_transcript_response does not call any LLM. There could be a benefit in having a configurable post-processing/reduction step via LLM.
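As a rough illustration of what such a post-processing step could look like (hypothetical sketch, not part of podcastfy's current API; `dedup_transcript` and the 0.9 similarity threshold are my own inventions, and I'm assuming the transcript uses podcastfy-style <Person1>/<Person2> turn tags), a non-LLM near-duplicate filter over dialogue turns:

```python
import difflib
import re


def dedup_transcript(transcript: str, threshold: float = 0.9) -> str:
    """Drop near-duplicate dialogue turns from a generated transcript.

    Hypothetical post-processing pass; podcastfy does not ship this.
    Assumes turns are delimited by <Person1>...</Person1> / <Person2>...</Person2>.
    """
    turns = re.findall(r"<(Person[12])>(.*?)</\1>", transcript, re.DOTALL)
    kept = []
    for speaker, text in turns:
        text = text.strip()
        # Skip this turn if it is highly similar to any turn already kept.
        if any(
            difflib.SequenceMatcher(None, text, prev).ratio() >= threshold
            for _, prev in kept
        ):
            continue
        kept.append((speaker, text))
    return "".join(f"<{s}>{t}</{s}>" for s, t in kept)
```

An LLM-based reducer would replace the `SequenceMatcher` check with a model call over the combined transcript, but the filter shape (map over turns, keep or drop) would be the same.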
Something like #205 seems to work decently for me.
The earlier change to the numbers didn't work for me.
I ended up using my own custom system to generate the transcript as a stopgap. It clocks in at around 8 minutes' worth of content, which is good enough for me for now.
I am glad you worked it out, jtoy!
BTW: I was listening to some NotebookLM audio overviews recently and noticed repetition there too, which suggests repetition in the input can be reflected in the output.
Best Regards,
Thársis
souzatharsis.com
I am noticing duplicate content: in the podcast, the same concept gets repeated 3+ times in long form.
I call it like this:
python -m podcastfy.client --text #{safe_text} --topic #{topic} -cc #{config}.yml --longform
this is the config I am using:
conversation_style:
roles_person1: "main summarizer"
roles_person2: "questioner/clarifier"
dialogue_structure:
podcast_name: "topic Diaries"
podcast_tagline: "Diving deep into the topic"
output_language: "English"
engagement_techniques:
creativity: 1
max_num_chunks: 8 # maximum number of rounds of discussions in longform
min_chunk_size: 888 # minimum number of characters to generate a round of discussion in longform
How can one control for this? I assume it has something to do with the chunk sizes?
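For what it's worth, my understanding of how these two settings interact is roughly the following (a sketch of the config semantics as I read them, not podcastfy's actual code; `num_chunks` is a made-up helper): the input can yield at most one discussion round per min_chunk_size characters, capped at max_num_chunks.

```python
def num_chunks(text_length: int, min_chunk_size: int, max_num_chunks: int) -> int:
    """Illustrative upper bound on longform discussion rounds.

    Hypothetical sketch of the config semantics, not podcastfy internals:
    one round per min_chunk_size characters of input, capped at
    max_num_chunks, and always at least one round.
    """
    return max(1, min(max_num_chunks, text_length // min_chunk_size))
```

With the config above (min_chunk_size: 888, max_num_chunks: 8), a 10,000-character input would hit the cap of 8 rounds, so shrinking max_num_chunks or growing min_chunk_size reduces how many rounds cover the same material.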