
duplicate content on long form? #202

Open
jtoy opened this issue Nov 30, 2024 · 7 comments

Comments

@jtoy

jtoy commented Nov 30, 2024

I am noticing duplicate content: in the podcast, the same concept is repeated 3+ times in long form.
I call it like this:
```
python -m podcastfy.client --text #{safe_text} --topic #{topic} -cc #{config}.yml --longform
```

this is the config I am using:
```yaml
conversation_style:
  - "engaging"
  - "fast-paced"
  - "innovation"
roles_person1: "main summarizer"
roles_person2: "questioner/clarifier"
dialogue_structure:
  - "Discuss current news"
  - "Conclusion"
podcast_name: "topic Diaries"
podcast_tagline: "Diving deep into the topic"
output_language: "English"
engagement_techniques:
  - "rhetorical questions"
  - "anecdotes"
  - "analogies"
  - "debating"
  - "humor"
creativity: 1
max_num_chunks: 8 # maximum number of rounds of discussion in longform
min_chunk_size: 888 # minimum number of characters to generate a round of discussion in longform
```

How can one control for this? I am assuming it has something to do with the chunk sizes?

@Webdinero

```yaml
conversation_style:
  - "engaging"
  - "concise"
  - "thought-provoking"
roles_person1: "main summarizer"
roles_person2: "fact-checker"
dialogue_structure:
  - "Discuss current news"
  - "Explore implications"
  - "Provide actionable takeaways"
  - "Conclusion"
podcast_name: "Topic Diaries"
podcast_tagline: "Diving deep into the topic"
output_language: "English"
engagement_techniques:
  - "analogies"
  - "debating"
creativity: 0.8
max_num_chunks: 5
min_chunk_size: 600
```

@jtoy
Author

jtoy commented Nov 30, 2024

Just verifying: I see a few changes. Is the main change to the chunk size and max chunks? I'll test that, but why would that fix it?

@ivanmkc

ivanmkc commented Dec 2, 2024

Looking at the code for LongFormContentGenerator, there doesn't seem to be an "LLM-based reducer" operation that merges all generated chunk-conversation parts into a coherent conversation.

i.e. each chunk is processed independently, with the running output reinjected into the prompt. Dedup is done via enhance_prompt_params's instruction:

"""
Podcast conversation so far is given in CONTEXT.
Continue the natural flow of conversation. Follow-up on the very previous point/question without repeating topics or points already discussed!"
"""
This doesn't seem robust enough.
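To illustrate, a minimal sketch of the loop described above (this is illustrative, not the actual podcastfy source; the function and prompt layout are assumptions, though the instruction text is the one quoted above):

```python
def generate_longform(chunks, llm):
    """Sketch: generate a conversation chunk by chunk, reinjecting the
    running transcript as CONTEXT into each prompt. Dedup rests entirely
    on the prompt instruction, with no post-hoc reduction pass."""
    transcript = ""
    for chunk in chunks:
        prompt = (
            "Podcast conversation so far is given in CONTEXT.\n"
            "Continue the natural flow of conversation. Follow-up on the "
            "very previous point/question without repeating topics or "
            "points already discussed!\n"
            f"CONTEXT:\n{transcript}\n\nNEW MATERIAL:\n{chunk}"
        )
        # each chunk is processed independently; nothing later verifies
        # that the instruction was actually followed
        transcript += llm(prompt)
    return transcript
```

Since each call only sees the accumulated text plus an instruction, any repetition the model introduces is never caught downstream.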

_clean_transcript_response_DEPRECATED used to have LLM calls that could be appropriated to dedup but the new _clean_transcript_response does not call any LLM.

Seems like there could be a benefit in having a configurable post-processing/reduction step via LLM.
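Something along these lines, as a rough sketch of what a configurable reduction step could look like (names and prompt wording are hypothetical, not podcastfy API):

```python
# Hypothetical dedup/reduction pass: after all chunks are generated, one
# extra LLM call rewrites the full transcript to remove repeated points.
DEDUP_PROMPT = (
    "Below is a podcast transcript assembled from independently generated "
    "parts. Rewrite it as one coherent conversation, removing repeated "
    "topics or points while preserving speaker tags and overall length.\n\n"
    "TRANSCRIPT:\n{transcript}"
)

def reduce_transcript(transcript, llm, enabled=True):
    """Optionally run a post-processing dedup pass over the full transcript."""
    if not enabled:  # configurable: skip the extra LLM call entirely
        return transcript
    return llm(DEDUP_PROMPT.format(transcript=transcript))
```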

@ivanmkc

ivanmkc commented Dec 2, 2024

Something like: #205

Seems to work decently for me.

@jtoy
Author

jtoy commented Dec 8, 2024

The changes from earlier (adjusting the numbers) didn't work for me.
@ivanmkc are you running a custom branch?

@jtoy
Author

jtoy commented Dec 16, 2024

I ended up using my own custom system to generate the transcript as a stopgap. It clocks in at around 8 minutes' worth of content, which is good enough for me for now.

@souzatharsis
Owner

souzatharsis commented Dec 16, 2024 via email
