
duplicate content on long form? #202

Open
jtoy opened this issue Nov 30, 2024 · 7 comments

Comments

@jtoy

jtoy commented Nov 30, 2024

I am noticing duplicate content: in the podcast, the same concept is repeated 3+ times in long form.
I call it like this:
```
python -m podcastfy.client --text #{safe_text} --topic #{topic} -cc #{config}.yml --longform
```

this is the config I am using:
```yaml
conversation_style:
  - "engaging"
  - "fast-paced"
  - "innovation"
roles_person1: "main summarizer"
roles_person2: "questioner/clarifier"
dialogue_structure:
  - "Discuss current news"
  - "Conclusion"
podcast_name: "topic Diaries"
podcast_tagline: "Diving deep into the topic"
output_language: "English"
engagement_techniques:
  - "rhetorical questions"
  - "anecdotes"
  - "analogies"
  - "debating"
  - "humor"
creativity: 1
max_num_chunks: 8 # maximum number of rounds of discussion in longform
min_chunk_size: 888 # minimum number of characters to generate a round of discussion in longform
```

How can one control for this? I am assuming it has something to do with the chunk sizes?

@Webdinero

```yaml
conversation_style:
  - "engaging"
  - "concise"
  - "thought-provoking"
roles_person1: "main summarizer"
roles_person2: "fact-checker"
dialogue_structure:
  - "Discuss current news"
  - "Explore implications"
  - "Provide actionable takeaways"
  - "Conclusion"
podcast_name: "Topic Diaries"
podcast_tagline: "Diving deep into the topic"
output_language: "English"
engagement_techniques:
  - "analogies"
  - "debating"
creativity: 0.8
max_num_chunks: 5
min_chunk_size: 600
```

@jtoy
Author

jtoy commented Nov 30, 2024

Just verifying: I see a few changes. Is the main change to the chunk size and max chunks? I'll test that, but why would that fix it?

@ivanmkc

ivanmkc commented Dec 2, 2024

Looking at the code for LongFormContentGenerator, there doesn't seem to be an "LLM-based reducer" operation that merges all generated chunk-conversation parts into a coherent conversation.

i.e. each chunk is processed independently, with the running output reinjected into the prompt. Dedup is done via enhance_prompt_params's instruction:

"""
Podcast conversation so far is given in CONTEXT.
Continue the natural flow of conversation. Follow-up on the very previous point/question without repeating topics or points already discussed!"
"""
This doesn't seem robust enough.
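To illustrate, a minimal sketch of the loop described above (this is illustrative, not the actual podcastfy source; the function and prompt layout are assumptions, though the instruction text is the one quoted above):

```python
def generate_longform(chunks, llm):
    """Sketch: generate a conversation chunk by chunk, reinjecting the
    running transcript as CONTEXT into each prompt. Dedup rests entirely
    on the prompt instruction, with no post-hoc reduction pass."""
    transcript = ""
    for chunk in chunks:
        prompt = (
            "Podcast conversation so far is given in CONTEXT.\n"
            "Continue the natural flow of conversation. Follow-up on the "
            "very previous point/question without repeating topics or "
            "points already discussed!\n"
            f"CONTEXT:\n{transcript}\n\nNEW MATERIAL:\n{chunk}"
        )
        # each chunk is processed independently; nothing later verifies
        # that the instruction was actually followed
        transcript += llm(prompt)
    return transcript
```

Since each call only sees the accumulated text plus an instruction, any repetition the model introduces is never caught downstream.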

_clean_transcript_response_DEPRECATED used to have LLM calls that could be appropriated to dedup but the new _clean_transcript_response does not call any LLM.

Seems like there could be a benefit in having a configurable post-processing/reduction step via LLM.
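Something along these lines, as a rough sketch of what a configurable reduction step could look like (names and prompt wording are hypothetical, not podcastfy API):

```python
# Hypothetical dedup/reduction pass: after all chunks are generated, one
# extra LLM call rewrites the full transcript to remove repeated points.
DEDUP_PROMPT = (
    "Below is a podcast transcript assembled from independently generated "
    "parts. Rewrite it as one coherent conversation, removing repeated "
    "topics or points while preserving speaker tags and overall length.\n\n"
    "TRANSCRIPT:\n{transcript}"
)

def reduce_transcript(transcript, llm, enabled=True):
    """Optionally run a post-processing dedup pass over the full transcript."""
    if not enabled:  # configurable: skip the extra LLM call entirely
        return transcript
    return llm(DEDUP_PROMPT.format(transcript=transcript))
```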

@ivanmkc

ivanmkc commented Dec 2, 2024

Something like: #205

Seems to work decently for me.

@jtoy
Author

jtoy commented Dec 8, 2024

The changes from earlier (adjusting the numbers) didn't work for me.
@ivanmkc are you running a custom branch?

@jtoy
Author

jtoy commented Dec 16, 2024

I ended up using my own custom system to generate the transcript as a stopgap. It clocks in at around 8 minutes' worth of content, which is good enough for me for now.

@souzatharsis
Owner

souzatharsis commented Dec 16, 2024 via email
