Actual sentence and sentence_idx #158

VeritasJoker · 2023-04-23T15:16:28Z

We are trying to use BERT to generate podcast embeddings and want to feed it per-sentence input.

Do we have actual sentence and sentence_idx in the datum? Right now, the sentence and sentence_idx columns we have in the datums are for utterances. For podcast, we have sentence = None and sentence_idx = 1 for all tokens.


hvgazula · 2023-04-23T15:18:43Z

sentence is identified using the Speaker column. Sadly, in podcast, there is only one speaker. Is there a way to identify the utterance?

VeritasJoker · 2023-04-23T16:25:26Z

I think an utterance is identified when the speaker switches, but we just store it inside the sentence column. I am asking whether we already have a way in the code to identify individual sentences. For podcast, there is only one utterance but a couple hundred sentences.
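To make the distinction concrete, here is a minimal sketch of the speaker-switch rule described above. It assumes the speaker labels are available as a plain list (for example, taken from the Speaker column); the function name `utterance_ids` is hypothetical and not something from the codebase.

```python
def utterance_ids(speakers):
    """Return a 1-based utterance index per token.

    The index increments every time the speaker label changes,
    so a single-speaker recording yields one utterance.
    """
    ids = []
    current = 0
    previous = object()  # sentinel that never equals a real speaker label
    for speaker in speakers:
        if speaker != previous:
            current += 1
            previous = speaker
        ids.append(current)
    return ids


# Hypothetical speaker labels, one per token.
print(utterance_ids(["A", "A", "B", "B", "A"]))
print(utterance_ids(["Speaker1"] * 4))
```

With a single speaker every token lands in utterance 1, which is why this rule cannot separate sentences in podcast.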

hvgazula · 2023-04-23T19:37:43Z

Okay, I repeat my question with utterance and sentence swapped. :)

VeritasJoker · 2023-04-24T04:00:51Z

lol, I'm guessing we don't have it in the codebase then. I have a hack from another project that flags the words that end a sentence. I basically check whether the last character of a word is one of "!", "?", or "."; alternatively, we can check whether a token is itself one of those characters.

This results in about 400 sentences for podcast, which seems reasonable. Do you think this would work?
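A rough sketch of that punctuation check, for illustration only. It assumes the words are available as a list of strings in order (for example, from the word column); `sentence_ids` and `group_sentences` are hypothetical names, not existing functions in the codebase. Note that a rule this simple also splits on abbreviations such as "Dr.".

```python
SENTENCE_END = "!?."  # characters treated as sentence-final


def sentence_ids(words):
    """Return a 1-based sentence index per word.

    A word closes the current sentence when its last character
    is one of SENTENCE_END; the next word starts a new sentence.
    """
    ids = []
    current = 1
    for word in words:
        ids.append(current)
        # Guard against empty strings before looking at the last character.
        if word and word[-1] in SENTENCE_END:
            current += 1
    return ids


def group_sentences(words):
    """Join words that share a sentence index into one string per sentence."""
    grouped = {}
    for idx, word in zip(sentence_ids(words), words):
        grouped.setdefault(idx, []).append(word)
    return [" ".join(tokens) for tokens in grouped.values()]


words = ["Hello", "world.", "How", "are", "you?", "Fine"]
print(sentence_ids(words))
print(group_sentences(words))
```

Trailing words with no closing punctuation still get their own sentence index here, so a transcript that is cut off at the end does not lose its last few words.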

hvgazula · 2023-04-24T10:43:20Z

df.word[i][-1] in "!?." or i == len(df.index) - 1

I understand the first condition but not the second one. Could you please clarify? Is it for the "last" sentence in the podcast?

VeritasJoker · 2023-04-24T10:47:53Z

Oh yeah, this is in case the last few words got cut off. I don't think we need it here.

VeritasJoker self-assigned this Apr 23, 2023
