Actual sentence and sentence_idx #158

Open
VeritasJoker opened this issue Apr 23, 2023 · 6 comments

@VeritasJoker
Contributor

@hvgazula

We are trying to use BERT to generate podcast embeddings and want to feed in the input one sentence at a time.

Do we have the actual sentence and sentence_idx in the datum? Right now, the sentence and sentence_idx columns in the datums refer to utterances. For podcast, we have sentence = None and sentence_idx = 1 for all tokens.

@VeritasJoker VeritasJoker self-assigned this Apr 23, 2023
@hvgazula
Collaborator

sentence is identified using the Speaker column. Sadly, in podcast, there is only one speaker. Is there a way to identify the utterance?

@VeritasJoker
Contributor Author

I think an utterance is identified when the speaker switches, but we just store it in the sentence column. I am asking whether we already have a way in the code to identify individual sentences. For podcast, there is only one utterance but a couple hundred sentences.
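To make the utterance/sentence distinction concrete, here is a minimal sketch of numbering utterances by speaker switches. It is not the codebase's implementation: the helper name `add_utterance_idx`, the column name `speaker`, and the output column `utterance_idx` are assumptions for illustration.

```python
import pandas as pd


def add_utterance_idx(df, speaker_col="speaker"):
    """Number utterances by counting speaker switches (hypothetical helper)."""
    out = df.copy()
    # A new utterance starts wherever the speaker differs from the previous row.
    # The first row compares against NaN, so it always starts utterance 1.
    switched = out[speaker_col] != out[speaker_col].shift()
    out["utterance_idx"] = switched.cumsum()
    return out


# Toy data (made up): three utterances across two speakers.
demo = pd.DataFrame(
    {"word": ["hi", "there", "hello", "bye"], "speaker": ["A", "A", "B", "A"]}
)
multi = add_utterance_idx(demo)

# With a single speaker, every token falls into the same utterance.
single = add_utterance_idx(demo.assign(speaker="A"))
```

With only one speaker, as in podcast, this kind of rule can never produce more than one group, which is why it cannot recover sentences.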

@hvgazula
Collaborator

Okay, I repeat my question with utterance and sentence swapped. :)

@VeritasJoker
Contributor Author

VeritasJoker commented Apr 24, 2023

lol I'm guessing we don't have it in the codebase then. I have this hack from another project that flags words that end a sentence. I basically check if the last character of a word is one of `!`, `?`, or `.`; alternatively, we can check if a token is itself one of `!`, `?`, or `.`.

This results in about 400 sentences for podcast, which seems reasonable. Do you think this would work?
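A minimal sketch of that punctuation check, written as a vectorized pandas step. The helper name `add_sentence_idx` and the toy data are assumptions; only the `word` column and the `!?.` rule come from the discussion.

```python
import pandas as pd

SENTENCE_END = "!?."


def add_sentence_idx(df, word_col="word"):
    """Assign a 1-based sentence_idx from terminal punctuation (hypothetical helper)."""
    out = df.copy()
    # True where the word's last character is sentence-ending punctuation.
    ends = out[word_col].astype(str).str[-1].isin(list(SENTENCE_END))
    # A word's sentence number is 1 plus the number of sentence ends before it.
    out["sentence_idx"] = ends.shift(fill_value=False).cumsum() + 1
    return out


# Toy data (made up): two complete sentences and a trailing fragment.
demo = pd.DataFrame({"word": ["Hello", "world.", "How", "are", "you?", "Fine"]})
result = add_sentence_idx(demo)
```

A rule like this also splits after abbreviations such as "Dr.", so the sentence count is only approximate.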

@hvgazula
Collaborator

`df.word[i][-1] in "!?." or i == len(df.index) - 1`: I understand the first condition but not the second one. Could you please clarify? Is it for the "last" sentence in the podcast?

@VeritasJoker
Contributor Author

VeritasJoker commented Apr 24, 2023

Oh yeah, this is in case the last few words got cut off. I don't think we need it here.
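For reference, a small sketch of what the last-row condition does when sentences are collected in a loop. The function name `split_sentences` and the `flush_last` flag are hypothetical; the two conditions mirror the ones quoted above.

```python
import pandas as pd


def split_sentences(df, word_col="word", flush_last=True):
    """Group words into sentence strings (hypothetical helper)."""
    sentences, current = [], []
    last = len(df.index) - 1
    for i, word in enumerate(df[word_col]):
        current.append(word)
        ends_sentence = word.endswith(("!", "?", "."))
        # The second condition closes a trailing fragment with no final punctuation.
        if ends_sentence or (flush_last and i == last):
            sentences.append(" ".join(current))
            current = []
    return sentences


# Toy data (made up): the transcript is cut off after "then".
demo = pd.DataFrame({"word": ["It", "works.", "And", "then"]})
with_flush = split_sentences(demo)
without_flush = split_sentences(demo, flush_last=False)
```

Without the last-row check, a fragment that never reaches terminal punctuation is silently dropped. If the transcript ends on a punctuated word, the check has no effect.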
