Actual sentence and sentence_idx #158
Comments
I think utterance is identified when a
Okay, I repeat my question with utterance and sentence swapped. :)
lol I'm guessing we don't have it in the codebase then. I have this hack from another project that recognizes words as the end of a sentence. I basically check if the last character in [...] This results in about 400 sentences for podcast, which seems reasonable. Do you think this would work?
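For reference, a minimal sketch of that kind of last-character check. The exact condition from the other project is cut off above, so the terminator set (`.`, `?`, `!`), the function name, and the handling of trailing quotes are all assumptions, not the actual hack:

```python
# Hypothetical helper: names and the terminator set are assumptions,
# not code taken from this repository.
SENTENCE_END = (".", "?", "!")


def assign_sentence_idx(words):
    """Return one sentence index per word, starting at 1.

    A word closes the current sentence when its last character,
    ignoring trailing quotes or brackets, is in SENTENCE_END.
    """
    indices = []
    current = 1
    for word in words:
        indices.append(current)
        stripped = word.rstrip("\"')]")
        if stripped.endswith(SENTENCE_END):
            current += 1  # the next word starts a new sentence
    return indices


words = ["So", "it", "began.", "Was", "it", "real?", "Maybe"]
print(assign_sentence_idx(words))  # [1, 1, 1, 2, 2, 2, 3]
```

A check like this will also split on abbreviations such as "Dr.", so the resulting sentence count is only approximate.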
Oh yeah, this is in case the last few words got cut off. I don't think we need it here.
@hvgazula
We are trying to use BERT to generate podcast embeddings and want to feed in per-sentence input.
Do we have the actual sentence and sentence_idx in the datum? Right now, the `sentence` and `sentence_idx` columns we have in the datums are for utterances. For podcast, we have `sentence = None` and `sentence_idx = 1` for all tokens.
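To make the request concrete, here is a rough sketch of how per-sentence input could be built once `sentence_idx` holds real sentence boundaries. The rows are shown as plain dicts, and the `word` key is a hypothetical column name, not necessarily what the datum uses:

```python
from itertools import groupby


def sentences_from_tokens(rows):
    """Join consecutive tokens that share a sentence_idx into one string each.

    `rows` is a list of dicts with a hypothetical "word" key and a
    "sentence_idx" key; rows are assumed to be in token order.
    """
    return [
        " ".join(row["word"] for row in group)
        for _, group in groupby(rows, key=lambda row: row["sentence_idx"])
    ]


rows = [
    {"word": "So", "sentence_idx": 1},
    {"word": "it", "sentence_idx": 1},
    {"word": "began.", "sentence_idx": 1},
    {"word": "Was", "sentence_idx": 2},
    {"word": "it", "sentence_idx": 2},
    {"word": "real?", "sentence_idx": 2},
]
print(sentences_from_tokens(rows))  # ['So it began.', 'Was it real?']
```

With the current podcast datum, where every token has `sentence_idx = 1`, this would return the whole podcast as a single string, which is the problem described above.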