
[Question] About offset in datasets/preprocess.py #38

Open
world1tree opened this issue Oct 6, 2024 · 0 comments
```python
import torch


def tokenizer_speech_token(prompt, tokenizer, speech_token_index=SPEECH_TOKEN_INDEX, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<speech>')]

    def insert_separator(X, sep):
        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

    input_ids = []
    offset = 0
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])
    for x in insert_separator(prompt_chunks, [speech_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])
    if return_tensors is not None:
        if return_tensors == 'pt':
            return torch.tensor(input_ids, dtype=torch.long)
        raise ValueError(f'Unsupported tensor type: {return_tensors}')
    return input_ids
```

Great work on this! But I'm curious about the code above.

From my understanding, the entire prompt is split by `<speech>`, and under the current setting there is only one `<speech>`, so `len(prompt_chunks) == 2`, right? `prompt_chunks[0]` starts with `<bos>`, while `prompt_chunks[1]` does not. What does `offset` do in this code? And why is the separator `[speech_token_index] * (offset + 1)`?
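Here is my own trace of the logic, using hand-written toy token ids instead of a real tokenizer (the ids `1`, `11`, `12`, `21` and the speech index `-200` are illustrative assumptions, not the repo's actual values). My reading: every chunk the tokenizer returns starts with a BOS id, so `offset = 1` slices the duplicate BOS off each chunk, and the separator is made `offset + 1` tokens long so that exactly one speech token survives the same `x[offset:]` slice:

```python
SPEECH_TOKEN_INDEX = -200  # toy value, for illustration only
BOS = 1                    # toy BOS id

# Suppose "hello world <speech> bye" tokenizes into two chunks,
# each with a leading BOS:
prompt_chunks = [[BOS, 11, 12], [BOS, 21]]

def insert_separator(X, sep):
    # Interleave a separator between consecutive chunks.
    return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

offset = 1                         # first chunk starts with BOS
input_ids = [prompt_chunks[0][0]]  # keep a single BOS at the front
for x in insert_separator(prompt_chunks, [SPEECH_TOKEN_INDEX] * (offset + 1)):
    # x[1:] drops the per-chunk BOS; for the separator it drops one of
    # the two speech tokens, leaving exactly one in the output.
    input_ids.extend(x[offset:])

print(input_ids)  # → [1, 11, 12, -200, 21]
```

So if I read it right, the `* (offset + 1)` exists purely so the separator survives the same slicing as the text chunks. Does that match the intended behavior?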
