
[Question] About offset in datasets/preprocess.py #38

Open
world1tree opened this issue Oct 6, 2024 · 0 comments
```python
import torch


def tokenizer_speech_token(prompt, tokenizer, speech_token_index=SPEECH_TOKEN_INDEX, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<speech>')]

    def insert_separator(X, sep):
        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

    input_ids = []
    offset = 0
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])
    for x in insert_separator(prompt_chunks, [speech_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])
    if return_tensors is not None:
        if return_tensors == 'pt':
            return torch.tensor(input_ids, dtype=torch.long)
        raise ValueError(f'Unsupported tensor type: {return_tensors}')
    return input_ids
```

Great work on this! But I'm curious about the code above.

From my understanding, the entire prompt is split by `<speech>`, and under the current setting there is only one `<speech>`, so `len(prompt_chunks) == 2`, right? `prompt_chunks[0]` starts with `<bos>`, while `prompt_chunks[1]` does not. What does `offset` do in this code? And why is the separator `[speech_token_index] * (offset + 1)`?
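Here is my own trace of the logic, using hand-written toy token ids instead of a real tokenizer (the ids `1`, `11`, `12`, `21` and the speech index `-200` are illustrative assumptions, not the repo's actual values). My reading: every chunk the tokenizer returns starts with a BOS id, so `offset = 1` slices the duplicate BOS off each chunk, and the separator is made `offset + 1` tokens long so that exactly one speech token survives the same `x[offset:]` slice:

```python
SPEECH_TOKEN_INDEX = -200  # toy value, for illustration only
BOS = 1                    # toy BOS id

# Suppose "hello world <speech> bye" tokenizes into two chunks,
# each with a leading BOS:
prompt_chunks = [[BOS, 11, 12], [BOS, 21]]

def insert_separator(X, sep):
    # Interleave a separator between consecutive chunks.
    return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

offset = 1                         # first chunk starts with BOS
input_ids = [prompt_chunks[0][0]]  # keep a single BOS at the front
for x in insert_separator(prompt_chunks, [SPEECH_TOKEN_INDEX] * (offset + 1)):
    # x[1:] drops the per-chunk BOS; for the separator it drops one of
    # the two speech tokens, leaving exactly one in the output.
    input_ids.extend(x[offset:])

print(input_ids)  # → [1, 11, 12, -200, 21]
```

So if I read it right, the `* (offset + 1)` exists purely so the separator survives the same slicing as the text chunks. Does that match the intended behavior?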
