Embedding Generation #93
I had a similar issue: I needed to generate embeddings for 127,906 sequences ranging from 1–40 kb, which took 168 hours. The following references provide strategies for speeding up model inference. If you are interested in training a smaller model through distillation, we can discuss it and work together.
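As a rough illustration of the usual speedups (batching, disabling autograd, half precision), here is a minimal sketch. It reuses the Evo loading code shown later in this thread; the `batch_size`, the pad id of 0, and whether autocast plays well with this particular model are all assumptions for illustration, not values from the original post.

```python
import torch
from evo import Evo

device = 'cuda:0'

evo_model = Evo('evo-1-131k-base')
model, tokenizer = evo_model.model, evo_model.tokenizer
model.to(device)
model.eval()

sequences = ['ACGT', 'ACGGTA', 'GGTAC']  # toy inputs
batch_size = 2  # illustrative; tune to your GPU memory

outputs = []
# inference_mode disables autograd bookkeeping; autocast runs in fp16
# (assumption: the model tolerates mixed precision).
with torch.inference_mode(), torch.autocast('cuda', dtype=torch.float16):
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        token_lists = [list(tokenizer.tokenize(s)) for s in batch]
        # Right-pad to the longest sequence in the batch (pad id 0 is an assumption).
        max_len = max(len(t) for t in token_lists)
        input_ids = torch.tensor(
            [t + [0] * (max_len - len(t)) for t in token_lists],
            dtype=torch.int,
        ).to(device)
        # With the unembed patch from the snippet below applied,
        # these "logits" are the final-layer embeddings.
        logits, _ = model(input_ids)
        outputs.append(logits.float().cpu())
```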
@callmest Can you provide a code snippet showing how you tried to extract the embedding?
If you want to get embeddings, you can try the code from issue #32:

```python
from evo import Evo
import torch
from torch import nn

device = 'cuda:0'

evo_model = Evo('evo-1-131k-base')
model, tokenizer = evo_model.model, evo_model.tokenizer
model.to(device)
model.eval()

# Monkey-patch the unembed function with the identity.
# This removes the final projection from the embedding space back into tokens,
# so the "logits" returned by the model are now the final-layer embeddings.
# See the source for unembed:
# https://huggingface.co/togethercomputer/evo-1-131k-base/blob/main/model.py#L339
class CustomEmbedding(nn.Module):
    def unembed(self, u):
        return u

model.unembed = CustomEmbedding()
# End custom code.

sequence = 'ACGT'
input_ids = torch.tensor(
    tokenizer.tokenize(sequence),
    dtype=torch.int,
).to(device).unsqueeze(0)

embed, _ = model(input_ids)  # (batch, length, embed dim)

print('Embed: ', embed)
print('Shape (batch, length, embed dim): ', embed.shape)

# You can now use the embedding for downstream classification tasks.
# You probably want to aggregate over the position dimension,
# e.g. mean pooling: embed.mean(dim=1), or the final-token embedding: embed[:, -1, :]
```
Can vLLM produce DNA sequence embeddings? Local PyTorch inference for generating sequence embeddings is too slow. Or are there other methods for fast sequence-embedding generation? (4,000,000 seqs × 257 tokens per seq, 147 hours)