Embedding Generation #93

Open
callmest opened this issue Nov 25, 2024 · 3 comments

Comments

@callmest

Can vLLM produce DNA sequence embeddings? Local PyTorch inference for generating sequence embeddings is too slow. Are there other methods for fast sequence embedding generation? (4,000,000 seqs * 257 tokens per seq, 147 hours)

@JiaLonghao1997

> Can vLLM produce DNA sequence embeddings? Local PyTorch inference for generating sequence embeddings is too slow. Are there other methods for fast sequence embedding generation? (4,000,000 seqs * 257 tokens per seq, 147 hours)

I had a similar issue where I needed to generate embeddings for 127,906 sequences ranging from 1-40kb, which took 168 hours. The following references provide strategies for speeding up model inference. If you are interested in training a smaller model through model distillation, we can discuss and work together.

@gergo-szabo

@callmest Can you provide a code snippet showing how you tried to extract the embeddings?

@JiaLonghao1997

> @callmest Can you provide a code snippet showing how you tried to extract the embeddings?

If you want to get embeddings, you can try the code from issue #32:

from evo import Evo
import torch

device = 'cuda:0'

evo_model = Evo('evo-1-131k-base')
model, tokenizer = evo_model.model, evo_model.tokenizer
model.to(device)
model.eval()

# monkey patch the unembed function with identity
# this removes the final projection back from the embedding space into tokens
# so the "logits" of the model is now the final layer embedding
# see source for unembed - https://huggingface.co/togethercomputer/evo-1-131k-base/blob/main/model.py#L339

from torch import nn

class CustomEmbedding(nn.Module):
    def unembed(self, u):
        return u

model.unembed = CustomEmbedding()

# end custom code

sequence = 'ACGT'
input_ids = torch.tensor(
    tokenizer.tokenize(sequence),
    dtype=torch.int,
).to(device).unsqueeze(0)

embed, _ = model(input_ids) # (batch, length, embed dim)

print('Embed: ', embed)
print('Shape (batch, length, embed dim): ', embed.shape)

# you can now use embedding for downstream classification tasks
# you probably want to aggregate over position dimension
# e.g. mean value = embed.mean(dim=1) or final token embedding = embed[:, -1, :]
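
For the throughput problem in the original question, a first thing to try before moving to vLLM or distillation is usually to batch sequences and disable gradient tracking. The snippet below is only a sketch, not from issue #32: it assumes the monkey-patched `model`, `tokenizer`, and `device` set up above, and that all sequences in a batch have the same length (e.g. 257 tokens), so no padding logic is needed. It mean-pools over the position dimension as suggested in the comments above.

import torch

# Sketch: batched embedding extraction with mean pooling.
# Assumes `model`, `tokenizer`, and `device` are defined as above (with the
# unembed monkey patch) and that every sequence in a batch has the same length.

def embed_batch(sequences, batch_size=64):
    pooled = []
    with torch.inference_mode():  # no gradients needed for embedding extraction
        for start in range(0, len(sequences), batch_size):
            batch = sequences[start:start + batch_size]
            input_ids = torch.tensor(
                [tokenizer.tokenize(seq) for seq in batch],
                dtype=torch.int,
            ).to(device)                                     # (batch, length)
            embed, _ = model(input_ids)                      # (batch, length, embed dim)
            pooled.append(embed.mean(dim=1).float().cpu())   # (batch, embed dim)
    return torch.cat(pooled, dim=0)

# Hypothetical usage for equal-length sequences, e.g. the 257-token case above:
# embeddings = embed_batch(my_sequences, batch_size=128)

Larger batch sizes amortize the per-call overhead, but the right value depends on GPU memory; if your sequences have mixed lengths, you would need to group them by length or add padding handling.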
