Embedding Generation #93

Open
callmest opened this issue Nov 25, 2024 · 3 comments

Comments

@callmest

Can vLLM produce DNA sequence embeddings? Local PyTorch inference for generating sequence embeddings is too slow. Are there other methods for fast sequence embedding generation? (4,000,000 seqs * 257 tokens per seq, 147 hours)

@JiaLonghao1997

> Can vLLM produce DNA sequence embeddings? Local PyTorch inference for generating sequence embeddings is too slow. Are there other methods for fast sequence embedding generation? (4,000,000 seqs * 257 tokens per seq, 147 hours)

I had a similar issue where I needed to generate embeddings for 127,906 sequences ranging from 1-40kb, which took 168 hours. The following references provide strategies for speeding up model inference. If you are interested in training a smaller model through model distillation, we can discuss and work together.

@gergo-szabo

@callmest Can you provide a code snippet showing how you tried to extract the embeddings?

@JiaLonghao1997

> @callmest Can you provide a code snippet showing how you tried to extract the embeddings?

If you want to get embeddings, you can try the code from issue #32:

from evo import Evo
import torch

device = 'cuda:0'

evo_model = Evo('evo-1-131k-base')
model, tokenizer = evo_model.model, evo_model.tokenizer
model.to(device)
model.eval()

# monkey patch the unembed function with identity
# this removes the final projection back from the embedding space into tokens
# so the "logits" of the model is now the final layer embedding
# see source for unembed - https://huggingface.co/togethercomputer/evo-1-131k-base/blob/main/model.py#L339

from torch import nn

class CustomEmbedding(nn.Module):
    def unembed(self, u):
        return u

model.unembed = CustomEmbedding()

# end custom code

sequence = 'ACGT'
input_ids = torch.tensor(
    tokenizer.tokenize(sequence),
    dtype=torch.int,
).to(device).unsqueeze(0)

embed, _ = model(input_ids) # (batch, length, embed dim)

print('Embed: ', embed)
print('Shape (batch, length, embed dim): ', embed.shape)

# you can now use embedding for downstream classification tasks
# you probably want to aggregate over position dimension
# e.g. mean value = embed.mean(dim=1) or final token embedding = embed[:, -1, :]
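
For the throughput problem in the original question, a first thing to try before moving to vLLM or distillation is usually to batch sequences and disable gradient tracking. The snippet below is only a sketch, not from issue #32: it assumes the monkey-patched `model`, `tokenizer`, and `device` set up above, and that all sequences in a batch have the same length (e.g. 257 tokens), so no padding logic is needed. It mean-pools over the position dimension as suggested in the comments above.

import torch

# Sketch: batched embedding extraction with mean pooling.
# Assumes `model`, `tokenizer`, and `device` are defined as above (with the
# unembed monkey patch) and that every sequence in a batch has the same length.

def embed_batch(sequences, batch_size=64):
    pooled = []
    with torch.inference_mode():  # no gradients needed for embedding extraction
        for start in range(0, len(sequences), batch_size):
            batch = sequences[start:start + batch_size]
            input_ids = torch.tensor(
                [tokenizer.tokenize(seq) for seq in batch],
                dtype=torch.int,
            ).to(device)                                     # (batch, length)
            embed, _ = model(input_ids)                      # (batch, length, embed dim)
            pooled.append(embed.mean(dim=1).float().cpu())   # (batch, embed dim)
    return torch.cat(pooled, dim=0)

# Hypothetical usage for equal-length sequences, e.g. the 257-token case above:
# embeddings = embed_batch(my_sequences, batch_size=128)

Larger batch sizes amortize the per-call overhead, but the right value depends on GPU memory; if your sequences have mixed lengths, you would need to group them by length or add padding handling.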
