Speed up via xformers #120
Hi Jamie, thanks for reaching out! I wanted to try this before answering, but it obviously took me way too long. I already gave it a shot a few weeks ago but failed to reach any significant speed-up; maybe I did something wrong (I used it for translation with the new ProstT5 model). Have you had positive experiences with this on any protein language models?
Sorry to hear that. I haven't tried this out yet for protein LLMs (only tested it on stable-diffusion), but it is on my radar. I'm hoping it could be useful for inference and speed up the embedding calculations (which we're finding to be a bottleneck for protein annotation).
Hm, how many proteins are you trying to label? In my experience, the ProtT5-XL-U50 encoder-only model in half-precision, with batching as described here, reaches around 0.1 s/protein on average for the ~20k proteins of the human proteome (so around 30 minutes for human).
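For readers who haven't seen it, here is a minimal sketch of the kind of batched, half-precision embedding extraction described above. It assumes the encoder-only half-precision ProtT5-XL-U50 checkpoint on the Hugging Face hub (`Rostlab/prot_t5_xl_half_uniref50-enc`) and the usual ProtT5 preprocessing (space-separated residues, rare amino acids mapped to X); the write-up linked above may differ in details.

```python
import re

import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 only makes sense on GPU
ckpt = "Rostlab/prot_t5_xl_half_uniref50-enc"  # assumed encoder-only half-precision checkpoint

tokenizer = T5Tokenizer.from_pretrained(ckpt, do_lower_case=False)
model = T5EncoderModel.from_pretrained(ckpt, torch_dtype=dtype).to(device).eval()

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSILVTRPSPAGEEL"]  # toy examples
# ProtT5 expects residues separated by spaces, with rare amino acids mapped to X
prepped = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in sequences]

batch = tokenizer(prepped, padding="longest", return_tensors="pt").to(device)
with torch.no_grad():
    out = model(input_ids=batch.input_ids, attention_mask=batch.attention_mask)

# Mean-pool over residues (masking out padding) -> one 1024-dim vector per protein
mask = batch.attention_mask.unsqueeze(-1).to(out.last_hidden_state.dtype)
per_protein = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(per_protein.shape)  # (2, 1024)
```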
I had a brief look and I stopped once I hit the following error: `AttributeError: 'FeatureExtractionPipeline' object has no attribute 'enable_xformers_memory_efficient_attention'` (I tried to extract embeddings from the ProtT5-XL-U50-fp16 model linked in my post above). So I'm not sure whether it is as easily plug-and-play as I had hoped. In case you find some example/tutorial that shows how this should be done for plain Transformers (no diffusion etc.), please send it my way and I'll give it a try. So far I have only found tutorials on how to use this with diffusion models in Hugging Face (but most likely I just missed the right source).
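(For context: `enable_xformers_memory_efficient_attention()` is, as far as I know, a method that diffusers adds to its pipelines and models, which is why the plain transformers `FeatureExtractionPipeline` doesn't have it. The primitive that xformers itself exposes is `xformers.ops.memory_efficient_attention`; a minimal sketch with illustrative shapes is below. Actually getting a T5 encoder to use it would still mean patching the model's attention modules, which is why it isn't plug-and-play here.)

```python
import torch
import xformers.ops as xops

# memory_efficient_attention expects (batch, seq_len, num_heads, head_dim) tensors
# on a CUDA device; the shapes and dtype below are illustrative only
q = torch.randn(2, 512, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 512, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 512, 16, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)  # same shape as q
print(out.shape)  # torch.Size([2, 512, 16, 64])
```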
Regarding examples, I first saw xformers being used in https://github.com/Stability-AI/stablediffusion, so yes, I have only seen it used in diffusion models.
We were trying to embed all of UniRef at one point, but had to resort to just a subset. We're trying to embed proteins in microbial metagenomes, and those reference databases are often >50M proteins.
Yeah, I see your point. We also ran UniRef50 at one point but only to make predictions, not for embedding extraction (esp. as storing those embeddings becomes expensive quickly).
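As a rough back-of-the-envelope for why storage gets expensive, assuming one 1024-dimensional vector per protein (the ProtT5-XL hidden size after mean-pooling) at the ~50M-protein scale mentioned above:

```python
# Back-of-the-envelope storage estimate for per-protein embeddings (assumed 1024-dim)
n_proteins = 50_000_000   # ">50M proteins" in metagenomic reference sets, per the thread
dim = 1024                # ProtT5-XL hidden size

gb_fp16 = n_proteins * dim * 2 / 1e9   # ~102 GB at 2 bytes per value
gb_fp32 = n_proteins * dim * 4 / 1e9   # ~205 GB at 4 bytes per value
print(f"fp16: {gb_fp16:.0f} GB, fp32: {gb_fp32:.0f} GB")
# Per-residue embeddings (assuming ~300 residues per protein on average) would be ~300x larger.
```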
Just in case you weren't familiar with it, there is an xformers library that can allow for a >4x speed-up on all transformer operations:
https://github.com/facebookresearch/xformers
Could be low-hanging fruit to speed up the operations in this library :)