Accelerate Inference in TransformerLens #26

Merged: 13 commits into main on Jul 3, 2024

Conversation

@StarConnor (Collaborator) commented on Jun 29, 2024:

  1. Add a use_flash_attn option when loading a HookedTransformer model (see the usage sketch after this list).
  2. Add FlashAttentionV2 support in TransformerLens/transformer_lens/components/abstract_attention.py:

     if self.cfg.use_flash_attn:
         # Use FlashAttentionV2 to accelerate inference.
         # self.hook_attn_scores, self.hook_pattern and self.hook_z are not supported in this case.
         causal = True if self.cfg.attention_dir == "causal" else False
         if attention_mask is not None:
             # The batch contains at least one padding token, so unpad before calling the varlen kernel.
             batch_size, query_length, _ = q.shape
             query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
                 q, k, v, attention_mask, q.shape[1]
             )
             cu_seqlens_q, cu_seqlens_k = cu_seq_lens
             max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
             attn_output_unpad = flash_attn_varlen_func(
                 query_states,
                 key_states,
                 value_states,
                 cu_seqlens_q=cu_seqlens_q,
                 cu_seqlens_k=cu_seqlens_k,
                 max_seqlen_q=max_seqlen_in_batch_q,
                 max_seqlen_k=max_seqlen_in_batch_k,
                 causal=causal,
             )
             z = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
         else:
             # No padding in the batch: call the dense kernel directly.
             z = flash_attn_func(q, k, v, causal=causal)
  3. Add tests of flash attention correctness in TransformerLens/tests/integration/test_flash_attn.py. The pass criterion uses the following definitions:
     a  = activation(TransformerLens, with flash attention),  b  = activation(TransformerLens, without flash attention)
     a' = activation(HuggingFace, with flash attention),  b' = activation(HuggingFace, without flash attention)
     error_tl = max(|a - b|) and error_hf = max(|a' - b'|), computed for the attention, MLP, and residual-stream activations in every layer.
     The test passes if error_tl < 5 * error_hf. In practice error_tl is sometimes even smaller than error_hf, so a factor of 5 is not overly permissive.
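
A minimal usage sketch of the new option and of the activation comparison behind the test criterion. The model name, dtype, device, and passing use_flash_attn through HookedTransformer.from_pretrained are illustrative assumptions, not taken verbatim from this PR; flash attention also requires fp16/bf16 on a CUDA device.

    import torch
    from transformer_lens import HookedTransformer

    # Load the same model with and without FlashAttentionV2.
    # ("gpt2", dtype and device are placeholders, not part of this PR.)
    model_fa = HookedTransformer.from_pretrained(
        "gpt2", use_flash_attn=True, dtype=torch.bfloat16, device="cuda"
    )
    model_ref = HookedTransformer.from_pretrained(
        "gpt2", dtype=torch.bfloat16, device="cuda"
    )

    tokens = model_ref.to_tokens("Flash attention should not change the activations much.")
    _, cache_fa = model_fa.run_with_cache(tokens)
    _, cache_ref = model_ref.run_with_cache(tokens)

    # Mirror the test's error_tl = max(|a - b|) criterion on the residual stream.
    # (hook_attn_scores, hook_pattern and hook_z are unavailable under flash attention.)
    for layer in range(model_ref.cfg.n_layers):
        name = f"blocks.{layer}.hook_resid_post"
        err = (cache_fa[name] - cache_ref[name]).abs().max().item()
        print(f"layer {layer}: max |a - b| = {err}")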

@StarConnor linked an issue on Jun 29, 2024 that may be closed by this pull request.
@StarConnor requested a review from dest1n1s on Jun 29, 2024, 08:30.
pyproject.toml Outdated
implicit_optional=true

[build-system]

Please explain why these requirements are necessary.

@@ -195,45 +219,72 @@ def forward(
self.apply_rotary(k, 0, attention_mask)
) # keys are cached so no offset

- if self.cfg.dtype not in [torch.float32, torch.float64]:
+ if self.cfg.dtype not in [torch.float32, torch.float64] and self.cfg.dtype != torch.bfloat16:

Please explain why torch.bfloat16 is excluded. Besides, torch.bfloat16 could simply be added to the exclusion list.
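
For reference, the suggested consolidation would collapse the two checks into a single membership test, equivalent to the condition in the diff above (a sketch only; the branch body is elided):

    if self.cfg.dtype not in [torch.float32, torch.float64, torch.bfloat16]:
        ...  # existing reduced-precision handling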

@@ -656,3 +707,41 @@ def create_alibi_bias(
alibi_bias = torch.einsum("ij,k->kij", slope, multipliers)

return alibi_bias

def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):

Add necessary type hints and comments to this function.
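
One possible shape for the requested annotations, assuming the tensors follow the usual flash-attention [batch, seq, n_heads, d_head] layout; the shape comments and return type are a hypothetical sketch inferred from the call in the PR description, not the PR's final code.

    from typing import Tuple

    import torch

    def _upad_input(
        self,
        query_layer: torch.Tensor,     # [batch, query_len, n_heads, d_head]
        key_layer: torch.Tensor,       # [batch, kv_len, n_heads, d_head]
        value_layer: torch.Tensor,     # [batch, kv_len, n_heads, d_head]
        attention_mask: torch.Tensor,  # [batch, kv_len]; 1 = real token, 0 = padding
        query_length: int,
    ) -> Tuple[
        torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor,
        Tuple[torch.Tensor, torch.Tensor], Tuple[int, int],
    ]:
        """Strip padding so flash_attn_varlen_func can run on packed sequences.

        Returns the unpadded query/key/value tensors, the flattened query indices,
        the cumulative sequence lengths (cu_seqlens_q, cu_seqlens_k), and the
        per-batch maximum sequence lengths (max_seqlen_q, max_seqlen_k).
        """
        ...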


Is this file testing whether configs can be successfully created? If so, it seems better to create several hard-coded configs rather than depending on command-line arguments, for the sake of automated testing.


The filename should be in snake_case. Besides, personal paths such as /remote-home/share/models/llama3_hf/Meta-Llama-3-8B should not be included.


Remove debug code and personal configs from this file. This test also seems too bloated; can it be broken into several fine-grained unit tests?

Besides, I think tests of HookedTransformer should be put inside the TransformerLens module since we may push these enhancements upstream in the future.

@StarConnor changed the title from "11 proposal accelerate inference in transformerlens" to "Accelerate Inference in TransformerLens" on Jul 2, 2024
@StarConnor requested a review from dest1n1s on Jul 2, 2024, 14:08
import pytest

HOOK_SUFFIX={"mlp":"hook_mlp_out", "self_attn":"hook_attn_out", "resid":"hook_resid_post"}
model_name = 'meta-llama/Meta-Llama-3-8B'

Using LLaMA for automated testing is impractical since its model weights require authorization. Besides, we don't really need pre-trained weights to validate the correctness of flash attention. Consider switching to a toy transformer model with randomly initialized weights.
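
A sketch of the kind of toy model this suggests, built from TransformerLens's HookedTransformerConfig with randomly initialized weights. The hyperparameter values are arbitrary, and passing use_flash_attn through the config is an assumption about how this PR wires the option.

    import torch
    from transformer_lens import HookedTransformer, HookedTransformerConfig

    # Tiny randomly initialized transformer: no gated pre-trained weights needed.
    cfg = HookedTransformerConfig(
        n_layers=2,
        d_model=64,
        n_ctx=128,
        d_head=16,
        n_heads=4,
        d_vocab=100,
        act_fn="gelu",
        dtype=torch.bfloat16,  # flash attention only supports fp16/bf16
        use_flash_attn=True,   # option added by this PR; exact plumbing may differ
    )
    model = HookedTransformer(cfg)

The same random state dict can then be loaded into a second instance created with use_flash_attn=False to compare activations with and without flash attention.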


HOOK_SUFFIX={"mlp":"hook_mlp_out", "self_attn":"hook_attn_out", "resid":"hook_resid_post"}
model_name = 'meta-llama/Meta-Llama-3-8B'
model_path = 'path/to/model'

Tests should be able to run automatically in environments other than your own machine, so device-specific personal paths and placeholders waiting for users to fill in are both unacceptable.


@pytest.fixture
def prepare_config():
cfg = LanguageModelConfig.from_flattened(dict(

Why is LanguageModelConfig required to test HookedTransformer?

test_input_list = []
for _ in range(10):
text = ''.join(next(iter(dataloader))['text'])
idx = random.randrange(0, len(text)-64)

Formatting issue: operators should be wrapped with spaces.


delta_max_fa_no = torch.abs(fa_value.cpu() - no_value.cpu()).max().item()
delta_max_hf_fa_no = torch.abs(hf_fa_value.cpu() - hf_no_value).max().item()
logging.warning(f"L{layer}{abbr}\ttl:{delta_max_fa_no}\thf:{delta_max_hf_fa_no}")

This log statement seems to run unconditionally. Why is it emitted at the warning level?

'llama3-instruct':'meta-llama/Meta-Llama-3-8B-Instruct',
}
MODEL_PATHS = {
'gpt2':'path/to/gpt2',

Same issue as in test_flash_attn.py: do not use a real-world model for testing.


Flash attention is a property of HookedTransformer only. Consider moving this into the TransformerLens module.

d_model = 4096

@pytest.fixture
def dataset():

Real-world datasets are also unnecessary for testing. Some curated token input should be enough.
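
For instance, a small hand-written batch of token ids with explicit padding (the values below are arbitrary) is enough to exercise both the padded and unpadded flash-attention paths:

    import torch

    # Two short sequences; the padding in the first row triggers the varlen (unpad) path.
    tokens = torch.tensor([
        [1, 15, 27,  3,  0,  0],
        [1,  8,  2,  9, 12,  4],
    ])
    attention_mask = (tokens != 0).long()  # 1 = real token, 0 = padding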

@StarConnor requested a review from dest1n1s on Jul 3, 2024, 09:09
@StarConnor (Collaborator, Author) commented:

Moved it to TransformerLens/tests/integration/test_flash_attn.py and tested with a toy attention model.

@dest1n1s merged commit 0e3d268 into main on Jul 3, 2024
1 check passed
@dest1n1s deleted the 11-proposal-accelerate-inference-in-transformerlens branch on Jul 3, 2024, 10:23
Successfully merging this pull request may close these issues: [Proposal] Accelerate Inference in TransformerLens