
Quantize megablox #1062

Open · wants to merge 43 commits into main
Conversation

@lenscloth (Collaborator) commented Nov 25, 2024

Description

Support quantization on megablox

  • The current implementation can accelerate training but not serving; serving acceleration will be done in another PR.
  • Tested correctness after applying quantization on megablox.

Tests

End to end tests on mixtral-8x22b model with max engine.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@RissyRan (Collaborator)

We shouldn't add megablox as a copy from the JAX repo into MaxText. We are good to work in a dev branch, but in the long term we need to merge the related quantization changes into the JAX repo so MaxText can import it directly.

By accepting pre-quantized weights, quantized gmm does not need to
quantize the weights on every iteration.
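
A minimal sketch of the idea behind this commit, assuming a simple symmetric per-channel int8 scheme (the helper names below are hypothetical, not the actual MaxText/megablox API): the weights are quantized once outside the step, and the matmul consumes the int8 values plus a scale on every iteration.

```python
import jax.numpy as jnp

def quantize_int8(w):
  # Per-output-channel symmetric scale so that max |w| maps to 127.
  scale = jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127.0
  q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
  return q, scale

def matmul_with_prequantized_rhs(lhs, q_rhs, rhs_scale):
  # Dequantize on the fly; a real kernel would accumulate in integers
  # and apply the scale to the accumulator instead.
  return lhs @ (q_rhs.astype(lhs.dtype) * rhs_scale)

w = jnp.ones((16, 8), jnp.float32)
q_w, w_scale = quantize_int8(w)                     # done once, ahead of time
x = jnp.ones((4, 16), jnp.float32)
y = matmul_with_prequantized_rhs(x, q_w, w_scale)   # reused every iteration
```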
@RissyRan (Collaborator) commented Dec 4, 2024

We shouldn't add megablox as a copy from the JAX repo into MaxText. We are good to work in a dev branch, but in the long term we need to merge the related quantization changes into the JAX repo so MaxText can import it directly.

Discussed with the JAX team, and they prefer that we fork the megablox kernel to avoid a dependency issue with using a standalone install of JAX. So we will have a "copy" of the gmm implementation in MaxText. Thanks Wonpyo!

Megablox now supports:
- int8 quantization
- int8w quantization
- int4w quantization
@lenscloth (Collaborator, Author)

@RissyRan This branch now supports setting different precision for lhs and rhs. It now supports the following (see the sketch below):

  • Int8
  • Int8w
  • Int4w
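
A hedged sketch of the distinction between these modes, assuming the usual convention that the "w" suffix means weight-only (hypothetical helpers, not the actual MaxText config or the megablox kernel): int8 quantizes both operands, while int8w/int4w quantize only the rhs (weights) and leave the lhs (activations) in the original dtype.

```python
import jax.numpy as jnp

def quantize(x, bits, axis):
  max_q = 2 ** (bits - 1) - 1                       # 127 for int8, 7 for int4
  scale = jnp.max(jnp.abs(x), axis=axis, keepdims=True) / max_q
  q = jnp.clip(jnp.round(x / scale), -max_q, max_q).astype(jnp.int8)
  return q, scale

x = jnp.ones((4, 16), jnp.float32)                  # lhs (activations)
w = jnp.ones((16, 8), jnp.float32)                  # rhs (weights)

# int8: quantize both lhs (per row) and rhs (per column).
qx, sx = quantize(x, bits=8, axis=1)
qw, sw = quantize(w, bits=8, axis=0)
y_int8 = (qx.astype(jnp.float32) @ qw.astype(jnp.float32)) * sx * sw

# int4w: weight-only; the lhs stays in its original dtype.
qw4, sw4 = quantize(w, bits=4, axis=0)
y_int4w = x @ (qw4.astype(jnp.float32) * sw4)
```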

@RissyRan (Collaborator) left a comment

Thank you Wonpyo!

Created a diff gmm file for future reference.

MaxText/kernels/megablox/common.py (outdated, resolved)
MaxText/kernels/megablox/__init__.py (resolved)
MaxText/kernels/megablox/gmm.py (outdated, resolved)
MaxText/layers/linears.py (outdated, resolved)
@@ -389,7 +395,14 @@ def unpermute(self, intermediate, sorted_selected_experts, weights):
reshaped_weights.astype(jnp.float32),
precision=matmul_precision,
)
return output.reshape(-1, self.config.max_target_length, self.config.emb_dim // tensor_parallelism).astype(self.dtype)
updated_batch = int(self.config.per_device_batch_size * jax.device_count() // self.config.ici_fsdp_parallelism)
Collaborator:

An assertion may be needed if the value is indivisible cc @mailvijayasingh @ZhiyuLi-goog
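
A minimal sketch of the suggested assertion, reusing the names from the diff above (the exact placement in linears.py and the error message are assumptions):

```python
# Hypothetical guard before computing updated_batch: fail fast instead of
# silently truncating the batch dimension when the division is inexact.
global_batch = self.config.per_device_batch_size * jax.device_count()
assert global_batch % self.config.ici_fsdp_parallelism == 0, (
    f"global batch {global_batch} is not divisible by "
    f"ici_fsdp_parallelism={self.config.ici_fsdp_parallelism}"
)
updated_batch = int(global_batch // self.config.ici_fsdp_parallelism)
```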

MaxText/layers/linears.py (outdated, resolved)
MaxText/layers/linears.py (outdated, resolved)
MaxText/kernels/megablox/gmm.py (outdated, resolved)
@RissyRan (Collaborator) left a comment

Others LGTM! @ZhiyuLi-goog could you also help take a look once you are back?

@lenscloth (Collaborator, Author)

Resolved some of the comments. @RissyRan, would you review one more time?

@ZhiyuLi-goog (Collaborator)

Thank you @lenscloth for the awesome PR. LGTM!

MaxText/layers/linears.py (outdated, resolved)
MaxText/layers/linears.py (outdated, resolved)
@RissyRan (Collaborator) left a comment

Thank you! For the code style check, you could work around it using this.

MaxText/kernels/megablox/gmm.py (outdated, resolved)
@RissyRan (Collaborator) left a comment

lgtm!

@gobbleturk (Collaborator) left a comment

Looks great! My only request is to get the sizes based on the inputs instead of the config (ideally we can remove the original reference to tensor_parallelism as well).

MaxText/kernels/__init__.py (outdated, resolved)
@@ -404,7 +403,14 @@ def unpermute(self, intermediate, sorted_selected_experts, weights):
reshaped_weights.astype(jnp.float32),
precision=matmul_precision,
)
return output.reshape(-1, self.config.max_target_length, self.config.emb_dim // tensor_parallelism).astype(self.dtype)
updated_batch = int(self.config.per_device_batch_size * jax.device_count() // self.config.ici_fsdp_parallelism)
Collaborator:

Does this cover all cases? What if we use data_parallelism instead of FSDP (or a mix of them), or even fsdp_transpose?

Collaborator:

Can we get the relevant sizes from the input shape instead of based on the config?

Collaborator (Author):

@ZhiyuLi-goog I made a change there. Would you take a look at this?

Collaborator:

Thank you for the comment, @gobbleturk.

Does this cover all cases? What if we use data_parallelism instead of FSDP (or a mix of them), or even fsdp_transpose?

Updated to cover data and fsdp (or a mix of them). Not yet fsdp_transpose, since it is simply not used in MoE models. (See the sketch at the end of this comment.)

Can we get the relevant sizes from the input shape instead of based on the config?

It does not seem easy to derive the relevant sizes in a way that covers both the prefill and decoding cases, since the prefill length can vary. This trick has worked pretty well in our inference experiments.

@RissyRan @mailvijayasingh feel free to chime in if you have new ideas.
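
A rough sketch of what covering data and fsdp (or a mix of them) could look like, dividing the global batch by the product of the batch-sharding axes; the config field names and the exact formula used in the PR are assumptions:

```python
# Hypothetical: the batch dimension is sharded over both the data and fsdp
# ICI axes, so the per-shard batch divides out their product.
batch_shards = self.config.ici_data_parallelism * self.config.ici_fsdp_parallelism
global_batch = self.config.per_device_batch_size * jax.device_count()
assert global_batch % batch_shards == 0, "global batch must shard evenly"
updated_batch = int(global_batch // batch_shards)
```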

Collaborator:

Can we get the relevant sizes from the input shape instead of based on the config?

Yeah, this is the difference between training and inference. The batch size will be different in the prefill and generate stages (even if we specify per_device_batch_size as a fixed value). In megablox, we need extra permute and unpermute operations, and those shapes need to be calculated manually to map back to the right shape.

@gobbleturk what's your suggestion on this? Basically we need an indicator of whether the current inference stage is prefill or generate, and then reshape to the right shape based on the different batch size.
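
For context, a simplified illustration of the permute/unpermute bookkeeping being discussed (not the actual MaxText MoE code, which also tracks routing weights and group sizes for the grouped matmul):

```python
import jax.numpy as jnp

# Tokens flattened to (batch * seq, emb) and their assigned experts.
tokens = jnp.arange(12, dtype=jnp.float32).reshape(6, 2)
expert_ids = jnp.array([1, 0, 1, 2, 0, 2])

order = jnp.argsort(expert_ids)      # permute: group tokens by expert
grouped = tokens[order]              # this is what the grouped matmul (gmm) sees
# ... gmm over the grouped tokens would run here ...
output = jnp.zeros_like(grouped).at[order].set(grouped)   # unpermute
# The final reshape back to (batch, seq, emb) is where the batch size for
# prefill vs. generate has to be known, which is what this thread is about.
```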

Collaborator (Author):

I have updated this to infer the output shape from the input tensor.
But I am not fully familiar with this code, and I am afraid I might have made a mistake in the implementation here.

@ZhiyuLi-goog Can you take a look at this?
Also @gobbleturk Can you review this branch? It would be great to merge this branch before the holidays.

cc. @RissyRan

Collaborator:

I was thinking something like sharded_batch_shape = inputs.shape[0]. I think prefill and generate are two separate jits (unsure), so this sharded_batch_shape just takes exactly the value it needs.
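
A rough sketch of that suggestion (assumed variable names, not the PR's final code):

```python
# Hypothetical: prefill and generate are traced separately, so reading the
# sharded batch and sequence length off the input gives each trace exactly
# the value it needs without consulting the config.
batch, seq_len = inputs.shape[0], inputs.shape[1]
output = output.reshape(batch, seq_len, -1).astype(self.dtype)
```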

@gobbleturk (Collaborator) left a comment

LGTM, Thanks Wonpyo!

lenscloth and others added 22 commits December 14, 2024 01:09
…ing them from how process indexes changed across restarts with some false assumptions.

PiperOrigin-RevId: 700737164
This mode is the same as nightly except that after nightly is installed, any file in `maxtext/*.whl` is forcefully reinstalled.