
Grok-1 optimization #164

Closed
wants to merge 24 commits into from

Conversation

@sogalin commented Sep 3, 2024

This PR adds GROK-FP8 support to vLLM.

wunhuang and others added 24 commits September 3, 2024 07:18
…iling code and fix the gemm shape could not be dumped correctly in multiple-gpu
…fused_moe accuracy check test file 4) sync optimization from branch MLPerf-4.1
* First version

* Revert error.

While there, add missing finalize.

* Use the correct defaults for ROCm.

Increase sampling area to capture crossover.

* Scope end_sync as well.

* Guard only volatile keyword for ifndef USE_ROCM

* Document crossover
* remove scoping
* while there fix a typo
* while there remove unused variable


Would you mind renaming this file to not use a '+' symbol?

@@ -328,7 +332,7 @@ if (VLLM_PUNICA_GPU_ARCHES)
       DESTINATION vllm
       LANGUAGE ${VLLM_GPU_LANG}
       SOURCES ${VLLM_PUNICA_EXT_SRC}
-      COMPILE_FLAGS ${VLLM_PUNICA_GPU_FLAGS}
+      COMPILE_FLAGS ${VLLM_PUNICA_GPU_eLAGS}


typo?

#triton_col_output = fused_moe_col_major(a, w1, w2, score, topk, renormalize=False)
#print(f"triton_col_output: {triton_col_output}")

assert torch.allclose(triton_output, torch_output, atol=1e-2, rtol=0)

Please use torch.testing.assert_close
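For reference, a minimal sketch of the suggested change (the tensors here are stand-ins; in the test, triton_output and torch_output come from the Triton fused_moe kernel and the torch reference implementation):

import torch

torch.manual_seed(0)

# Stand-in tensors; in the actual test they come from fused_moe and the
# torch reference implementation.
triton_output = torch.randn(4, 8)
torch_output = triton_output + 1e-3 * torch.randn(4, 8)

# assert_close raises an AssertionError with a detailed mismatch report
# (count and magnitude of mismatched elements), unlike a bare assert
# around torch.allclose. Tolerances mirror the ones already in the test.
torch.testing.assert_close(triton_output, torch_output, atol=1e-2, rtol=0)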

@@ -293,6 +293,8 @@ __device__ void paged_attention_kernel(
   // This includes a reduction across the threads in the same thread group.
   float qk = scale * Qk_dot<scalar_t, THREAD_GROUP_SIZE>::dot(
       q_vecs[thread_group_offset], k_vecs);
+  float max_attn_val = 30.0; //hardcoded for grok

Does this affect any other model? Can this be moved to a #define or constexpr or something?
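A hedged sketch of that suggestion, assuming the constant feeds Grok-1's tanh logit soft-cap (the released Grok-1 code caps attention logits as 30 * tanh(x / 30)); USE_GROK_ATTN_CAP and the helper name are hypothetical, not identifiers from this PR:

// Hypothetical sketch: hoist the Grok-specific logit cap out of the kernel
// body into a named compile-time constant, gated so other models keep the
// existing uncapped behavior.
#ifdef USE_GROK_ATTN_CAP
constexpr float kMaxAttnVal = 30.0f;  // Grok-1 attention logit soft-cap

__device__ __forceinline__ float cap_attn_logit(float qk) {
  // Grok-1 style soft-cap: cap * tanh(qk / cap).
  return kMaxAttnVal * tanhf(qk / kMaxAttnVal);
}
#else
__device__ __forceinline__ float cap_attn_logit(float qk) {
  return qk;  // other models: behavior unchanged
}
#endif

// In paged_attention_kernel, after computing qk:
//   qk = cap_attn_logit(qk);

A named constexpr keeps the value greppable, and the guard lets the compiler drop the tanh path entirely for models that do not need it.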

@@ -998,4 +1000,4 @@ void paged_attention_v2(
#undef WARP_SIZE
#undef MAX
#undef MIN
#undef DIVIDE_ROUND_UP

Does this affect anything else? Still need everything else to work.

@@ -328,7 +332,7 @@ if (VLLM_PUNICA_GPU_ARCHES)
       DESTINATION vllm
       LANGUAGE ${VLLM_GPU_LANG}
       SOURCES ${VLLM_PUNICA_EXT_SRC}
-      COMPILE_FLAGS ${VLLM_PUNICA_GPU_FLAGS}
+      COMPILE_FLAGS ${VLLM_PUNICA_GPU_eLAGS}

This looks to be a typo, probably does not build.

@sogalin (Author) commented Sep 12, 2024

Thanks for the feedback; we will correct this in our new PR #181, based on vLLM 0.6.0.

@sogalin closed this Sep 12, 2024