Custom AllReduce #1467

Merged
merged 54 commits on Aug 15, 2024
Commits
7e02ef0
feat: update deps/flashinfer
chenzhuofu Jun 29, 2024
be64b3b
feat: update flashinfer
chenzhuofu Jun 29, 2024
7a4543b
fix: now can get correct result, but has performance problem
chenzhuofu Jun 29, 2024
e1c8145
fix: update_custom_mask performance
chenzhuofu Jul 1, 2024
483b71b
chore: minor
chenzhuofu Jul 1, 2024
0a117fb
chore: add perf code
chenzhuofu Jul 1, 2024
ac343ee
Merge branch 'specscheduler' into specscheduler-new-attention
zikun-li Jul 3, 2024
3e9dc00
feat: add attention metadata
chenzhuofu Jul 4, 2024
a498e6f
feat: add AttentionMetaData
chenzhuofu Jul 4, 2024
2903996
feat: tree_verify_attn use global attentionmetadata
chenzhuofu Jul 4, 2024
fed9e5c
feat: move attentionmetasize to global computing
chenzhuofu Jul 4, 2024
7c64f36
chore: minor
chenzhuofu Jul 4, 2024
37d3e3d
chore: remove unused
chenzhuofu Jul 5, 2024
9940ff5
feat: add spec_inc_attn backup
chenzhuofu Jul 5, 2024
5262af7
feat: SSM use flashinfer kernel
chenzhuofu Jul 5, 2024
dfe4bec
fix: SSM don't use cudaGraph
chenzhuofu Jul 5, 2024
e08f06e
chore: remove redundant code
chenzhuofu Jul 5, 2024
ed544e9
chore: comment out minor
chenzhuofu Jul 7, 2024
214aed6
feat: attention adapt to cudaGraph
chenzhuofu Jul 8, 2024
020129b
fix: split handler_collections for prompt/decode phases
chenzhuofu Jul 8, 2024
0150501
chore: tree verify cannot use cudaGraph
chenzhuofu Jul 8, 2024
d4d66af
feat: move all flashinfer-related states to global (tree search atten…
chenzhuofu Jul 9, 2024
b4cc53b
fix: use identical attention_meta instance across all FFHandlers
chenzhuofu Jul 9, 2024
79ea1d9
feat: enable cudaGraph in tree search mode
chenzhuofu Jul 9, 2024
8d52e4f
chore: minor
chenzhuofu Jul 9, 2024
2afb66c
feat: tree search & verify use separate attention_meta
chenzhuofu Jul 10, 2024
74e166f
fix: attention_metadata should be distinct for each worker
chenzhuofu Jul 10, 2024
252a5c4
feat: tree verify attention use metadata
chenzhuofu Jul 10, 2024
9fda3b6
feat: support llm cudaGraph
chenzhuofu Jul 10, 2024
6f88134
chore: minor
chenzhuofu Jul 10, 2024
dca40b8
chore: temporarily only enable SSM cudaGraph due to a performance issue
chenzhuofu Jul 10, 2024
0e5ec41
chore: minor
chenzhuofu Jul 10, 2024
dd4d6d0
fix: llm cudaGraph, should ensure the kernel parameter be consistent
chenzhuofu Jul 28, 2024
18fee1c
feat: reduce cudaGraph number
chenzhuofu Jul 28, 2024
b17c5cb
feat: reduce cudaGraph instances number
chenzhuofu Jul 28, 2024
428875c
feat: add tensorRT-LLM custom_allreduce
chenzhuofu Jul 29, 2024
840da50
feat: add tensorrt_llm custom_allreduce kernel into executable
chenzhuofu Jul 31, 2024
5f16bd4
doc: add a README for acknowledgement
chenzhuofu Jul 31, 2024
3d50053
feat: add device info in FFHandle
chenzhuofu Aug 1, 2024
273dfc7
feat: enable both cudaGraph
chenzhuofu Aug 2, 2024
d201b2e
feat: temporarily add the ipc mem
chenzhuofu Aug 2, 2024
1ba38a8
feat: enable only ssm cudaGraph
chenzhuofu Aug 4, 2024
7b235ef
feat: minor reconstruct
chenzhuofu Aug 4, 2024
f3c9629
feat: implementation of CommunicationBuffer
chenzhuofu Aug 4, 2024
462e0b7
feat: implement custom_allreduce
chenzhuofu Aug 5, 2024
567e165
feat: allocate memory from legion, not cudaMalloc
chenzhuofu Aug 9, 2024
a2fb367
chore: some debug output
chenzhuofu Aug 11, 2024
4134013
feat: switch to use peer memory, rather than IPC memory
chenzhuofu Aug 13, 2024
c094981
chore: remove debug output
chenzhuofu Aug 13, 2024
95ca71c
fix: minor concurrent bug
chenzhuofu Aug 13, 2024
55a7942
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Aug 14, 2024
b9760c0
style: format code
chenzhuofu Aug 14, 2024
102dd38
chore: remove unused backup code
chenzhuofu Aug 14, 2024
0219b78
chore: more measurements
chenzhuofu Aug 15, 2024
6 changes: 6 additions & 0 deletions CMakeLists.txt
@@ -301,6 +301,12 @@ if(NOT BUILD_LEGION_ONLY)
LIST_DIRECTORIES False
${FLEXFLOW_ROOT}/src/*.cu)

# tensorrt_llm custom allreduce
if(FF_USE_NCCL)
list(APPEND FLEXFLOW_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/deps/tensorrt_llm)
list(APPEND FLEXFLOW_GPU_SRC ${CMAKE_CURRENT_SOURCE_DIR}/deps/tensorrt_llm/tensorrt_llm/custom_allreduce_kernels.cu)
endif()

add_compile_definitions(FF_USE_CUDA)

if(BUILD_SHARED_LIBS)
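(Editorial aside, not part of the diff.) The CMake snippet above only compiles the TensorRT-LLM-derived kernel when `FF_USE_NCCL` is set. A plausible, hypothetical sketch of how a runtime might choose between that kernel and a plain `ncclAllReduce` is shown below; `custom_all_reduce()` and `CUSTOM_AR_MAX_BYTES` are illustrative names, not APIs introduced by this PR, and only `ncclAllReduce` is a real NCCL call.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <cstddef>

// Illustrative cutoff: custom all-reduce kernels of this kind are typically
// used for small, intra-node messages; the exact threshold is an assumption.
constexpr size_t CUSTOM_AR_MAX_BYTES = 8 * 1024 * 1024;

// Hypothetical wrapper over the kernels in
// deps/tensorrt_llm/tensorrt_llm/custom_allreduce_kernels.cu (not the PR's
// actual interface).
void custom_all_reduce(float const *send, float *recv, size_t count,
                       cudaStream_t stream);

void all_reduce_sum(float const *send, float *recv, size_t count,
                    ncclComm_t comm, cudaStream_t stream) {
  if (count * sizeof(float) <= CUSTOM_AR_MAX_BYTES) {
    custom_all_reduce(send, recv, count, stream); // custom kernel path
  } else {
    ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm, stream);
  }
}
```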
5 changes: 5 additions & 0 deletions deps/tensorrt_llm/README.md
@@ -0,0 +1,5 @@
## Custom AllReduce Implementation

This is an adapted version of the custom AllReduce plugin from NVIDIA's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) repository.

To replace the NCCL AllReduce call, we also add CUDA IPC support for the custom AllReduce path. Our IPC and AllReduce implementation is based on [mlc-ai/relax](https://github.com/mlc-ai/relax).
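(Editorial aside, not part of the README diff.) A minimal sketch of the CUDA IPC buffer exchange the README alludes to: each rank exports a handle to its communication buffer and maps its peers' buffers into its own address space. The `exchange_handles()` transport and the use of `cudaMalloc` are assumptions for illustration; only the `cudaIpc*` calls are real CUDA runtime APIs.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical transport: all-gather each rank's IPC handle (e.g. over MPI or
// a runtime-provided channel); not part of this PR.
std::vector<cudaIpcMemHandle_t> exchange_handles(cudaIpcMemHandle_t mine,
                                                 int my_rank, int world_size);

// Sketch: allocate a local communication buffer, export it via CUDA IPC, and
// map every peer's buffer into this process's address space.
void *setup_comm_buffer(size_t bytes, int my_rank, int world_size,
                        std::vector<void *> &peer_ptrs) {
  void *local_buf = nullptr;
  cudaMalloc(&local_buf, bytes);

  cudaIpcMemHandle_t my_handle;
  cudaIpcGetMemHandle(&my_handle, local_buf);

  std::vector<cudaIpcMemHandle_t> handles =
      exchange_handles(my_handle, my_rank, world_size);

  peer_ptrs.assign(world_size, nullptr);
  for (int r = 0; r < world_size; ++r) {
    if (r == my_rank) {
      peer_ptrs[r] = local_buf; // our own buffer needs no IPC mapping
    } else {
      cudaIpcOpenMemHandle(&peer_ptrs[r], handles[r],
                           cudaIpcMemLazyEnablePeerAccess);
    }
  }
  return local_buf;
}
```

Per the commit history above, later commits in this PR allocate the buffer from Legion rather than `cudaMalloc` and switch from IPC memory to direct peer memory access, so this sketch only mirrors the IPC variant the README describes.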