forked from flexflow/flexflow-train
Merge inference into BertMLM_fix #2
Open: xinhaoc wants to merge 401 commits into xinhao_candle from xinhao_inference
* fix hip_rocm build with sentencepiece
* shellcheck 1
* shellcheck 2
* shellcheck 3
* fix install script
* .github/workflows/helpers/install_dependencies.sh
* fix
* shellcheck
* restore unnecessary changes
* fix build
* removed outdated test from c++ tests
* update link in readme

* implemented file-based configs, remove spec_pipeline folder
* fix
* add inference test, script to download weights
* update readme
* update ci scripts
* newlines
* fix gpu-ci
* fix
* fix
* update test file
* added incr decoding program, moved LLAMA folder from examples
* linting
* add incremental decoding to test
* update readme
* add script to download opt weights
* fix support for opt, move code to root inference folder
* linting
* update test file
* fix
* bug fix
* update test

…exflow#736)
* making TreeIncMultiHeadSelfAttentionMeta a subclass of IncMultiHeadSelfAttentionMeta
* make BeamSearchIncMultiHeadAttentionMeta a subclass of IncMultiHeadAttentionMeta
* format
* merging kernel functions
* merge more functions
* merge compute_qkv_kernel
* format
* fix config
---------
Co-authored-by: xinhaoc <[email protected]>

* fix alignment bugs (part 1)
* add missing file

…ttention (flexflow#737)
* making TreeIncMultiHeadSelfAttentionMeta a subclass of IncMultiHeadSelfAttentionMeta
* make BeamSearchIncMultiHeadAttentionMeta a subclass of IncMultiHeadAttentionMeta
---------
Co-authored-by: xinhaoc <[email protected]>

* save output to file
* add alignment tests
* fix
* change conflicting name, add comments
* fix typo
* formatting
* more comments and clean dead code
* formatting
* fixed issue with length mismatch
* fix ci skip
* update inf test
* add precision selection support in incr decoding

* Update README.md
* update readme
* fix

…d tests (flexflow#749)
* add support for downloading mixed precision llama/opt weights
* fix
* update test script to also run half precision tests
* disable workflow for inference PRs
* add verbose option
* linting
* copy opt weights in download weights script
* add alignment tests with huggingface (llama)
* fix, add diff to test script
* fix
* add opt tests
* comment out tests not passing
* add e2e latency to output files
* add speed tests
* shellcheck
* shellcheck
* fix
* fix
* linting
* fix

* Add support for login information with multiple ssms.
* Update prepare_next_batch_verify.
* Add dedup tree merge.
* Format.
* Fix bugs.
* Runs with multiple models.
* Fix.
* Format
* Fix.
* Fix incremental decoding.
* fix use_full_precision issue.

* fix
* fix workflow

* Fix bug in elementwise multiplication with broadcasting (flexflow#764)
* Fix multinode test (flexflow#766)
* Fix UCX multinode test (flexflow#768)
* fix
* fix 2
* Prevent format.sh from formatting triton (flexflow#756)
* [CI] - Increase timeout in multinode test (UCX & MPI) (flexflow#773)
* fix
* fix 2
* increase timeout
* Fix docker builds in CI (flexflow#774)
---------
Co-authored-by: Soumya Chatterjee <[email protected]>
Co-authored-by: Colin Unger <[email protected]>

* init
* add mlc tokenizer.
* .
* fix
* fix pipeline, fix name
* .
* format
* ci
* .
* add rust
* fix
* .
* inf test fix
* .
* fix
* .
* fix
* optimize
* move rust to conda env
* .
* .
* fix
* fix
* fix
* update git ignore
* fix rust install
* Update config.linux
---------
Co-authored-by: Gabriele Oliaro <[email protected]>

* fix gpu-ci
* add check for rust in cmake

* decomp
* initial implementation
* add missing file
* checkpoint
* more bug fixes
* update default offload size
* fix non-offload
* undo changes to spec_inc_mha
* fix a parallel tensor reuse bug
* prepare_next_batch for offload(inc_decode)
* format
* int4&int8 offload
* fix merge issue
* fix build
* spec_infer offload&quantize
* fix, update readme.
* remove redundant
* hip build
* hip
* model param
---------
Co-authored-by: xinhaoc <[email protected]>

* add parallel operators
* add cmd line param
* setting machine views
* move bias blocks
* comment out print of partitions
* add unimplemented methods
* add impl of inference functions to replicate and reduce ops
* replicate bias in file loader
* fixes, now works
* only add bias once
* load and use weights according to partition
* fix wout weight
* cleanup
* add support for mixed precision in parallel ops
* cleanup
* rocm build fix
* hip rocm fix 2
* fix machine views
* fix rocm build
* adjust number of pipeline stages
* add model parallelism to opt linear layers
* fix
* fix multi gpu test
* fix
* add tensor parallelism tests to inference test script
* enable tensor parallelism for dense layers in llama
* fix
* fix set_tensor-related issues
* fix and linting

* Docker-build and Publish Modification
**Description of changes:**
Add code to docker-build.yml that automatically builds and publishes the images when a push happens to the inference branch. Also modify publish.sh so that the image name is created from the "image" and "branch" names, to distinguish these images from those built from the master branch.
**Related Issues:**
Linked Issues:
- Issue #
Issues closed by this PR:
- Closes #
**Before merging:**
- [ ] Did you update the [flexflow-third-party](https://github.com/flexflow/flexflow-third-party) repo, if modifying any of the Cmake files, the build configs, or the submodules?
* update container name
* specinfer env publish
* tag specinfer
* add spaces
* newline
* fix
* fix gpu ci workflow
---------
Co-authored-by: Gabriele Oliaro <[email protected]>

* fix linear region requirement
* fix set tensor issue

Update links/names of the docker containers from flexflow-{cuda, hip_rocm} to specinfer-{cuda, hip_rocm}, with a disclaimer about the CUDA version.
Co-authored-by: Gabriele Oliaro <[email protected]>

* bug fixes and update Legion version
* fix
* bug fix
* update legion
* fix arithmetic error due to num_devices uninitialized
* update legion version
* update ci
* fix
* debugging ci
* Revert "debugging ci"
  This reverts commit 0b3148e.
---------
Co-authored-by: Gabriele Oliaro <[email protected]>

…w#1246)
* add a background server for RequestManager
* .
* make incr_decoding work
* make spec_infer work
* format
* update python inference
* fix python issues
* bug fix
* add a Legion future to capture the termination of the background server
* gradio finished
* chatbot gradio version 2
* chainlit1
* chainlit2
* fastapi done
* fastapi incr_decoding
* langchain example & wrapper class
* langchain example & wrapper class1
* added documentation
* entrypoint
* del apikey
* delete extra files
* rag search fixed some bugs
* fixed rag search issues
* updates before rebase
* minor changes
* reorganize files
* Add thread safety for background server.
* Simplify backend server design.
* resolve conflict.
* specinfer use cases with issues labeled
* specinfer use cases with issues labeled 2
* fixed issues with prompt template
* fix issues with rag specinfer
* Add server task timeout.
* register callbacks to terminate background worker at exit or termination
* [Python] enable decoding multiple requests
* update README.md and default configuration
* fix issues with gradio and prompt template
* fix issues with rag
* adjusted fastapi entrypoint
* update documentation
* resolve conflicts
* issues fix
* adjustments on use cases and api entrypoints
* remove redundant changes
* testing CI
* Enable backtrace
* restore newlines
* version
* add back mistakenly deleted line
* legion version
---------
Co-authored-by: Zhihao Jia <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: zwang86 <[email protected]>
Co-authored-by: Zeyu Wang <[email protected]>
Co-authored-by: xinhaoc <[email protected]>

* bug fixes and update Legion version
* fix
* bug fix
* update legion
* fix arithmetic error due to num_devices uninitialized
* update legion version
* update ci
* fix
* debugging ci
* Revert "debugging ci"
  This reverts commit 0b3148e.
* update mapper interface
* add ncclFinalize
* Only delete nccl communications for training jobs
---------
Co-authored-by: Zhihao Jia <[email protected]>

* modify README
* fix link issues
* update legion version
---------
Co-authored-by: Zhihao Jia <[email protected]>

* .
* remove dead code
* add benchmarking mode, initializing weights randomly
* better logging when running out of memory
* update
---------
Co-authored-by: Gabriele Oliaro <[email protected]>

Co-authored-by: Gabriele Oliaro <[email protected]>