Homepage: https://icml.cc/Conferences/2024
- Serving LLMs
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment [Personal Notes] [arXiv] [Code]
- HKUST & ETH & CMU
- Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline stage can be assigned a different number of layers and a different tensor model parallel degree).
- Propose a heuristic-based evolutionary algorithm to search for the optimal layout.
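The asymmetric layout idea above can be sketched as a tiny search: a layout assigns each pipeline stage its own layer count and tensor-parallel degree, and a simple mutation loop (a stand-in for the paper's heuristic evolutionary search) minimizes a toy cost model. All constants (`TOTAL_LAYERS`, `GPU_SPEEDS`) and the cost function are illustrative assumptions, not HexGen's actual model.

```python
import random

TOTAL_LAYERS = 32                   # hypothetical model depth
GPU_SPEEDS = [1.0, 1.0, 0.5, 0.5]   # hypothetical heterogeneous per-stage throughputs

def random_layout(num_stages=4):
    """A layout gives each pipeline stage its own (layer count, TP degree)."""
    cuts = sorted(random.sample(range(1, TOTAL_LAYERS), num_stages - 1))
    layers = [b - a for a, b in zip([0] + cuts, cuts + [TOTAL_LAYERS])]
    tp = [random.choice([1, 2, 4]) for _ in range(num_stages)]
    return list(zip(layers, tp))

def cost(layout):
    """Toy cost model: pipeline throughput is bounded by the slowest stage."""
    return max(l / (tp * s) for (l, tp), s in zip(layout, GPU_SPEEDS))

def mutate(layout):
    """Either move one layer between stages or change one stage's TP degree."""
    layout = [list(stage) for stage in layout]
    i = random.randrange(len(layout))
    if random.random() < 0.5 and layout[i][0] > 1:
        j = random.randrange(len(layout))
        layout[i][0] -= 1
        layout[j][0] += 1
    else:
        layout[i][1] = random.choice([1, 2, 4])
    return [tuple(stage) for stage in layout]

def evolve(generations=200):
    """Greedy evolutionary search: keep a mutation only if it lowers cost."""
    best = random_layout()
    for _ in range(generations):
        candidate = mutate(best)
        if cost(candidate) < cost(best):
            best = candidate
    return best
```

The real algorithm searches over GPU assignments and communication costs as well; this sketch only shows why per-stage layer counts and TP degrees form the search space.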
- MuxServe: Flexible Spatial-Temporal Multiplexing for LLM Serving [arXiv] [Code]
- CUHK & Shanghai AI Lab & HUST & SJTU & PKU & UC Berkeley & UCSD
- Colocate LLMs according to their popularity so they can multiplex GPU memory and compute resources.
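As a minimal sketch of the multiplexing idea: colocated models share a fixed GPU memory budget, and a popularity-proportional split is one simple policy (the actual MuxServe policy is more sophisticated; the model names and numbers here are made up).

```python
def memory_quota(popularity, total_gb):
    """Split a shared GPU memory budget among colocated LLMs in
    proportion to their request popularity (requests per second)."""
    total = sum(popularity.values())
    return {model: total_gb * p / total for model, p in popularity.items()}

# Hypothetical workload: a popular large model and two less-popular ones.
quotas = memory_quota(
    {"llama-70b": 6.0, "llama-13b": 3.0, "mistral-7b": 1.0},
    total_gb=80.0,
)
```

A popular model then gets more KV-cache room, so hot models are not starved by cold colocated ones.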
- APIServe: Efficient API Support for Large-Language Model Inferencing [arXiv]
- UCSD
- Benchmark
- Speculative decoding
- Online Speculative Decoding [arXiv]
- UC Berkeley & UCSD & Sisu Data & SJTU
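The core speculative decoding loop behind this line of work can be sketched as follows: a cheap draft model proposes `k` tokens, the target model verifies them, and the agreeing prefix is accepted plus one corrected token. The online variant additionally keeps distilling the draft model from the target's outputs during serving; this sketch shows only the basic greedy-verification step, with draft and target given as abstract `next_token` callables (an assumption for illustration).

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One speculative decoding step with greedy verification.

    draft_next / target_next: callables mapping a token list to the
    next token under the draft and target models, respectively.
    Returns the accepted tokens (agreeing prefix + one target token).
    """
    # Draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # Target model verifies left to right, keeping the agreeing prefix.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # The target always contributes one token (correction or bonus),
    # so each step emits at least one token.
    accepted.append(target_next(ctx))
    return accepted
```

When the draft model agrees with the target on all `k` tokens, one step emits `k + 1` tokens for a single verification pass, which is where the speedup comes from.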
- Video generation
- Image retrieval