Releases · ggerganov/llama.cpp
b4202
b4201
Add some minimal optimizations for CDNA (#10498)
* Add some minimal optimizations for CDNA
* ggml_cuda: set launch bounds also for GCN as it helps there too
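For context, a minimal sketch of the mechanism this entry refers to, assuming a trivial element-wise kernel (the kernel name and block size below are hypothetical, not the actual ggml-cuda code). `__launch_bounds__` promises the compiler the kernel is never launched with more than the given threads per block, so it can budget registers more tightly; under HIP the same attribute benefits AMD GCN/CDNA, per the note above.

```cpp
// Sketch only: illustrates __launch_bounds__, not the actual ggml-cuda kernel.
#define SKETCH_BLOCK_SIZE 256 // hypothetical block size for this example

// With the launch bound known, the compiler can allocate registers more
// aggressively, which improves occupancy on NVIDIA and AMD GCN/CDNA alike.
__global__ void __launch_bounds__(SKETCH_BLOCK_SIZE)
scale_f32(const float * x, float * dst, const float scale, const int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = x[i] * scale;
    }
}
```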
b4200
ci : faster CUDA toolkit installation method and use ccache (#10537)
* ci : faster CUDA toolkit installation method and use ccache
* remove fetch-depth
* only pack CUDA runtime on master
b4195
vulkan: Handle GPUs with less shared memory (#10468)
There have been reports of compilation failures on GPUs with <= 32KB of shared memory (e.g. #10037). This change makes the large tile size fall back to a smaller one when necessary, and makes mul_mat_id fall back to the CPU when only 16KB of shared memory is available.
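A sketch of the fallback logic this note describes, with hypothetical names and tile categories (the real selection lives in the ggml Vulkan backend and its thresholds and tile shapes differ); the shared memory limit would come from `VkPhysicalDeviceLimits::maxComputeSharedMemorySize`:

```cpp
#include <cstdint>

// Hypothetical tile configuration for illustration only.
enum class tile_size { LARGE, MEDIUM, SMALL };

// Pick a matmul tile size from the device's shared memory limit
// (VkPhysicalDeviceLimits::maxComputeSharedMemorySize). Sketch only.
static tile_size pick_tile_size(uint32_t max_shared_mem) {
    if (max_shared_mem >= 48u * 1024u) {
        return tile_size::LARGE;  // enough shared memory for the large tiles
    }
    if (max_shared_mem > 16u * 1024u) {
        return tile_size::MEDIUM; // e.g. 32 KB devices: use smaller tiles
    }
    // 16 KB or less: too small for the shared-memory path; per the note
    // above, mul_mat_id falls back to the CPU on such devices instead.
    return tile_size::SMALL;
}
```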
b4191
ci : fix cuda releases (#10532)
b4179
ggml-cpu: cmake add arm64 cpu feature check for macos (#10487)
* ggml-cpu: cmake add arm64 cpu feature check for macos
* use vmmlaq_s32 for compile option i8mm check
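For reference, the kind of test program a `check_cxx_source_compiles()`-style CMake feature check could build to probe i8mm support; this is a sketch, not necessarily the exact test source used in the repo:

```cpp
// Minimal translation unit to compile with e.g. -march=armv8.2-a+i8mm: if
// vmmlaq_s32 (the i8mm int8 matrix multiply-accumulate intrinsic) compiles,
// the build can enable the corresponding compile option.
#include <arm_neon.h>

int main() {
    int32x4_t acc = vdupq_n_s32(0);
    int8x16_t a   = vdupq_n_s8(1);
    int8x16_t b   = vdupq_n_s8(2);
    // Each output lane is a dot product of eight int8 pairs: 8 * (1*2) = 16.
    acc = vmmlaq_s32(acc, a, b);
    return vgetq_lane_s32(acc, 0) == 16 ? 0 : 1;
}
```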
b4178
server : fix parallel speculative decoding (#10513)
ggml-ci
b4177
speculative : simplify the implementation (#10504)
ggml-ci
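As background, a self-contained sketch of the draft-then-verify loop that speculative decoding implements; plain vectors stand in for the model calls, and none of these names are the llama.cpp API:

```cpp
#include <cstdio>
#include <vector>

using token = int;

// One round of speculative decoding: the draft model cheaply proposes a few
// tokens, the target model scores them all in a single batch, and we keep
// the longest agreeing prefix plus the target's token at the first mismatch.
static std::vector<token> speculative_step(const std::vector<token> & drafted,
                                           const std::vector<token> & target) {
    // target has drafted.size() + 1 entries: one prediction per drafted
    // position, plus a "bonus" token used when every draft is accepted.
    std::vector<token> accepted;
    size_t i = 0;
    for (; i < drafted.size() && drafted[i] == target[i]; ++i) {
        accepted.push_back(drafted[i]); // both models agree: keep draft token
    }
    accepted.push_back(target[i]); // correction (or bonus) from the target
    return accepted;
}

int main() {
    const std::vector<token> drafted = { 10, 11, 12, 13 };     // draft proposals
    const std::vector<token> target  = { 10, 11, 99, 13, 14 }; // target predictions
    for (token t : speculative_step(drafted, target)) {
        printf("%d ", t); // prints: 10 11 99
    }
    printf("\n");
    return 0;
}
```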
b4176
CANN: Improve the Inferencing Performance for Ascend NPU Device (#10454)
* improve inferencing performance for ascend npu
* some modifications after review
* restore some modifications
---------
Co-authored-by: shanshan shen <[email protected]>
Co-authored-by: Frank Mai <thxCode@[email protected]>
b4175
CANN: RoPE and CONCAT operator optimization (#10488)
Co-authored-by: noemotiovon <[email protected]>
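For context on the RoPE operator being optimized, a scalar reference sketch: each dimension pair (x[2i], x[2i+1]) is rotated by a position-dependent angle theta_i = pos * base^(-2i/d). The real backend kernels are vectorized and support more variants (scaling, NeoX-style layouts); this shows only the core rotation.

```cpp
#include <cmath>
#include <cstdio>

// Scalar RoPE sketch: rotate each dimension pair by
// theta = pos * base^(-i/n_dims), i stepping over even indices.
static void rope_ref(float * x, int n_dims, int pos, float base = 10000.0f) {
    for (int i = 0; i < n_dims; i += 2) {
        const float theta = pos * std::pow(base, -(float) i / (float) n_dims);
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}

int main() {
    float v[4] = { 1.0f, 0.0f, 1.0f, 0.0f };
    rope_ref(v, 4, /*pos =*/ 2);
    printf("%f %f %f %f\n", v[0], v[1], v[2], v[3]);
    return 0;
}
```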