Update to b1892 #5
Closed
Conversation
Fix bug in identifying the grammar.
* build : Check the ROCm installation location
* more generic approach
* fixup! It was returning the path instead of the command output
* fixup! Trailing whitespace
* llama.swiftui : add bench button
* llama.swiftui : initial bench functionality
* force to use n_gpu_layers on simulator
* add download buttons & expose llamaState.loadModel
* update project.pbxproj
* comment #Preview & fix editorconfig check
* gitignore : xcode stuff
* llama.swiftui : UX improvements
* llama.swiftui : avoid data copy via "downloadTask"
* llama.swiftui : remove model from project
* llama : remove "mostly" from model infos
* llama.swiftui : improve bench

---------

Co-authored-by: jhen <[email protected]>
…4490)

* phi2 implementation
* fix breaking change
* phi-2 : various fixes
* phi-2 : use layer norm eps
* py : whitespaces
* llama : fix meta KV override bug
* convert : phi don't add BOS token
* convert : revert "added_tokens_decoder" change
* phi-2 : scale Q instead of KQ for better precision
* ggml : fix NeoX rope to rotate just first n_dims
* cuda : less diff in the rope_neox kernel
* ggml : add ggml_mul_mat_set_prec ggml-ci
* Update ggml-cuda.cu
Co-authored-by: slaren <[email protected]>
* Update ggml-cuda.cu
Co-authored-by: slaren <[email protected]>
* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision
* cuda : remove obsolete comment

---------

Co-authored-by: Ebey Abraham <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: slaren <[email protected]>
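A minimal sketch of the "scale Q instead of KQ" idea (the ggml calls and variable names here are illustrative of the technique, not the verbatim upstream diff): scaling Q by 1/sqrt(d_head) before the K^T·Q matmul keeps the intermediate attention logits in a safer numeric range for F16 accumulation than scaling the much larger KQ product afterwards.

```c
// Illustrative sketch: scale Q up front so the attention logits never
// hold the unscaled K^T*Q values in half precision.
struct ggml_tensor * q  = ggml_scale(ctx0, Qcur, 1.0f / sqrtf((float) n_embd_head));
struct ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);
// old order: kq = ggml_scale(ctx0, ggml_mul_mat(ctx0, k, Qcur), 1.0f/sqrtf((float) n_embd_head));

// Where extra headroom is still needed, force F32 accumulation for this matmul:
ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
```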
Fixes a regression from ggerganov#4490: adds defines for the two new datatypes, cublasComputeType_t and cudaDataType_t. These currently map to the deprecated hipblasDatatype_t, since the newer hipBLAS replacements are too recent to rely on.
Co-authored-by: Eric Sommerlade <[email protected]>
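A sketch of the shim this describes (the two type aliases follow the commit description; the CUBLAS_COMPUTE_* mappings shown are assumptions):

```c
// HIP build: alias the new CUDA type enums onto hipBLAS equivalents.
// hipblasDatatype_t is deprecated, but its replacements are too new to require.
#if defined(GGML_USE_HIPBLAS)
#define cublasComputeType_t hipblasDatatype_t
#define cudaDataType_t      hipblasDatatype_t
#define CUBLAS_COMPUTE_16F  HIPBLAS_R_16F
#define CUBLAS_COMPUTE_32F  HIPBLAS_R_32F
#endif
```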
* CUDA: make MoE tensors contiguous for batch size>1
* Update ggml-cuda.cu
Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
…ganov#4556)

* cuda : replace asserts in wrong architecture checks with __trap
* make bad_arch noreturn, remove returns
* Update ggml-cuda.cu
* Update ggml-cuda.cu
* Update ggml-cuda.cu

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Otherwise, on Windows, converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error, because Python opens merges.txt with the platform default codec (cp1252 on Windows) rather than UTF-8:

```
Traceback (most recent call last):
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module>
    model_instance.set_vocab()
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab
    self._set_vocab_gpt2()
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2
    special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__
    self._load(Path(path))
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load
    self._try_load_merges_txt(path)
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt
    for line in fp:
  File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined>
```
Regression of 1398823: HIP doesn't have __trap(), only abort().
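Taken together with the __trap change above, the resulting guard might look roughly like this (a sketch; the helper name and message are illustrative, not the verbatim upstream code):

```c
// HIP has no __trap() intrinsic; abort() is the closest device-side equivalent.
#if defined(GGML_USE_HIPBLAS)
#define __trap() abort()
#endif

// Called from device code paths compiled for an unsupported GPU architecture.
// Marking it noreturn lets callers drop dummy return statements.
[[noreturn]] static __device__ void bad_arch(void) {
    printf("ERROR: ggml-cuda was compiled without support for the current GPU architecture.\n");
    __trap();
}
```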
…#4449)

* AMD ROCm: handle UMA memory VRAM expansions

This resolves ggerganov#2797 by allowing ROCm AMD GPU users with a UMA to dynamically expand the VRAM allocated to the GPU. Without this, AMD ROCm users with shared CPU/GPU memory are usually stuck with the BIOS-set (or fixed) framebuffer VRAM, making it impossible to load more than 1-2 layers.

Note that the model is duplicated in RAM because it's loaded once for the CPU and then copied into a second set of allocations that are managed by the HIP UMA system. We can fix this later.

* clarify build process for ROCm on Linux with cmake
* avoid using deprecated ROCm hipMallocHost
* keep simplifying the change required for UMA
* cmake : enable UMA-compatible allocation when LLAMA_HIP_UMA=ON
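A sketch of how the UMA-compatible allocation can be wired in (the macro names follow the commit description; the exact guard is an assumption):

```c
// With -DLLAMA_HIP_UMA=ON, route device allocations through HIP's
// unified-memory allocator so "VRAM" can grow into system RAM instead of
// being capped at the BIOS-reserved framebuffer carve-out.
#if defined(GGML_USE_HIPBLAS) && defined(GGML_HIP_UMA)
#define cudaMalloc hipMallocManaged
#else
#define cudaMalloc hipMalloc
#endif
```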
…4540)

* allowed getting n_batch from llama_context in c api
* changed to use `uint32_t` instead of `int`
* changed to use `uint32_t` instead of `int` in `llama_n_ctx`
* Update llama.h

---------

Co-authored-by: Georgi Gerganov <[email protected]>
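The resulting C-API surface would look roughly like this (a sketch; consult llama.h at this revision for the authoritative declarations):

```c
// llama.h: both accessors now return uint32_t instead of int.
LLAMA_API uint32_t llama_n_ctx  (const struct llama_context * ctx);
LLAMA_API uint32_t llama_n_batch(const struct llama_context * ctx);
```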
* llama : minor fix indent
* llama : check LLAMA_TRACE env for extra logging

ggml-ci
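A minimal sketch of the env-gated logging pattern this describes (the helper name and placement are assumptions; only the LLAMA_TRACE variable comes from the commit):

```c
#include <stdbool.h>
#include <stdlib.h>

// Extra logging is opt-in: export LLAMA_TRACE=1 before running.
static bool llama_trace_enabled(void) {
    return getenv("LLAMA_TRACE") != NULL;
}
```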
Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Iwan Kawrakow <[email protected]>
* speculative: expose draft threading
* fix usage format
* accept -td and -tbd args
* speculative: revert default behavior when -td is unspecified
* fix trailing whitespace
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.

Signed-off-by: Daniel Bevenius <[email protected]>
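For reference, the llama.h constant packs the ASCII bytes 'ggla' into a 32-bit magic (shown as a sketch; verify against the header):

```c
#define LLAMA_FILE_MAGIC_GGLA 0x67676c61u // 'ggla'
```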
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as independent dynamic shared objects that may be conditionally linked at runtime in a multiplatform binary. It introduces a GGML_CALL annotation that documents which functions have a cyclic call relationship between the application code and the GPU modules. This change does nothing unless the build defines -DGGML_MULTIPLATFORM, which causes back-references and function pointers to conform to the MS ABI, which is supported by NVCC, ROCm, Xcode, GCC, and Clang across platforms.
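A sketch of how such an annotation can be defined (the conditional structure is an assumption based on the description above, not the verbatim header):

```c
// ggml.h (sketch): under -DGGML_MULTIPLATFORM, pin boundary-crossing
// functions to the MS ABI so one binary can call into separately built
// GPU modules; otherwise the annotation compiles away to nothing.
#ifdef GGML_MULTIPLATFORM
#    if defined(_WIN32)
#        define GGML_CALL // MS ABI is already the default on Windows
#    else
#        define GGML_CALL __attribute__((__ms_abi__))
#    endif
#else
#    define GGML_CALL
#endif
```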
)

* Create pydantic-models-to-grammar.py
* Added some comments for usage
* Refactored Grammar Generator. Added example and usage instruction.
* Update pydantic_models_to_grammar.py
* Update pydantic-models-to-grammar-examples.py
* Renamed module and imported it.
* Update pydantic-models-to-grammar.py
* Renamed file and fixed grammar generator issue.
* Fixed some issues and bugs of the grammar generator. Improved documentation.
* Update pydantic_models_to_grammar.py
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+
* Only log on iOS and macOS, ignoring tvOS and other platforms
* Check for Xcode version before using recommendedMaxWorkingSetSize

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…#4934)

* Replace loop of dispatch_async with dispatch_apply
* Update ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <[email protected]>
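Since libdispatch's dispatch_apply is a plain C API, the change can be pictured like this (a sketch under assumed names, not the actual ggml-metal.m diff): it submits all n_cb iterations and only returns once every one has completed, replacing a hand-rolled dispatch_async fan-out plus explicit waits.

```c
#include <dispatch/dispatch.h>

// Encode each slice of the graph concurrently; dispatch_apply blocks
// until all iterations finish, so no extra synchronization is needed.
dispatch_apply(n_cb, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t cb_idx) {
    encode_graph_slice(cb_idx); // assumed helper for the cb_idx-th command buffer
});
```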
* Introduce starter project for Android

Based on examples/llama.swiftui.

* Add github workflow
* Set NDK version
* Only build arm64-v8a in CI
* Sync bench code
* Rename CI prop to skip-armeabi-v7a
* Remove unused tests
* Metal: Localized logic in `ggml_metal_graph_compute`, minor performance improvement
* Whitespace
* Collecting command buffer completions on single thread
* Whitespace
* Reduce diff noise
…gerganov#4920)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/34fed993f1674c8d06d58b37ce1e0fe5eebcb9f5' (2023-12-01)
  → 'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)

• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/e92039b55bcd58469325ded85d4f58dd5a4eaf58?dir=lib' (2023-11-29)
  → 'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cfc3698c31b1fb9cdcf10f36c9643460264d0ca8' (2023-12-27)
  → 'github:NixOS/nixpkgs/317484b1ead87b9c1b8ac5261a8d2dd748a0492d' (2024-01-08)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* imatrix: adding support for legacy quants
* imatrix: guard Q4_0/Q5_0 against ffn_down craziness

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
This commit adds the name of the training data file to the log message printed when the training data is tokenized. The motivation for this change is that it can be useful to show which file is being tokenized when running the finetune example.

Signed-off-by: Daniel Bevenius <[email protected]>