
sync master #15

Merged: 132 commits on Jun 24, 2024

Commits on Jun 4, 2024

  1. ggml : prevent builds with -ffinite-math-only (ggerganov#7726)

    This enforces a check that -fno-finite-math-only was set and that the
    compiler is not operating in finite-math mode. This is because, during the
    rewrite of silu and softmax for CPU in ggerganov#7154, an issue emerged
    where the result observed with >1 slot was nondeterministic, as found by
    @JohannesGaessler.

    @LostRuins narrowed the problem down to -ffinite-math-only, theorised to be
    because SiLU, instead of flushing small values to 0, returns NaN or some
    other garbage. @jart proposed a fix that @ggerganov then implemented in
    this commit.
    
    ref ggerganov#7154 (comment)
    ggerganov authored Jun 4, 2024
    commit 6d16169
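
    As an aside, a minimal sketch (not ggml's actual code) of why SiLU depends
    on IEEE infinity handling, and hence why -ffinite-math-only is unsafe here:

    ```cpp
    // For large negative x, expf(-x) overflows to +inf and the division then
    // correctly flushes to -0.0f. Under -ffinite-math-only the compiler may
    // assume the inf never occurs and rewrite the expression, yielding NaN
    // or other garbage instead.
    #include <cmath>
    #include <cstdio>

    static float silu(float x) {
        return x / (1.0f + expf(-x));
    }

    int main() {
        printf("silu(-100) = %g\n", silu(-100.0f)); // -0 with IEEE math
        printf("silu(0)    = %g\n", silu(0.0f));    // 0.5
        printf("silu(3)    = %g\n", silu(3.0f));    // ~2.857
    }
    ```
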
  2. Per token attributes (ggerganov#7685)

    * Add per token attributes enum
    * Using phi-3 for testing 'rstrip'
    * Using jina-v2 for testing 'lstrip'
    * Brute force test for 'lstrip' and 'rstrip'
    * Implement 'rstrip' and 'lstrip'
    * Update phi-3 GGUF file (obsolete since 917dc8c)
    * Replace llama_token_type with llama_token_attribs
    jaime-m-p authored Jun 4, 2024
    commit 3b38d48
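
    A hypothetical sketch of the idea behind per-token attribute flags (names
    and values here are illustrative, not the exact llama.cpp definitions):

    ```cpp
    #include <cstdint>

    // Bitfield attributes attached to each vocab token; 'lstrip'/'rstrip'
    // mark tokens that consume whitespace to their left/right.
    enum token_attr : uint32_t {
        TOKEN_ATTR_NONE   = 0,
        TOKEN_ATTR_LSTRIP = 1u << 0,
        TOKEN_ATTR_RSTRIP = 1u << 1,
    };

    inline bool has_attr(uint32_t attrs, token_attr a) {
        return (attrs & a) != 0;
    }
    ```
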
  3. refine .gitignore (ggerganov#7688)

    This adds `tags` and the Android NDK to the .gitignore list.
    zhouwg authored Jun 4, 2024
    commit b226c12
  4. Improve hipBLAS support in CMake (ggerganov#7696)

    * Improve hipBLAS support in CMake
    
    This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK.
    
    * Set ROCM_PATH correctly
    daniandtheweb authored Jun 4, 2024
    commit 987d743
  5. llama-bench : allow using a different printer for stderr with -oe (ggerganov#7722)
    
    compare-commits.sh : hide stdout, use -oe to print markdown
    slaren authored Jun 4, 2024
    commit adc9ff3
  6. commit 5ca0944
  7. commit 0cd6bd3
  8. ggml : remove OpenCL (ggerganov#7735)

    ggml-ci
    ggerganov authored Jun 4, 2024
    commit 554c247
  9. common : refactor cli arg parsing (ggerganov#7675)

    * common : gpt_params_parse do not print usage
    
    * common : rework usage print (wip)
    
    * common : valign
    
    * common : rework print_usage
    
    * infill : remove cfg support
    
    * common : reorder args
    
    * server : deduplicate parameters
    
    ggml-ci
    
    * common : add missing header
    
    ggml-ci
    
    * common : remove --random-prompt usages
    
    ggml-ci
    
    * examples : migrate to gpt_params
    
    ggml-ci
    
    * batched-bench : migrate to gpt_params
    
    * retrieval : migrate to gpt_params
    
    * common : change defaults for escape and n_ctx
    
    * common : remove chatml and instruct params
    
    ggml-ci
    
    * common : passkey use gpt_params
    ggerganov authored Jun 4, 2024
    commit 1442677
  10. Allow number of nodes in CUDA graph to change (ggerganov#7738)

    Previously the code failed to cope when the number of nodes in an existing
    CUDA graph changed. This fixes the issue by removing an unnecessary
    conditional.
    agray3 authored Jun 4, 2024
    commit b90dc56
  11. commit c90dbe0

Commits on Jun 5, 2024

  1. readme : remove -ins (ggerganov#7759)

    -ins and --instruct were moved in ggerganov#7675
    
    I have adjusted the README accordingly.
    There was no trace of --chatml in the README.
    arch-btw authored Jun 5, 2024
    commit 9973e81
  2. ggml : refactor rope norm/neox (ggerganov#7634)

    * ggml : unify rope norm/neox (CPU)
    
    * ggml : fix compile warning
    
    * ggml : remove GLM rope mode
    
    ggml-ci
    
    * metal : better rope implementation
    
    ggml-ci
    
    * cuda : better rope implementation
    
    ggml-ci
    
    * naming : n_orig_ctx -> n_ctx_orig
    
    ggml-ci
    
    * dev : add reminders to update backends
    
    ggml-ci
    
    * vulkan : fix ggml_rope_ext() usage
    
    * cuda : fix array size + indents
    
    ggml-ci
    ggerganov authored Jun 5, 2024
    commit 2b33896
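
    A minimal reference (an assumption-laden sketch, not the ggml kernels) of
    the two RoPE layouts this refactor unifies: "norm" rotates adjacent
    element pairs, while "neox" pairs each element with its counterpart in the
    other half of the head dimension:

    ```cpp
    #include <cmath>
    #include <vector>

    // Rotate one head of size n_dims at position pos, in place.
    void rope_ref(std::vector<float> & x, int n_dims, int pos,
                  float theta_base, bool neox) {
        for (int i = 0; i < n_dims / 2; ++i) {
            const float theta = pos * std::pow(theta_base, -2.0f * i / n_dims);
            const float c = std::cos(theta), s = std::sin(theta);
            const int i0 = neox ? i              : 2 * i;     // first of pair
            const int i1 = neox ? i + n_dims / 2 : 2 * i + 1; // second of pair
            const float x0 = x[i0], x1 = x[i1];
            x[i0] = x0 * c - x1 * s;
            x[i1] = x0 * s + x1 * c;
        }
    }
    ```
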
  3. CUDA: refactor mmq, dmmv, mmvq (ggerganov#7716)

    * CUDA: refactor mmq, dmmv, mmvq
    
    * fix out-of-bounds write
    
    * struct for qk, qr, qi
    
    * fix cmake build
    
    * mmq_type_traits
    JohannesGaessler authored Jun 5, 2024
    commit 7d1a378
  4. commit 7672ade

Commits on Jun 6, 2024

  1. commit d67caea
  2. docker : build only main and server in their images (ggerganov#7782)

    * add openmp lib to dockerfiles
    
    * build only main and server in their docker images
    slaren authored Jun 6, 2024
    commit 2d08b7f
  3. llama : add jina v2 base code (ggerganov#7596)

    * feat: add changes to handle jina v2 base code
    
    * fix: do not complicate things
    
    * fix: fix the usage of the code model
    
    * fix: fix comments
    
    * fix: fix linting issues
    
    * fix: remove ollama patches
    
    * style : minor
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    JoanFM and ggerganov authored Jun 6, 2024
    commit f5d7b26
  4. grammars: x{min,max} repetition operator (ggerganov#6640)

    * grammars: x{min,max} repetition operator + tweak +/*/? to avoid duplication of original over alternates
    
    * grammars: handle `x{n}` and fix `x{n,n}`
    
    * grammars: document new repetition operators
    
    * grammars: uniform use of int for min & max
    
    * grammars: refactor parser test
    
    * grammar: parsing tests w/ natural pretty print of updated expectations
    
    * grammars: much prettier print of expectations (+ TEST_GRAMMAR_PARSER_PRINT_ALL=1 to force all)
    
    * grammars: improve test pretty print again
    
    * grammars: pretty print rules and chars
    
    * grammars: fix copy rule skipping
    
    * grammars: disallow `a{,}` (not allowed in regexps)
    
    * Update common/grammar-parser.cpp
    
    Co-authored-by: Clint Herron <[email protected]>
    
    * grammars: fix copy rule skipping (again) & display of expectations
    
    * grammars: more test cases
    
    * grammars: update reps parsing to bring ? / * / + closer to before
    
    * json: use new GBNF repetitions{m,n} syntax
    
    * grammars: update performance gotchas w/ repetition advice
    
    * Update examples/json_schema_to_grammar.py
    
    Co-authored-by: Clint Herron <[email protected]>
    
    * Update examples/server/public/json-schema-to-grammar.mjs
    
    Co-authored-by: Clint Herron <[email protected]>
    
    * grammars: comment on rule repetitions
    
    * grammars: ensure unambiguous number alternatives
    
    * grammar: nit typo switched error msgs
    
    * grammar: nit numbering in comment
    
    * json: update numeric rule to be unambiguous
    
    * Apply suggestions from code review
    
    Co-authored-by: Clint Herron <[email protected]>
    
    * Update examples/server/public/json-schema-to-grammar.mjs
    
    Co-authored-by: Clint Herron <[email protected]>
    
    * json: fix integral-part
    
    * grammar: add repetition tests
    
    ---------
    
    Co-authored-by: Clint Herron <[email protected]>
    ochafik and HanClinto authored Jun 6, 2024
    commit 55b2d08
  5. README minor fixes (ggerganov#7798) [no ci]

    derievatives --> derivatives
    Chediak authored Jun 6, 2024
    commit a143c04
  6. Added support for . (any character) token in grammar engine. (ggerganov#6467)
    
    * Added support for . (any character) token in grammar engine.
    
    * Add integration tests for any-character symbol.
    HanClinto authored Jun 6, 2024
    commit ad675e1
  7. imatrix : migrate to gpt_params (ggerganov#7771)

    * imatrix : migrate to gpt_params
    
    ggml-ci
    
    * imatrix : add --save-frequency cli arg
    
    * common : fix --no-ppl
    ggerganov authored Jun 6, 2024
    commit f83351f
  8. commit ee459f4

Commits on Jun 7, 2024

  1. check for nans in imatrix and quantize (ggerganov#7807)

    * imatrix : detect nan/inf values
    
    * quantize : check imatrix for nan/inf values
    slaren authored Jun 7, 2024
    commit c9ee711
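
    A minimal sketch (assuming plain f32 data) of the kind of validation this
    adds: scan the imatrix entries and refuse to proceed on nan/inf:

    ```cpp
    #include <cmath>
    #include <cstdio>

    // Returns false (and reports the first offender) if any value is nan/inf.
    bool validate_f32(const float * data, size_t n, const char * name) {
        for (size_t i = 0; i < n; ++i) {
            if (!std::isfinite(data[i])) {
                fprintf(stderr, "%s: invalid value %f at index %zu\n",
                        name, data[i], i);
                return false;
            }
        }
        return true;
    }
    ```
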
  2. commit d5c938c
  3. server : do not get prompt in infill mode (ggerganov#7286)

    * avoid getting the prompt in infill mode and embedding mode
    
    * remove embedding mode
    
    * refactor format
    
    ---------
    
    Co-authored-by: wudexiang <[email protected]>
    woodx9 and wudexiang authored Jun 7, 2024
    commit a5cabd7
  4. commit 7027b27
  5. cmake : fix BUILD_SHARED_LIBS=ON build (ggerganov#7784)

    common depends on pthreads on Linux
    intelmatt authored Jun 7, 2024
    commit 27615f5
  6. commit c00fad7
  7. vulkan : reuse parent extra for views (ggerganov#7806)

    * vulkan : reuse parent extra for views
    
    * Fix validation error when multiple compute contexts are used in a graph
    
    ---------
    
    Co-authored-by: 0cc4m <[email protected]>
    slaren and 0cc4m authored Jun 7, 2024
    commit da799b4

Commits on Jun 8, 2024

  1. server : smart slot selection using Longest Common Prefix (ggerganov#7728)
    
    * server : Smart selection of available slot using Longest Common Substring
    
    * add usage
    
    * remove trailing whitespaces
    
    * Use Longest Common Prefix (LCP) instead of LCS
    
    * Rename argument
    sasha0552 authored Jun 8, 2024
    commit 7a16ce7
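
    A minimal sketch (hypothetical types, not the server's actual code) of
    Longest Common Prefix slot selection: prefer the slot whose cached tokens
    share the longest prefix with the incoming prompt, so the most KV-cache
    work can be reused:

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using llama_token = int32_t;

    // Length of the common prefix of two token sequences.
    size_t common_prefix(const std::vector<llama_token> & a,
                         const std::vector<llama_token> & b) {
        size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i]) i++;
        return i;
    }

    // Pick the slot with the longest common prefix against the prompt.
    int pick_slot(const std::vector<std::vector<llama_token>> & slot_cache,
                  const std::vector<llama_token> & prompt) {
        int best = 0; size_t best_lcp = 0;
        for (int s = 0; s < (int) slot_cache.size(); ++s) {
            const size_t lcp = common_prefix(slot_cache[s], prompt);
            if (lcp > best_lcp) { best_lcp = lcp; best = s; }
        }
        return best;
    }
    ```
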
  2. url: save -mu downloads to new cache location (ggerganov#7826)

    * url: save -mu download to new cache location
    
    * url: fs_get_cache_file_path util
    
    * url: tweak sig of fs_get_cache_file
    ochafik authored Jun 8, 2024
    commit d4d915d
  3. commit fe1e391

Commits on Jun 9, 2024

  1. gguf-py : decouple adding metadata from writing in GGUFWriter (ggerganov#7827)
    
    The main change of this PR is to consolidate GGUFWriter.add_key and
    GGUFWriter.add_val into GGUFWriter.add_key_value.

    In addition, use_temp_file is now opt-in instead of opt-out, defaulting to
    False.

    Also, GGUFWriter no longer requires an output file name until actually
    writing to it, and it no longer needs to eagerly prepare the data layout
    of the metadata.
    compilade authored Jun 9, 2024
    commit ed9f252
  2. convert-hf : match model part name prefix and suffix (ggerganov#7687)

    In ggerganov#7075, to fix the conversion of (some) models using
    model-00001-of-00001.safetensors instead of model.safetensors for a single
    model part, we simply used the same logic as the part count to get the
    part names.

    But this doesn't always work correctly, like when unusual additional model
    files such as consolidated.safetensors in
    https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present.

    This commit matches both the prefix and the suffix of the model part
    names, which should fix the problem without breaking any
    previously-supported upstream models. According to a report by
    @teleprint-me there is still some persistent problem, but this shall do in
    the meantime.
    compilade authored Jun 9, 2024
    commit 5795b94
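
    A minimal sketch, in C++ for illustration (the conversion script itself is
    Python), of matching a part name by both prefix and suffix; the exact
    prefix/suffix strings are assumptions:

    ```cpp
    #include <string>

    // "model-00001-of-00001.safetensors" matches; an unrelated file such as
    // "consolidated.safetensors" does not, because its prefix differs.
    bool is_model_part(const std::string & name) {
        const std::string prefix = "model-";
        const std::string suffix = ".safetensors";
        return name.size() > prefix.size() + suffix.size()
            && name.compare(0, prefix.size(), prefix) == 0
            && name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    }
    ```
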
  3. convert-hf : set the model name based on cli arg, if present (ggerganov#7693)
    
    The `--model-name` argument was added a while ago but did not do anything.
    This commit fixes that and enables the feature.
    sasha0552 authored Jun 9, 2024
    commit 2decf57
  4. commit 42b53d1
  5. commit 3e2ee44
  6. docs: Added initial PR template with directions for doc only changes and squash merges [no ci] (ggerganov#7700)
    
    This commit adds pull_request_template.md and CONTRIBUTING.md. It focuses on explaining to contributors the need to rate PR complexity level, when to add [no ci], and how to format PR titles and descriptions.
    
    Co-authored-by: Brian <[email protected]>
    Co-authored-by: compilade <[email protected]>
    3 people authored Jun 9, 2024
    commit 57bf62c
  7. commit e95beeb
  8. flake.lock: Update (ggerganov#7838)

    Flake lock file updates:
    
    • Updated input 'nixpkgs':
        'github:NixOS/nixpkgs/ad57eef4ef0659193044870c731987a6df5cf56b?narHash=sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs%3D' (2024-05-29)
      → 'github:NixOS/nixpkgs/051f920625ab5aabe37c920346e3e69d7d34400e?narHash=sha256-4q0s6m0GUcN7q%2BY2DqD27iLvbcd1G50T2lv08kKxkSI%3D' (2024-06-07)
    
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
    ggerganov and github-actions[bot] authored Jun 9, 2024
    commit 10ceba3

Commits on Jun 10, 2024

  1. commit af4ae50
  2. CUDA: use tensor cores for MMQ (ggerganov#7676)

    * CUDA: int8 tensor cores for MMQ (legacy quants)
    
    * fix out-of-bounds writes
    
    * __builtin_assume -> GGML_CUDA_ASSUME
    
    * fix writeback returning too early
    JohannesGaessler authored Jun 10, 2024
    commit 1f0dabd
  3. commit d9da0e4
  4. commit c28a839
  5. commit fd5ea0f
  6. commit 864a99e

Commits on Jun 11, 2024

  1. json: document schema conversion in GBNF readme, align manual grammar examples & converters (ggerganov#7841)
    
    * json: fix char pattern in grammar converters
    
    * json: prevent number precision & whitespace runaways in example grammars
    
    * json: add doc to grammar readme
    ochafik authored Jun 11, 2024
    commit 396b18d
  2. commit b61eb96
  3. fix CUDA CI by using a windows-2019 image (ggerganov#7861)

    * try to fix CUDA ci with --allow-unsupported-compiler
    
    * trigger when build.yml changes
    
    * another test
    
    * try exllama/bdashore3 method
    
    * install vs build tools before cuda toolkit
    
    * try win-2019
    slaren authored Jun 11, 2024
    commit c2ce6c4
  4. commit bdcb8f4
  5. commit 4bfe50f
  6. commit 148995e
  7. commit 6fe42d0
  8. fix broken link in pr template (ggerganov#7880) [no ci]

    * fix broken link in pr template
    
    * Update pull_request_template.md [no ci]
    
    ---------
    
    Co-authored-by: Brian <[email protected]>
    deven367 and mofosyne authored Jun 11, 2024
    commit 14f8352
  9. Update Vulkan RoPE implementation (ggerganov#7818)

    * Update Vulkan RoPE implementation
    
    * Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception
    
    Minor fixes
    
    * Fix segfault when running out of VRAM
    
    Co-authored-by: slaren <[email protected]>
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    0cc4m and slaren authored Jun 11, 2024
    commit ef52d1d
  10. commit 73bac2b

Commits on Jun 12, 2024

  1. Fix a typo and add Fedora 40 package to install for Vulkan (ggerganov#7794) [no ci]
    
    Fix "appropiate" to "appropriate" and add Fedora 40 packages to install to compile with Vulkan support
    metal3d authored Jun 12, 2024
    commit f2b5764
  2. update intel docker oneapi-basekit to 2024.1.1-devel-ubuntu22.04 (ggerganov#7894)
    
    In addition, this reverts a workaround for the upstream issue with expired Intel GPG package keys in 2024.0.1-devel-ubuntu22.04.
    airMeng authored Jun 12, 2024
    commit dcf7527
  3. commit 704a35b
  4. ggml : improve ggml_is_contiguous logic (ggerganov#7856)

    * ggml : improve ggml_is_contiguous logic
    
    ggml-ci
    
    * ggml : support more contiguous cases
    
    ggml-ci
    ggerganov authored Jun 12, 2024
    commit bfaa676
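
    A minimal sketch (assuming ggml's 4-dimensional tensors, and glossing over
    quantized block types) of the basic contiguity rule: each stride must
    equal the previous stride times the previous dimension, starting from the
    element size:

    ```cpp
    #include <cstddef>
    #include <cstdint>

    struct tensor4d {
        int64_t ne[4];     // elements per dimension
        size_t  nb[4];     // stride in bytes per dimension
        size_t  elem_size; // bytes per element
    };

    bool is_contiguous(const tensor4d & t) {
        if (t.nb[0] != t.elem_size) return false;
        for (int i = 1; i < 4; ++i) {
            if (t.nb[i] != t.nb[i - 1] * (size_t) t.ne[i - 1]) return false;
        }
        return true;
    }
    ```
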
  5. tests : add non-cont unary tests (ggerganov#7857)

    * tests : add non-cont unary tests
    
    * ggml : update unary asserts and "supports_op"
    
    ggml-ci
    ggerganov authored Jun 12, 2024
    commit a9cae48
  6. commit 9635529
  7. build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (ggerganov#7809)
    
    * `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew
    
    * server: update refs -> llama-server
    
    gitignore llama-server
    
    * server: simplify nix package
    
    * main: update refs -> llama
    
    fix examples/main ref
    
    * main/server: fix targets
    
    * update more names
    
    * Update build.yml
    
    * rm accidentally checked in bins
    
    * update straggling refs
    
    * Update .gitignore
    
    * Update server-llm.sh
    
    * main: target name -> llama-cli
    
    * Prefix all example bins w/ llama-
    
    * fix main refs
    
    * rename {main->llama}-cmake-pkg binary
    
    * prefix more cmake targets w/ llama-
    
    * add/fix gbnf-validator subfolder to cmake
    
    * sort cmake example subdirs
    
    * rm bin files
    
    * fix llama-lookup-* Makefile rules
    
    * gitignore /llama-*
    
    * rename Dockerfiles
    
    * rename llama|main -> llama-cli; consistent RPM bin prefixes
    
    * fix some missing -cli suffixes
    
    * rename dockerfile w/ llama-cli
    
    * rename(make): llama-baby-llama
    
    * update dockerfile refs
    
    * more llama-cli(.exe)
    
    * fix test-eval-callback
    
    * rename: llama-cli-cmake-pkg(.exe)
    
    * address gbnf-validator unused fread warning (switched to C++ / ifstream)
    
    * add two missing llama- prefixes
    
    * Updating docs for eval-callback binary to use new `llama-` prefix.
    
    * Updating a few lingering doc references for rename of main to llama-cli
    
    * Updating `run-with-preset.py` to use new binary names.
    Updating docs around `perplexity` binary rename.
    
    * Updating documentation references for lookup-merge and export-lora
    
    * Updating two small `main` references missed earlier in the finetune docs.
    
    * Update apps.nix
    
    * update grammar/README.md w/ new llama-* names
    
    * update llama-rpc-server bin name + doc
    
    * Revert "update llama-rpc-server bin name + doc"
    
    This reverts commit e474ef1.
    
    * add hot topic notice to README.md
    
    * Update README.md
    
    * Update README.md
    
    * rename gguf-split & quantize bins refs in **/tests.sh
    
    ---------
    
    Co-authored-by: HanClinto <[email protected]>
    ochafik and HanClinto authored Jun 12, 2024
    commit 1c641e6

Commits on Jun 13, 2024

  1. move BLAS to a separate backend (ggerganov#6210)

    * move BLAS to a separate backend
    
    * rename GGML_USE_OPENBLAS to GGML_USE_BLAS
    
    * alloc : reuse same buffer when the same buffer type is used multiple times
    
    * set number of threads automatically for openblas and blis
    
    * sched : print assignments when GGML_SCHED_DEBUG env variable is set
    
    * sched : allow ops with weights on an incompatible buffer type
    
    This will cause the weight to be copied to a backend that supports the
    op, which is very costly. The weight should have been stored in a buffer
    of a backend that can run the op, but llama.cpp cannot do this
    automatically at the moment.
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    slaren and ggerganov authored Jun 13, 2024
    commit f578b86
  2. commit a55eb1b
  3. commit 172c825

Commits on Jun 14, 2024

  1. convert : add Poro-34B-chat tokenizer support (ggerganov#7713)

    * support for Poro chat pre-tokenizer
    
    * add support for Poro pre-tokenizer
    
    * Update convert-hf-to-gguf-update.py
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Change Poro-34B-chat to poro-chat
    
    * Change Poro-34B-chat to poro-chat
    
    * Update convert-hf-to-gguf-update.py
    
    * Update llama.cpp
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    ezosa and ggerganov authored Jun 14, 2024
    commit 41b9260
  2. llama : more checks before assuming FIM tokens (ggerganov#7644)

    * More checks before assuming FIM tokens for Llama arch
    
    * extensive token check
    CISC authored Jun 14, 2024
    commit 6fcd133
  3. llama-bench : fix RPC indication (ggerganov#7936)

    Show "<backend_name>+RPC" when RPC offloading is used
    rgerganov authored Jun 14, 2024
    commit e65bbf6
  4. commit 66ef1ce
  5. CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (ggerganov#7921)

    * CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
    
    * try CI fix
    
    * try CI fix
    
    * try CI fix
    
    * fix data race
    
    * revert q2_K precision related changes
    JohannesGaessler authored Jun 14, 2024
    commit 76d66ee
  6. ci : fix macos x86 build (ggerganov#7940)

    To keep the old x86 `macos-latest` behavior we should use `macos-12`
    
    Potentially will fix: ggerganov#6975
    olexiyb authored Jun 14, 2024
    commit f8ec887

Commits on Jun 15, 2024

  1. [SYCL] remove global variables (ggerganov#7710)

    * separate DPCT helpers outside
    
    * replace global variables with context
    
    * remove useless extra
    
    * update mul_mat condition
    
    * remove duplicate buft initialization
    
    * remove duplicate extra and global work group size
    
    * remove useless backend check
    
    * remove duplicated extras
    
    * use macro for group_size and remove cuda-related
    airMeng authored Jun 15, 2024
    commit 7b2f4a7
  2. Add cvector-generator example (ggerganov#7514)

    * add control-vector-generator
    
    * calc diff
    
    * add comments
    
    * proof-of-concept stdlib implementation
    
    Implements PCA and file writing using mostly standard libraries. The output is recognized as a functional control vector, but outputs gibberish.
    
    * param parsing, refactor, comments
    
    Added basic command-line parameters for outfile and one each positive/negative prompt.
    
    Refactored some messy code in PCA computation and GGUF exporting.
    
    Left a bunch of comments regarding further work needed.
    
    * example template completions
    
    Implements an example template set built from the positive/negative prompts like the control vector Python implementation.
    
    * add multi prompts, multi-thread for PCA
    
    * fix mem error
    
    * add debugs
    
    * fix matrix transpose multiplication
    
    you have got to be kidding me
    
    * preliminary template/multiprompt support
    
    model is running out of context and that ought to be fixed (segfaulting) but other than that it looks goodish
    
    * fix zero output & param parsing, functional templating
    
    fixed a bug where the output file had no tensor data/was all zero
    
    fixed a bug where single hyphen flags were not being correctly parsed
    
    implements creation of templated prompts from input (still need to adapt based on model)
    
    * fix square_diff matmul index range and CRLF->LF line endings
    
    fixed a logic error where square_diff would not multiply all rows
    
    fixed a formatting error where the provided completions.txt had CRLF line endings
    
    * add command-line args for num threads, num completions file lines, always reload model
    
    refactored a few things and did what the commit message says on the tin
    
    * code aestheticization
    
    * fix compiler warnings
    
    * in-series multithreading for prompt embedding?
    
    added commented-out code to attempt to start implementing multithreading for embedding in main
    
    * remove unnecessary multithreading
    
    * interim fix memory leak
    
    * translated everything but PCA (I think)
    
    * tentatively translate the rest
    
    * fix ggml errors and make new ones
    
    at least it compiles and runs
    
    * fix cb_eval
    
    * temporary commit while I move dev environments
    
    it finally outputs a functioning control vector - "functioning" in the sense that it can be loaded and it clearly has the right idea, but makes the model incoherent
    
    * update debug statements
    
    * pre-tokenize so we can allocate correct memory to ctx_diffs_wrapped
    
    * update comments
    
    * (wip) refactor
    
    * clean up PCA ggml implementation
    
    * fix shape of v_diff_original
    
    * add n_batch for pca
    
    * working version
    
    * remember to copy back the last_eigenvector
    
    * fix n_completions
    
    * bring back n_completions
    
    * default n_pca_batch to 20
    
    * fix macos build
    
    * add to makefile all targets
    
    * use ggml_format_name
    
    * add readme
    
    * fix .editorconfig
    
    * use ggml_backend_tensor_copy
    
    * attempt to fix compile problem on mac
    
    * fix compile warn
    
    * reuse allocr
    
    * move param parser to common
    
    * better error handling
    
    * clean up a bit
    
    * add print_usage
    
    * shorten help msg
    
    * beautify help msg
    
    * escape prompt by default
    
    * change compile target to llama-cvector-generator
    
    * typo
    
    * disable GPU for PCA
    
    * code style
    
    ---------
    
    Co-authored-by: Christian Zhou-Zheng <[email protected]>
    ngxson and christianazinn authored Jun 15, 2024
    commit 0c7b359
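
    A minimal sketch (assuming a power-iteration approach; not the example's
    actual ggml-based code) of the PCA step at the core of cvector-generator:
    repeatedly apply v <- normalize(A v) to approximate the dominant
    eigenvector of a covariance matrix A:

    ```cpp
    #include <cmath>
    #include <vector>

    // Power iteration: returns an approximation of A's dominant eigenvector.
    std::vector<float> power_iteration(const std::vector<std::vector<float>> & A,
                                       int iters = 100) {
        const size_t n = A.size();
        std::vector<float> v(n, 1.0f / std::sqrt((float) n));
        for (int it = 0; it < iters; ++it) {
            std::vector<float> w(n, 0.0f);
            for (size_t i = 0; i < n; ++i)
                for (size_t j = 0; j < n; ++j)
                    w[i] += A[i][j] * v[j];
            float norm = 0.0f;
            for (float x : w) norm += x * x;
            norm = std::sqrt(norm);
            for (size_t i = 0; i < n; ++i) v[i] = w[i] / norm;
        }
        return v;
    }
    ```
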

Commits on Jun 16, 2024

  1. Vulkan Shader Refactor, Memory Debugging Option (ggerganov#7947)

    * Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory
    
    * Improve debug log code
    
    * Add memory debug output option
    
    * Fix flake8
    
    * Fix unnecessary high llama-3 VRAM use
    0cc4m authored Jun 16, 2024
    commit 7c7836d
  2. github : update pr template

    ggerganov committed Jun 16, 2024
    commit c8a8219
  3. commit cddaf02
  4. commit 6fe1c62
  5. unicode : avoid char32_t (ggerganov#7957)

    ggml-ci
    ggerganov authored Jun 16, 2024
    commit 5239925
  6. commit bc6c457
  7. ggml : remove duplicate include of ggml-common.h (ggml/853)

    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored and ggerganov committed Jun 16, 2024
    commit 398105f
  8. ggml : fix and optimize ppc64le (ggml/849)

    * fix compile issues introduced by loongarch_asx
    
    * restore quant changes to merge
    
    * fix compile issues introduced by loongarch_asx
    
    * further optimize by using vec_msum & vec_sum4s on ppc64le
    penghongbo authored and ggerganov committed Jun 16, 2024
    commit b5fcf8e
  9. cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231)

    * cuda : fix bounds check for src0 rows in MMVQ kernel
    
    * Update ggml-cuda/mmvq.cu
    
    Co-authored-by: Johannes Gäßler <[email protected]>
    
    ---------
    
    Co-authored-by: Johannes Gäßler <[email protected]>
    ggerganov and JohannesGaessler committed Jun 16, 2024
    commit 19b7a83
  10. Add support for sqrt on CUDA (ggerganov#7953)

    * cuda sqrt support
    
    * enable cuda in pca
    
    * fix comments in pca
    
    * add test
    
    * add sqrt to ggml_backend_cuda_supports_op
    
    * fix test
    
    * new line
    
    * Use F32 sqrtf instead of F64 sqrt
    
    Co-authored-by: Johannes Gäßler <[email protected]>
    
    ---------
    
    Co-authored-by: Johannes Gäßler <[email protected]>
    calvin-laurenson and JohannesGaessler authored Jun 16, 2024
    commit 43b35e3
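
    A minimal CPU reference (assuming f32, elementwise semantics) of what the
    new sqrt op computes; the commit itself adds the CUDA kernel and registers
    the op in ggml_backend_cuda_supports_op:

    ```cpp
    #include <cmath>
    #include <cstdint>

    // Elementwise square root using F32 sqrtf, per the review note above.
    void sqrt_f32(const float * src, float * dst, int64_t n) {
        for (int64_t i = 0; i < n; ++i) {
            dst[i] = sqrtf(src[i]); // negative inputs yield NaN
        }
    }
    ```
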

Commits on Jun 17, 2024

  1. [SYCL] Update README-sycl.md for Chapter "Recommended release" and "News" (ggerganov#7946)
    
    * Update README-sycl.md
    
    * Update README-sycl.md
    
    * Update README-sycl.md
    
    * Update README-sycl.md
    arthw authored Jun 17, 2024
    commit df68d4f
  2. gguf-dump.py: add --markdown dump output (ggerganov#7853)

    * gguf-dump.py: add --markdown dump output
    
    * gguf-dump.py: Add toc
    
    * gguf-dump.py: use standard tensor name lookup. Also add tensor ID field
    
    * gguf-dump.py: Add tensor overview count
    
    * gguf-dump.py: fix array preview
    
    * gguf-dump.py: markdownTableWithAlignmentSupport() added
    
    * Add type hints and spacing
    
    Co-authored-by: compilade <[email protected]>
    
    * gguf-dump.py: prettify dimension
    
    * gguf-dump: right align element count
    
    * gguf-dump.py: element count autosizing
    
    * Apply suggestions from code review
    
    Co-authored-by: compilade <[email protected]>
    
    ---------
    
    Co-authored-by: compilade <[email protected]>
    mofosyne and compilade authored Jun 17, 2024
    commit 006167a
  3. commit 21be9ca
  4. Implement non-mapped async IO for CUDA on Windows. (ggerganov#7896)

    * Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by >3x while it should be the same (or slightly faster) on any other drive.
    
    * Free resources except for backend.
    
    * Change assertions to exceptions in llama_file, find correct cuda backend to create CUDA resources and respect the use_mmap flag again for CUDA.
    
    * Apply suggestions from code review
    
    Co-authored-by: slaren <[email protected]>
    
    * Fix editorconfig and unused variable
    
    * Fix issues with Windows build
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    mtavenrath and slaren authored Jun 17, 2024
    commit 6a2f0b3
  5. fix: divide 0 exception in mamba (ggerganov#7932)

    Signed-off-by: thxCode <[email protected]>
    thxCode authored Jun 17, 2024
    commit c637fcd
  6. commit 99052cd
  7. commit b473e95
  8. commit 7c26775
  9. commit 5b6da18
  10. update: support Qwen2-57B-A14B (ggerganov#7835)

    * update: convert-hf-to-gguf.py to support Qwen2-57B-A14B
    
    * fix: QWEN2MOE support for expert_feed_forward_length
    
    previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH
    
    n_ff_exp and n_ff_shared_exp are now properly calculated
    
    * update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM
    
    * fix: QWEN2MOE support for expert_feed_forward_length
    
    previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH
    
    n_ff_exp and n_ff_shexp are now properly calculated
    legraphista authored Jun 17, 2024
    commit a94e6ff

Commits on Jun 18, 2024

  1. whisper : use ggml_backend_sched (whisper/2239)

    * whisper : use ggml_backend_sched (wip)
    
    * use sched in whisper_allocr
    
    * whisper : single backend in whisper_context
    
    * whisper : remove whisper_state->backends_used
    
    * whisper : remove whisper_context->backend
    
    * whisper : reset scheduler after init
    
    * whisper : fix external encoder (e.g. CoreML)
    
    * whisper : cleanup
    
    * whisper : handle null GPU buffer types + fix sycl
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    ggerganov and slaren committed Jun 18, 2024
    commit e6ecc2b
  2. ggml : sync

    ggerganov committed Jun 18, 2024
    commit 5326bcc
  3. commit 1193778
  4. chore: clean useless beam search param (ggerganov#7985)

    Signed-off-by: thxCode <[email protected]>
    thxCode authored Jun 18, 2024
    commit b96f9af
  5. Allow compiling with CUDA without CUDA runtime installed (ggerganov#7989)
    
    On hosts which are not prepared/dedicated to execute code using CUDA
    it is still possible to compile llama.cpp with CUDA support by just
    installing the development packages.  Missing are the runtime
    libraries like /usr/lib64/libcuda.so* and currently the link step
    will fail.
    
    The development environment is prepared for such situations.  There
    are stub libraries for all the CUDA libraries available in the
    $(CUDA_PATH)/lib64/stubs directory.  Adding this directory to the end
    of the search path will not change anything for environments which
    currently work fine but will enable compiling llama.cpp also in case
    the runtime code is not available.
    drepper authored Jun 18, 2024
    commit 6166527
  6. commit 84f6de1
  7. Only use FIM middle token if it exists (ggerganov#7648)

    * Only use FIM middle if it exists
    
    * Only use FIM middle if it exists
    CISC authored Jun 18, 2024
    commit 91c188d
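
    A minimal, self-contained sketch of the guard described above; treating -1
    as "token absent" is an assumption for illustration:

    ```cpp
    #include <cstdint>
    #include <vector>

    using llama_token = int32_t;

    // Only append the FIM middle token when the vocabulary defines one.
    void append_fim_middle(std::vector<llama_token> & tokens, llama_token middle_id) {
        if (middle_id >= 0) { // assumed: -1 means no FIM middle token
            tokens.push_back(middle_id);
        }
    }
    ```
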
  8. tokenizer : BPE fixes (ggerganov#7530)

    * Random test: add_bos_token, add_eos_token
    * Random test: add BPE models for testing
    * Custom regex split fails with codepoint 0
    * Fix falcon punctuation regex
    * Refactor llm_tokenizer_bpe: move code to constructor
    * Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
    * Move tokenizer flags to vocab structure.
    * Default values for special_add_bos/eos
    * Build vocab.special_tokens_cache using vocab token types
    * Generalize 'jina-v2' per token attributes
    * Fix unicode whitespaces (deepseek-coder, deepseek-llm)
    * Skip missing byte tokens (falcon)
    * Better unicode data generation
    * Replace char32_t with uint32_t
    jaime-m-p authored Jun 18, 2024
    commit 37bef89

Commits on Jun 19, 2024

  1. [SYCL] refactor (ggerganov#6408)

    * separate lower precision GEMM from the main files
    
    * fix workgroup size hardcode
    airMeng authored Jun 19, 2024
    commit 623494a
  2. commit a04a953
  3. commit 9c77ec1
  4. un-ignore build-info.cmake and build-info.sh (ggerganov#7996)

    * un-ignore `build-info.cmake` and `build-info.sh`
    
    I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files nonexistent, even if they're committed, for the purpose of publishing. This leads to the build failing in such cases.
    
    * un-ignore `build-info.cpp.in`
    
    For the same reason as the previous two files.
    
    * Reorganize `.gitignore`
    
    * Add exceptions for files mentioned by @slaren
    
    I did leave .clang-tidy since it was explicitly ignored before.
    
    * Add comments for organization
    * Sort some lines for pretty
    * Test with `make` and `cmake` builds to ensure no build artifacts might be committed
    
    * Remove `.clang-tidy` from `.gitignore`
    
    Per comment by @ggerganov
    
    * Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
    mdegans authored Jun 19, 2024
    commit a785474
  5. commit ba58993

Commits on Jun 20, 2024

  1. metal : fix ggml_metal_supports_op for BF16 (ggerganov#8021)

    Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks whether the first few source types are BF16 and returns false if so.
    mdegans authored Jun 20, 2024
    commit 2075a66
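
    A minimal sketch (hypothetical types, not the actual ggml structs) of the
    supports_op guard: report an op as unsupported when any of its sources is
    BF16, since this backend has no BF16 kernels:

    ```cpp
    enum tensor_type { TYPE_F32, TYPE_F16, TYPE_BF16 };

    struct tensor {
        tensor_type type;
        const tensor * src[4];
    };

    bool metal_supports_op(const tensor & op) {
        for (int i = 0; i < 4; ++i) {
            if (op.src[i] && op.src[i]->type == TYPE_BF16) {
                return false; // no BF16 kernels on this backend
            }
        }
        return true; // the real check also dispatches on the op kind
    }
    ```
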
  2. CUDA: stream-k decomposition for MMQ (ggerganov#8018)

    * CUDA: stream-k decomposition for MMQ
    
    * fix undefined memory reads for small matrices
    JohannesGaessler authored Jun 20, 2024
    commit d50f889
  3. [SYCL] Fix windows build and inference (ggerganov#8003)

    * add sycl preset
    
    * fix debug link error. fix windows crash
    
    * update README
    luoyu-intel authored Jun 20, 2024
    commit de391e4
  4. common: fix warning (ggerganov#8036)

    * common: fix warning
    
    * Update common/common.cpp
    
    Co-authored-by: slaren <[email protected]>
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    JohannesGaessler and slaren authored Jun 20, 2024
    commit abd894a
  5. commit 17b291a
  6. commit b1ef562

Commits on Jun 21, 2024

  1. commit 0e64591
  2. llama : allow pooled embeddings on any model (ggerganov#7477)

    * create append_pooling operation; allow to specify attention_type; add last token pooling; update examples
    
    * find result_norm/result_embd tensors properly; update output allocation logic
    
    * only use embd output for pooling_type NONE
    
    * get rid of old causal_attn accessor
    
    * take out attention_type; add in llama_set_embeddings
    
    * bypass logits when doing non-NONE pooling
    iamlemec authored Jun 21, 2024
    commit 80ea089
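
    A minimal sketch (assuming row-major [n_tokens x n_embd] f32 data) of two
    of the pooling modes involved here, mean pooling and last-token pooling:

    ```cpp
    #include <cstddef>
    #include <vector>

    // Average the per-token embeddings into one vector.
    std::vector<float> pool_mean(const float * embd, int n_tokens, int n_embd) {
        std::vector<float> out(n_embd, 0.0f);
        for (int t = 0; t < n_tokens; ++t) {
            for (int i = 0; i < n_embd; ++i) {
                out[i] += embd[(size_t) t * n_embd + i] / n_tokens;
            }
        }
        return out;
    }

    // Take the embedding of the last token only.
    std::vector<float> pool_last(const float * embd, int n_tokens, int n_embd) {
        const float * last = embd + (size_t) (n_tokens - 1) * n_embd;
        return std::vector<float>(last, last + n_embd);
    }
    ```
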
  3. commit a927b0f
  4. ggml : AVX IQ quants (ggerganov#7845)

    * initial iq4_xs
    
    * fix ci
    
    * iq4_nl
    
    * iq1_m
    
    * iq1_s
    
    * iq2_xxs
    
    * iq3_xxs
    
    * iq2_s
    
    * iq2_xs
    
    * iq3_s before sllv
    
    * iq3_s
    
    * iq3_s small fix
    
    * iq3_s sllv can be safely replaced with sse multiply
    netrunnereve authored Jun 21, 2024
    commit 7d5e877
  5. vulkan: detect multiple devices by deviceUUID instead of deviceID (ggerganov#8022)
    
    * vulkan: detect multiple devices by deviceUUID instead of deviceID
    
    * vulkan: remove unneeded variables
    
    * vulkan: fix id query
    Adriankhl authored Jun 21, 2024
    commit 557b653
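
    A minimal sketch (assuming a Vulkan 1.1+ instance; error handling omitted)
    of telling physical devices apart by deviceUUID, since two identical GPUs
    share a deviceID but not a deviceUUID:

    ```cpp
    #include <vulkan/vulkan.h>
    #include <cstring>

    bool same_device(VkPhysicalDevice a, VkPhysicalDevice b) {
        VkPhysicalDeviceIDProperties ida{VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES};
        VkPhysicalDeviceIDProperties idb{VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES};
        VkPhysicalDeviceProperties2 pa{VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2, &ida};
        VkPhysicalDeviceProperties2 pb{VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2, &idb};
        vkGetPhysicalDeviceProperties2(a, &pa);
        vkGetPhysicalDeviceProperties2(b, &pb);
        return std::memcmp(ida.deviceUUID, idb.deviceUUID, VK_UUID_SIZE) == 0;
    }
    ```
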

Commits on Jun 22, 2024

  1. JSON Schema to GBNF integration tests (ggerganov#7790)

    * Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.
    
    * Adding additional examples as documented in ggerganov#7789. Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
    
    * Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.
    
    * Merging improved schema test methods added by @ochafik in ggerganov#7797
    
    * Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.
    
    * Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.
    
    * Fixing grammar indentation to be consistent throughout file.
    HanClinto authored Jun 22, 2024
    commit c5a8d4b
  2. commit 5b48cd5
  3. commit 3aa184a
  4. cvector-generator: Moe Moe Fixie-Fixie for Lots of Formats~! ♡(ᐢ ᴥ ᐢ)♡ (ggerganov#8052)
    
    * Update negative.txt
    
    * Update positive.txt
    
    * Update cvector-generator.cpp
    
    * Update cvector-generator.cpp
    HatsuneMikuUwU33 authored Jun 22, 2024
    commit adf480c
  5. cvector: fix CI + correct help message (ggerganov#8064)

    * cvector: fix CI + correct help message
    
    * also correct --pca-iter
    ngxson authored Jun 22, 2024
    commit 3e58b0e
  6. commit b5a5f34

Commits on Jun 23, 2024

  1. Refactor Vulkan backend to allow multiple contexts (ggerganov#7961)

    * Refactor Vulkan backend to allow multiple contexts
    
    * Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs
    
    * Fix Vulkan debug build error
    0cc4m authored Jun 23, 2024
    commit 45c0e2e
  2. fix CI failures (ggerganov#8066)

    * test-backend-ops : increase cpy max nmse
    
    * server ci : disable thread sanitizer
    slaren authored Jun 23, 2024
    commit b6b9a8e
  3. commit 11318d9
  4. commit 6a2f298
  5. llama : add support for BitnetForCausalLM (ggerganov#7931)

    * hf bitnet v1
    
    * hf bitnet e2e v2
    
    * finish bitnet e2e
    
    * finish f16 hf bitnet e2e
    
    * remove unused
    
    * finish bitnet i2 e2e
    
    * move i2s to quantize v1
    
    * move i2 to quantize
    
    * clean code
    
    * clean code 2
    
    * fix codestyle
    
    * fix code
    
    * fix
    
    * fix code
    
    * fix merge
    
    * remove unused
    
    * change table name
    
    * fix whitespace
    
    * delete redundant
    
    * i2_s to absmax
    
    * finish i2_s/i8_s vec_dot x86 simd
    
    * i2s->q22
    
    * fix code
    
    * remove block scale
    
    * add dequantize
    
    * fix seq
    
    * update avx2
    
    * remove q2_2
    
    * remove q22_grid
    
    * fix whitespace
    
    * reuse llm_build_kv
    
    * fix bo
    
    ---------
    
    Co-authored-by: root <root@wangjinheng>
    Eddie-Wang1120 and root authored Jun 23, 2024
    commit e112b61

Commits on Jun 24, 2024

  1. ggml : remove ggml_task_type and GGML_PERF (ggerganov#8017)

    * ggml : remove ggml_task_type and GGML_PERF
    
    * check abort_callback on main thread only
    
    * vulkan : remove usage of ggml_compute_params
    
    * remove LLAMA_PERF
    slaren authored Jun 24, 2024
    commit 95f57bb
  2. commit 77beb4d