0.0.12
Released by github-actions on 22 Jan 20:04 · 853 commits to master since this release
Lots of fixes and tweaks. Main feature updates:
Model support:
- Basic LoRA support for MoE models
- Support for Orion models (also groundwork for other layernorm models)
- Support for loading/converting from Axolotl checkpoints
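LoRA adapts a frozen weight by adding a low-rank product, W' = W + (alpha/r)·B·A; for an MoE model the same update is applied per adapted expert projection. A minimal plain-Python sketch of the math (illustrative only, not this repo's implementation; all names are hypothetical):

```python
def lora_apply(W, A, B, alpha, r):
    """Apply a LoRA update: W' = W + (alpha / r) * B @ A.

    W: d_out x d_in base weight, B: d_out x r, A: r x d_in,
    given as nested lists (hypothetical sketch, plain-Python matmul).
    """
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
            for j in range(d_in)
        ]
        for i in range(d_out)
    ]
```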
Generation/sampling:
- Fused kernels enabled for num_experts = 4
- Option to return probs from streaming generator
- Add top-A sampling
- Add freq/pres penalties
- CFG support in streaming generator
- Disable flash-attn for non-causal attention (fixes left-padding until FA2 implements custom bias)
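Two of the new samplers are easy to state: top-A keeps tokens whose probability exceeds a·p_max², so the cutoff adapts to how peaked the distribution is, and the frequency/presence penalties follow the familiar OpenAI-style formulation. A standalone sketch (parameter names are illustrative, not the generator's actual API):

```python
def top_a_filter(probs, a):
    """Top-A: keep tokens with p >= a * p_max**2, then renormalize.
    probs is a plain list of probabilities indexed by token id."""
    p_max = max(probs)
    threshold = a * p_max * p_max
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}

def penalize(logits, counts, freq_pen, pres_pen):
    """Penalties: subtract freq_pen per prior occurrence of a token,
    plus a flat pres_pen if the token has appeared at all."""
    return [
        l - freq_pen * counts.get(i, 0) - pres_pen * (1 if counts.get(i, 0) else 0)
        for i, l in enumerate(logits)
    ]
```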
Testing/evaluation:
- HumanEval test
- Script to compare two models layer by layer (e.g. quantized vs. original model)
- "Standard" ppl test that attempts to mimic text-generation-webui
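The "standard" perplexity test ultimately reduces to exponentiating the mean per-token negative log-likelihood over a fixed tokenized corpus. A sketch of that reduction (how the log-probabilities are gathered is up to the harness; this is not the script's actual interface):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the
    evaluated tokens. token_logprobs holds the model's natural-log
    probability of each ground-truth token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```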
Conversion:
- VRAM optimizations
- Optimized quantization kernels
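At its core, a quantization kernel maps float weights onto a low-bit integer grid and records a scale for dequantization. A generic symmetric round-to-nearest sketch of that idea (not the repo's optimized CUDA kernels, which use a more elaborate scheme):

```python
def quantize_symmetric(weights, bits):
    """Symmetric round-to-nearest quantization: map floats onto a
    signed (bits)-bit integer grid, returning (ints, scale).
    Generic illustration only."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer grid."""
    return [v * scale for v in q]
```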
IO:
- Cache safetensors context managers for faster loading
- Optional direct IO loader (for very fast arrays)
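The caching idea behind reusing safetensors contexts is to keep each file's handle open across loads so repeated tensor reads skip the open/map cost. A hypothetical sketch using a plain memory-map cache (names and structure are illustrative, not the library's API):

```python
import mmap
import os

_mmap_cache = {}

def cached_mmap(path):
    """Return a cached read-only memory-map for path, creating it on
    first use. Later calls for the same path reuse the open mapping
    instead of reopening the file (hypothetical sketch)."""
    mm = _mmap_cache.get(path)
    if mm is None:
        fd = os.open(path, os.O_RDONLY)
        try:
            mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        finally:
            os.close(fd)  # the mmap holds its own duplicate descriptor
        _mmap_cache[path] = mm
    return mm
```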