0.0.12
Released by github-actions on 22 Jan 20:04 · 853 commits to master since this release
Lots of fixes and tweaks. Main feature updates:
Model support:
- Basic LoRA support for MoE models
- Support for Orion models (also groundwork for other layernorm models)
- Support for loading/converting from Axolotl checkpoints
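LoRA adapts a frozen weight by adding a low-rank product, W' = W + (alpha/r)·B·A; for an MoE model the same update is applied per adapted expert projection. A minimal plain-Python sketch of the math (illustrative only, not this repo's implementation; all names are hypothetical):

```python
def lora_apply(W, A, B, alpha, r):
    """Apply a LoRA update: W' = W + (alpha / r) * B @ A.

    W: d_out x d_in base weight, B: d_out x r, A: r x d_in,
    given as nested lists (hypothetical sketch, plain-Python matmul).
    """
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
            for j in range(d_in)
        ]
        for i in range(d_out)
    ]
```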
Generation/sampling:
- Fused kernels enabled for num_experts = 4
- Option to return probs from streaming generator
- Add top-A sampling
- Add freq/pres penalties
- CFG support in streaming generator
- Disable flash-attn for non-causal attention (fixes left-padding until FA2 implements custom bias)
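Two of the new samplers are easy to state: top-A keeps tokens whose probability exceeds a·p_max², so the cutoff adapts to how peaked the distribution is, and the frequency/presence penalties follow the familiar OpenAI-style formulation. A standalone sketch (parameter names are illustrative, not the generator's actual API):

```python
def top_a_filter(probs, a):
    """Top-A: keep tokens with p >= a * p_max**2, then renormalize.
    probs is a plain list of probabilities indexed by token id."""
    p_max = max(probs)
    threshold = a * p_max * p_max
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}

def penalize(logits, counts, freq_pen, pres_pen):
    """Penalties: subtract freq_pen per prior occurrence of a token,
    plus a flat pres_pen if the token has appeared at all."""
    return [
        l - freq_pen * counts.get(i, 0) - pres_pen * (1 if counts.get(i, 0) else 0)
        for i, l in enumerate(logits)
    ]
```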
Testing/evaluation:
- HumanEval test
- Script to compare two models layer by layer (e.g. quantized vs. original model)
- "Standard" ppl test that attempts to mimic text-generation-webui
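The "standard" perplexity test ultimately reduces to exponentiating the mean per-token negative log-likelihood over a fixed tokenized corpus. A sketch of that reduction (how the log-probabilities are gathered is up to the harness; this is not the script's actual interface):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the
    evaluated tokens. token_logprobs holds the model's natural-log
    probability of each ground-truth token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```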
Conversion:
- VRAM optimizations
- Optimized quantization kernels
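At its core, a quantization kernel maps float weights onto a low-bit integer grid and records a scale for dequantization. A generic symmetric round-to-nearest sketch of that idea (not the repo's optimized CUDA kernels, which use a more elaborate scheme):

```python
def quantize_symmetric(weights, bits):
    """Symmetric round-to-nearest quantization: map floats onto a
    signed (bits)-bit integer grid, returning (ints, scale).
    Generic illustration only."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer grid."""
    return [v * scale for v in q]
```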
IO:
- Cache safetensors context managers for faster loading
- Optional direct IO loader (for very fast arrays)
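The caching idea behind reusing safetensors contexts is to keep each file's handle open across loads so repeated tensor reads skip the open/map cost. A hypothetical sketch using a plain memory-map cache (names and structure are illustrative, not the library's API):

```python
import mmap
import os

_mmap_cache = {}

def cached_mmap(path):
    """Return a cached read-only memory-map for path, creating it on
    first use. Later calls for the same path reuse the open mapping
    instead of reopening the file (hypothetical sketch)."""
    mm = _mmap_cache.get(path)
    if mm is None:
        fd = os.open(path, os.O_RDONLY)
        try:
            mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        finally:
            os.close(fd)  # the mmap holds its own duplicate descriptor
        _mmap_cache[path] = mm
    return mm
```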