Add performance section to the README
clebert committed Oct 23, 2023
1 parent 32f865d commit 682ddc3
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -29,6 +29,10 @@ Suddenly, a kind man came by and saw Lily. He asked her what was wrong. Lily tol
achieved: 724.590 tok/s
```

## Performance

Even though the emphasis is on a simple, understandable, and clean implementation, the single-threaded performance is highly competitive with top-tier implementations, as shown by Aydyn Tairov's [benchmarks](https://engiware.com/benchmark/llama2-ports-extensive-benchmarks-mac-m1-max.html). The multi-threaded variant currently suffers from thread-spawning overhead: new threads are created for each matrix-vector multiplication, and the cost of spawning them appears to negate the potential performance gains. I have also aligned all vectors to the cache line, though without much measurable impact. Most vector operations use SIMD, and all functions dealing with floating-point arithmetic apply `@setFloatMode(.Optimized)`.
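
The SIMD pattern looks roughly like the following sketch. This is a minimal illustration, not the repository's actual code: the 8-lane vector width and the names `dot` and `vector_len` are assumptions made for the example.

```zig
const std = @import("std");

const vector_len = 8; // hypothetical SIMD width; real code may choose this per target

/// Dot product over f32 slices using Zig's portable SIMD vectors.
fn dot(a: []const f32, b: []const f32) f32 {
    @setFloatMode(.Optimized); // allow the optimizer to reassociate FP math in this function

    std.debug.assert(a.len == b.len and a.len % vector_len == 0);

    var acc: @Vector(vector_len, f32) = @splat(0);
    var index: usize = 0;

    while (index < a.len) : (index += vector_len) {
        // Load one chunk of each slice as a vector and accumulate lane-wise.
        const va: @Vector(vector_len, f32) = a[index..][0..vector_len].*;
        const vb: @Vector(vector_len, f32) = b[index..][0..vector_len].*;
        acc += va * vb;
    }

    return @reduce(.Add, acc); // horizontal sum of the accumulator lanes
}
```

Because `@setFloatMode(.Optimized)` is scoped to the function, the relaxed floating-point semantics stay local to the hot loops instead of applying to the whole program.
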

## Run Llama 2 7B from Hugging Face

Install `git-lfs` and clone the [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) model from Hugging Face:
