[V1] Simplify vision block hash for prefix caching by removing offset from hash #11646
This reuses the example from #11187.
Take a series of 3 blocks as an example: T0,T1,P00,P01 | P02,P03,P04,T2 | T3,P10,P11,P12, where Ti is the i-th text token and Pxy is the y-th placeholder token of the x-th image, so this prompt has 2 images (P0 and P1). Assuming the image hashes of P0 and P1 are aaa and bbb, respectively, and mm_positions=[(offset=2, length=5), (offset=9, length=3)], the hashes of the 3 blocks in #11187 are as follows:
However, the offset is redundant, and the following hash format is sufficient:
Simple proof:
We need to distinguish the above example from the following cases:
This hash format is almost as fast as the previous one, and it avoids confusion about why the offset needs to be part of the hash.
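To make the offset-free format concrete, here is a minimal sketch in Python. It is not the code in this PR: the helper names (`image_hashes_for_block`, `hash_blocks`), the placeholder token id `-1`, the text token ids, and the use of SHA-256 over a `repr` are all assumptions made for illustration. It assumes that each block hash chains on the parent block hash, covers the block's token ids, and adds the hash of every image that has at least one placeholder token in the block, with no offset.

```python
import hashlib
from typing import List, Optional, Sequence, Tuple

BLOCK_SIZE = 4  # matches the 4-token blocks in the example above


def image_hashes_for_block(
    block_start: int,
    block_end: int,
    mm_positions: Sequence[Tuple[int, int]],
    mm_hashes: Sequence[str],
) -> Tuple[str, ...]:
    """Hashes of all images with a placeholder in [block_start, block_end)."""
    keys = []
    for (offset, length), image_hash in zip(mm_positions, mm_hashes):
        # The image spans [offset, offset + length); keep it if it overlaps.
        if offset < block_end and offset + length > block_start:
            keys.append(image_hash)
    return tuple(keys)


def hash_blocks(
    token_ids: Sequence[int],
    mm_positions: Sequence[Tuple[int, int]],
    mm_hashes: Sequence[str],
    block_size: int = BLOCK_SIZE,
) -> List[str]:
    """Hash every full block as (parent hash, token ids, image hashes)."""
    hashes: List[str] = []
    parent: Optional[str] = None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        extra = image_hashes_for_block(start, end, mm_positions, mm_hashes)
        payload = repr((parent, tuple(token_ids[start:end]), extra))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


if __name__ == "__main__":
    # T0,T1,P00,P01 | P02,P03,P04,T2 | T3,P10,P11,P12
    # Hypothetical ids: text tokens are 100..103, placeholders are -1.
    tokens = [100, 101, -1, -1, -1, -1, -1, 102, 103, -1, -1, -1]
    positions = [(2, 5), (9, 3)]
    for block_hash in hash_blocks(tokens, positions, ["aaa", "bbb"]):
        print(block_hash[:16])
```

In this sketch, blocks 0 and 1 both carry `aaa` and block 2 carries `bbb`. Because each hash chains on its parent and covers the block's tokens, the position of the placeholders within the prompt is already encoded without an explicit offset.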
Benchmark results on H100 with the script in #11187: https://gist.github.com/comaniac/ea26df17fdffa533cf53d53b8455bc31. I ran each commit 3 times.
Note that the following format, which adds the image hash only to the first block of each image, is also correct but is a little slower (15.50 reqs/s). The code is here: 539e84c