Hi!
I was inquiring about the `layer_types` config you used in `tinyllama_lckv.json`. What was the intuition behind this config, given `num_attention_heads`, `num_key_value_heads`, and `num_hidden_layers`, when choosing `layer_types`, `forward_passes`, and `backward_passes`?
For `forward_passes` and `backward_passes`, there is no deeper intuition; the values come from empirical results. We find that regardless of model size and structure, `forward_passes=7` and `backward_passes=2` are the most efficient settings (the lowest training cost while maintaining performance; see the LCKV paper, Section 4.3 and Appendices C.2 and C.4).
For `layer_types`, the i-th integer specifies which layer's key-value pair the i-th layer uses as its KV cache. The config therefore corresponds to the w=2 setting in the LCKV paper. There are many possible design choices for `layer_types`; see Sections 4.1 and 4.2 of the LCKV paper and our new paper.
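To make the indexing concrete, here is a minimal Python sketch. Every value in it is a made-up assumption for a hypothetical 6-layer model (it is not the actual content or serialization format of `tinyllama_lckv.json`); only the semantics described above, that layer i reads the key-value pair produced by layer `layer_types[i]`, and the `forward_passes=7` / `backward_passes=2` values mentioned earlier are taken from this reply.

```python
# Hypothetical config fragment in the spirit of tinyllama_lckv.json.
# All numbers below are illustrative assumptions, not the real file contents;
# layer_types is shown as a Python list, the JSON serialization may differ.
config = {
    "num_hidden_layers": 6,
    "layer_types": [5, 5, 5, 5, 4, 5],  # made-up mapping, for illustration only
    "forward_passes": 7,                # empirically best setting per the reply above
    "backward_passes": 2,
}

# layer_types[i] = index of the layer whose key-value pair layer i attends to.
for i, src in enumerate(config["layer_types"]):
    if src == i:
        print(f"layer {i} keeps and attends to its own key-value pair")
    else:
        print(f"layer {i} attends to the key-value pair produced by layer {src}")
```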
More details about these configs can be found in the configuration file.
I hope this helps. If it does not answer your question, please let me know.