Need understanding of the number of tokens #22
Comments
Hi @miguelcarvtalka,
Thank you for your reply! Another question: is there a way for the model to understand which tokens come from low-res images and which tokens are the output of the STC module? That is, can the model distinguish whether a token belongs to a full image or not? For example, just off the top of my head, you could have included extra learned tokens that delimit the full image or the output of the STC module (the tokens that changed from the first frame in the window); you could also sum learned embeddings there, as in the sketch below...
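
Something like this minimal PyTorch sketch is what I mean by "sum learned embeddings" (all names here are made up, not from the repo):

```python
import torch
import torch.nn as nn

class TokenTypeEmbedding(nn.Module):
    """Hypothetical: add a learned "token type" embedding so the decoder
    can tell full-frame tokens apart from STC-reduced tokens."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # 0 = token from a full (first) frame, 1 = token kept by the STC module
        self.type_embed = nn.Embedding(2, hidden_dim)

    def forward(self, vision_tokens: torch.Tensor, token_type: torch.Tensor):
        # vision_tokens: (batch, num_tokens, hidden_dim)
        # token_type:    (batch, num_tokens) long tensor of 0/1 type ids
        return vision_tokens + self.type_embed(token_type)
```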
@miguelcarvtalka Thank you for sharing your thoughts! You can keep track of the indices of the frames where the tokens have been reduced, roughly as in the sketch below. However, I'm not entirely clear on the meaning of the last sentence, "sum learned embeddings."
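
A rough sketch of that bookkeeping (`window` and `stc` are illustrative stand-ins, not the actual code):

```python
def track_reduced_frames(window, stc):
    # `window` is an assumed list of per-frame token tensors of shape
    # (num_tokens, dim); `stc` is an assumed callable for the STC module.
    reduced_frame_indices = []
    for frame_idx, frame_tokens in enumerate(window):
        kept = stc(frame_tokens)  # STC keeps only tokens that changed
        if kept.shape[0] < frame_tokens.shape[0]:
            reduced_frame_indices.append(frame_idx)
    return reduced_frame_indices
```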
How do you ensure that the number of tokens doesn't exceed the model's maximum token length? For the Llama 3.2 1B decoder, the max token length seems to be 16k, but nowhere in the paper do you specify a maximum number of tokens for the video. Everything seems to be threshold-based, so it seems entirely possible to exceed the context window even after STC, right? What do you do if, even after STC, the context still exceeds the maximum defined in the config file? For concreteness, a naive guard is sketched below.
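
This is the kind of fallback I'm imagining; every name here (`video_tokens`, `text_tokens`, the 16k figure) is an assumption on my part, not something from the repo:

```python
def fit_to_context(video_tokens, text_tokens, max_position_embeddings=16_384):
    # Hypothetical guard: after STC, check the combined sequence length
    # against the context window and drop the oldest video tokens if it
    # still does not fit. Tensors are assumed to be (num_tokens, dim).
    budget = max_position_embeddings - text_tokens.shape[0]
    if video_tokens.shape[0] > budget:
        # Keep the most recent tokens. Alternatives: raise an error,
        # re-run STC with a stricter threshold, or subsample frames.
        video_tokens = video_tokens[-budget:]
    return video_tokens
```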