Need help understanding the number of tokens #22

Open
miguelcarvtalka opened this issue Nov 18, 2024 · 3 comments

Comments


miguelcarvtalka commented Nov 18, 2024

How do you ensure that the number of tokens doesn't exceed the maximum token length defined for the model? In the case of the Llama 3.2 1B decoder, the max token length seems to be 16k, but nowhere in the paper do you specify a maximum number of tokens for the video; everything seems to be threshold-based, so it appears entirely possible to exceed the context window even after STC. What do you do if, even after STC, the context still exceeds the maximum defined in the config file?

@xiaoqian-shen (Collaborator)

Hi @miguelcarvtalka,
If the number of tokens after compression still exceeds the context length, we force-truncate the excess tokens in each sliding window, as implemented here. In our reported results, we set model_max_length to 8k (8192) for a fair comparison with the baselines.
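For intuition, here is a minimal sketch of that kind of per-window force-truncation. All names here are hypothetical (including the equal-share budget split and the reserved text budget); see the linked code for the actual implementation:

```python
import torch

def truncate_windows(window_tokens, model_max_length=8192, reserved=256):
    """Force-truncate visual tokens so the total fits the context window.

    window_tokens: list of per-sliding-window tensors (num_tokens, hidden_dim).
    reserved: hypothetical budget left over for the text prompt.
    """
    budget = model_max_length - reserved
    per_window = budget // max(len(window_tokens), 1)  # equal share per window
    # Drop the trailing tokens of each window that exceed its share.
    return [w[:per_window] for w in window_tokens]

# Example: 6 windows of 2048 tokens each (12288 total) squeezed under 8192.
windows = [torch.randn(2048, 4096) for _ in range(6)]
truncated = truncate_windows(windows)
assert sum(w.shape[0] for w in truncated) <= 8192
```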


miguelcarvtalka commented Nov 19, 2024

Thank you for your reply! Another question: is there a way for the model to understand which tokens come from low-res images and which tokens are the output of the STC module? In other words, can the model distinguish whether a token belongs to a full image or not?

For example, just off the top of my head, you could include extra learned tokens that delimit the full image or the output of the STC module (the tokens that changed from the first frame in the window), or you could sum learned embeddings onto those tokens...
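To make the suggestion concrete, here is a rough sketch of the "sum learned embeddings" idea: add one of two learned type embeddings to each visual token so the decoder can tell full-frame tokens from STC-reduced ones. This is purely illustrative and not part of LongVU:

```python
import torch
import torch.nn as nn

class TokenTypeMarker(nn.Module):
    """Add one of two learned embeddings to each visual token:
    type 0 = token from a full (uncompressed) frame,
    type 1 = token kept by the STC module."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.type_embed = nn.Embedding(2, hidden_dim)

    def forward(self, tokens: torch.Tensor, token_type: torch.Tensor):
        # tokens: (seq_len, hidden_dim); token_type: (seq_len,) of 0/1 ints
        return tokens + self.type_embed(token_type)

# Example: mark the first 576 tokens as full-frame, the rest as STC output.
marker = TokenTypeMarker(hidden_dim=4096)
tokens = torch.randn(1000, 4096)
token_type = torch.cat([torch.zeros(576, dtype=torch.long),
                        torch.ones(424, dtype=torch.long)])
marked = marker(tokens, token_type)
```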

@xiaoqian-shen (Collaborator)

@miguelcarvtalka Thank you for sharing your thoughts! You can keep track of the indices of the frames where the tokens have been reduced. However, I’m not entirely clear on the meaning of the last sentence, "sum learned embeddings."
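For concreteness, a minimal sketch of that bookkeeping, assuming an `stc` callable that returns the kept tokens for each frame (all names hypothetical; this is not LongVU's code):

```python
# Track which frame indices had their tokens reduced by STC.
def apply_stc_with_bookkeeping(frame_tokens, stc):
    kept, reduced_frame_indices = [], []
    for idx, tokens in enumerate(frame_tokens):
        out = stc(tokens)                   # (num_kept, hidden_dim)
        kept.append(out)
        if out.shape[0] < tokens.shape[0]:  # some tokens were dropped
            reduced_frame_indices.append(idx)
    return kept, reduced_frame_indices
```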
