
Scalability to multimodal large language models? #51

Open
DEBIHOOD opened this issue May 7, 2024 · 0 comments

DEBIHOOD commented May 7, 2024

Hi, I was looking around to see what's new in the image generation field, and your paper stood out as quite an interesting one! The idea of predicting the next resolution (it feels somewhat similar to how progressive growing GAN and its successor StyleGAN handled it) instead of predicting the next flattened token is quite interesting, and even more so predicting all tokens of a scale at once! I'm not entirely comfortable with methods that work in this low-to-high pyramid structure (progressive growing GAN, StyleGAN, StyleGAN 2, and StyleGAN 3 have shown some of the issues this architecture can create); a U-Net feels a bit more intuitive on that part of the spectrum, since it operates on the whole image while internally also decomposing it to lower resolutions. But hey, if it works, and works better than everything that existed before it, I like it!
So my question is: we have seen the scalability potential of transformers, we have seen that one big transformer can work better than many small transformers (as in the case of language translation), and we have even seen papers that try to combine it all into one big multimodal transformer (text tokens, image tokens, audio tokens), or even discard tokens altogether and work in the space of raw bytes (the MambaByte paper — not quite a transformer, but why not).
How can we apply VAR to a big multimodal LLM? Or a bimodal one for the sake of simplicity (audio tokens are already flat). With prior methods we can just flatten the image tokens: add the autoencoder's codebook tokens to the vocabulary of the LLM and emit special ⟨img⟩ ⟨/img⟩ tokens whenever we are working with images (a rough sketch of what I mean is below). With VAR the image tokens are predicted scale by scale rather than one by one, so it's less obvious to me how that fits into a single flat sequence.
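Just to make concrete what I mean by the "prior methods" route — a purely illustrative sketch; the vocabulary sizes, the ⟨img⟩/⟨/img⟩ IDs and the offset scheme are made-up placeholders, not anything from your code:

```python
# Hypothetical sketch: splicing flattened image tokens into an LLM token stream.
# The sizes below are assumptions, not values from the VAR repo or any real model.

TEXT_VOCAB_SIZE = 32_000          # size of the base LLM vocabulary (assumed)
IMG_CODEBOOK_SIZE = 4_096         # size of the image autoencoder codebook (assumed)

# Reserve two new special tokens and shift image codes past the text vocabulary.
IMG_START = TEXT_VOCAB_SIZE       # <img>
IMG_END = TEXT_VOCAB_SIZE + 1     # </img>
IMG_OFFSET = TEXT_VOCAB_SIZE + 2  # image code i -> token IMG_OFFSET + i

def build_sequence(text_ids, image_codes):
    """Flatten the image codes and splice them into the text token stream."""
    img_ids = [IMG_OFFSET + c for c in image_codes]   # flattened VQ indices
    return text_ids + [IMG_START] + img_ids + [IMG_END]

# Toy usage with dummy data:
text_ids = [17, 934, 20]            # pretend these came from the text tokenizer
image_codes = [5, 1023, 77, 400]    # pretend these came from the image autoencoder
sequence = build_sequence(text_ids, image_codes)
print(sequence)  # [17, 934, 20, 32000, 32007, 33025, 32079, 32402, 32001]
```

For VAR I guess the per-scale token maps could be concatenated coarse-to-fine in place of the raster flattening, but then the "predict a whole scale at once" part doesn't map onto plain next-token prediction — which is really the crux of my question.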

By the way, in the paper you also point out that rejection sampling achieves better scores, but what exactly is rejection sampling here? Is it just applying top-k and CFG versus sampling from the original probabilities?
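My naive reading of the term (possibly not what you mean in the paper) is "generate several candidates and keep only the ones a scorer accepts", roughly like the sketch below; `generate_fn`, `score_fn` and the threshold are stand-ins I made up for illustration:

```python
import random

def rejection_sample(generate_fn, score_fn, threshold, max_tries=32):
    """Generic rejection sampling: draw candidates until one passes the scorer.

    generate_fn: draws one sample from the model (e.g. one generated image).
    score_fn:    maps a sample to a quality score (e.g. a classifier confidence).
    threshold:   minimum score to accept; anything below it is rejected.
    """
    best, best_score = None, float("-inf")
    for _ in range(max_tries):
        candidate = generate_fn()
        score = score_fn(candidate)
        if score >= threshold:
            return candidate              # accepted
        if score > best_score:            # remember the best rejected candidate
            best, best_score = candidate, score
    return best                           # fallback if nothing passed

# Toy usage with stand-in functions:
sample = rejection_sample(
    generate_fn=lambda: random.random(),  # pretend this generates an image
    score_fn=lambda x: x,                 # pretend this is a classifier score
    threshold=0.9,
)
print(sample)
```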

Thanks for your work, and I'm looking forward to your text2image sequel.
