
Some questions on the paper #6

Open · Darkbblue opened this issue Mar 11, 2024 · 3 comments

@Darkbblue

  1. Why do the features and the meta prompts share the same dimension D? The meta prompts need to have the same dimension as ordinary text-encoder embeddings, but the dimension of the features is arbitrary, since the blocks used for feature extraction can be chosen flexibly.
  2. Can we have an interpretation of the step-by-step refinement? Is there any physical explanation of what happens when we feed a feature map into the UNet, which is designed to process image latents compressed by the VAE module?
@wwqq (Collaborator) commented Mar 11, 2024

Q1: We feed the features generated by the UNet into a 1x1 convolution to adjust them to the same dimension D as the meta prompts.
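
For intuition, here is a minimal PyTorch sketch of that projection. The channel counts, variable names, and shapes are illustrative assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

C, D = 1280, 768          # assumed: UNet block channels vs. text-embedding dim D
proj = nn.Conv2d(C, D, kernel_size=1)   # 1x1 conv = per-pixel linear projection

feats = torch.randn(2, C, 32, 32)       # a batch of UNet feature maps
feats_d = proj(feats)                   # -> (2, D, 32, 32), now matches the meta-prompt dim D
```

A 1x1 convolution changes only the channel dimension, so the spatial layout of the feature map is preserved while its channels are mapped onto D.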
Q2: The step-by-step refinement is a recurrent refinement training strategy: the initial output of the UNet is fed back into the same UNet for several iterations, or loops.
Physically, this process can be understood as an iterative enhancement of the feature representation. Each iteration refines and enriches the feature maps, allowing the model to capture more nuanced and complex patterns in the data, so the representation becomes progressively more detailed and accurate before it is used for the visual perception tasks.
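
To make the loop concrete, here is a hedged sketch of the recurrent idea. `TinyUNetStub`, the tensor shapes, and the loop count are all illustrative placeholders, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class TinyUNetStub(nn.Module):
    """Stand-in for the diffusion UNet; illustrative only."""
    def __init__(self, dim=768):
        super().__init__()
        self.body = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, feats, meta_prompts):
        # The real model would cross-attend to meta_prompts; the stub just mixes feats.
        return feats + self.body(feats)

def recurrent_refine(unet, feats, meta_prompts, num_loops=3):
    # Feed the UNet's output back into the same UNet for several loops.
    for _ in range(num_loops):
        feats = unet(feats, meta_prompts)
    return feats  # progressively refined feature representation

unet = TinyUNetStub()
feats = torch.randn(1, 768, 32, 32)      # initial UNet feature map (assumed shape)
meta_prompts = torch.randn(1, 64, 768)   # learned prompt tokens (assumed shape)
refined = recurrent_refine(unet, feats, meta_prompts)
```

The key point is that the same UNet weights are reused at every loop; only the input features change between iterations.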

@Darkbblue (Author)

Thanks for the reply! I now understand the first question, but I'm still not quite sure about the second one. The UNet is trained to process compressed images; why should it be able to process features as well? Does this mean the features contain meaningful visual structures, making them somehow similar to a compressed image?

@wwqq (Collaborator) commented Mar 13, 2024

Yes. The features are extracted so as to capture the critical information needed to reconstruct or understand the original image content. They contain meaningful visual structures, such as edges, textures, colors, or more abstract patterns in the data, and can therefore be regarded as akin to a compressed form of the image.
