
How to convert the text_features into text or input_ids correctly #142

Open

RichardMLuu opened this issue Jun 9, 2024 · 1 comment

@RichardMLuu

In your paper "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation" you show visualizations of image-text pairs in which the ALBEF model outputs text for a given image, which suggests ALBEF has this capability. However, ALBEF only has a decoder in model_vqa.py, so I would like to know how you generated that text.
I borrowed the text generation approach from the BLIP paper and used a pre-trained BERT model from the huggingface transformers library as the text_decoder, as shown in the code below. But the results are strange: the output is often the same few words repeated, and although the loss has gone down, the quality of the generated text is still very poor.

-----------code--------------

import torch
from transformers import BertLMHeadModel

# text_decoder (a model name/path) and config_decoder come from my config;
# text_output, image and self.tokenizer come from the surrounding ALBEF model code.
text_decoder = BertLMHeadModel.from_pretrained(text_decoder, config=config_decoder)
num_beams = 3

# Tile the text encoder's output once per beam and build an all-ones attention mask
# over it, then pass both to the decoder as cross-attention inputs.
question_states = text_output.last_hidden_state.repeat_interleave(num_beams, dim=0)
question_atts = torch.ones(question_states.size()[:-1], dtype=torch.long).to(question_states.device)
model_kwargs = {"encoder_hidden_states": question_states, "encoder_attention_mask": question_atts}

# Start every sequence from token id 0 (note: id 0 is [PAD] in BERT's vocabulary).
bos_ids = torch.full((image.size(0), 1), fill_value=0, dtype=torch.long, device=image.device)

outputs = text_decoder.generate(input_ids=bos_ids,
                                max_length=10,
                                min_length=1,
                                num_beams=num_beams,
                                eos_token_id=self.tokenizer.sep_token_id,
                                pad_token_id=self.tokenizer.pad_token_id,
                                **model_kwargs)
for output in outputs:
    answer = self.tokenizer.decode(output, skip_special_tokens=True)
    print(answer)

-----------result--------------

sung shan shan gang gang gang gang gang gang
and and.......
a truck drives on the road past a utility pole and grassy hill
a snowboarder flies through the air while holding their board with one hand

I hope you can tell me the correct way to generate text.
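
For reference, the BLIP code I adapted this from registers a dedicated decoder-start token [DEC] and begins beam search from it, whereas my snippet above starts from fill_value=0, which decodes to [PAD] in BERT's vocabulary. A minimal sketch of BLIP's start-token setup (reusing text_decoder and image from my snippet above):

-----------sketch--------------

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"bos_token": "[DEC]"})   # BLIP's decoder-start token
text_decoder.resize_token_embeddings(len(tokenizer))   # make room for the new token id

# Begin generation from [DEC] instead of token id 0 (= [PAD] in BERT's vocab)
bos_ids = torch.full((image.size(0), 1),
                     fill_value=tokenizer.bos_token_id,
                     dtype=torch.long, device=image.device)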

@Practicing7

Just wondering, did you ever solve this? From what I can see, both ALBEF and mPLUG use private tokenizers, and those tokenizers don't seem to be downloadable.
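
A quick way to check: the public ALBEF repo appears to point its text encoder at the stock bert-base-uncased checkpoint, so the standard tokenizer from transformers may be all that is needed (the model name here is taken from the repo's configs; I can't confirm the same for mPLUG):

-----------sketch--------------

from transformers import BertTokenizer

# Load the stock BERT tokenizer from the HuggingFace hub; no private files required.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)  # 101 102 0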
