Vision-Language Model-based PolyFormer for Recognizing Visual Questions with Multiple Answer Groundings
This paper presents a new method that leverages the Vision-and-Language Transformer (ViLT) and the PolyFormer model to tackle the Single Answer Grounding Challenge on the VQA-AnswerTherapy dataset. Our approach first uses the ViLT model to predict the number of unique answers given the input question and image. The PolyFormer model then takes the ViLT output together with the image and produces the corresponding visual answer masks. Whether these masks overlap determines whether the answers share a common grounding: if there is no overlap, the question has multiple groundings. Our approach achieved an F1 score of 81.71 on the test-dev set and 80.72 on the VizWiz Grand Challenge test set, placing our team among the Top 3 submissions in the competition.
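As a rough illustration of the final decision rule, the sketch below assumes the per-answer masks have already been produced by PolyFormer; the helper name, the IoU-based overlap measure, and the threshold value are assumptions for illustration, not the exact rule used in our submission.

```python
import numpy as np

def masks_share_grounding(masks, iou_threshold=0.5):
    """Decide whether all answer masks point to the same image region.

    `masks` is a list of binary numpy arrays (one per unique answer
    predicted by ViLT, each segmented by PolyFormer). If every pair of
    masks overlaps above `iou_threshold`, the answers are treated as
    sharing a single grounding; otherwise the question is considered to
    have multiple groundings. The threshold is an assumed value.
    """
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            a, b = masks[i].astype(bool), masks[j].astype(bool)
            intersection = np.logical_and(a, b).sum()
            union = np.logical_or(a, b).sum()
            iou = intersection / union if union > 0 else 0.0
            if iou < iou_threshold:
                return False  # at least one answer is grounded elsewhere
    return True
```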
To reproduce our results, follow these steps:
- Download the test set from VQA-AnswerTherapy
- Fine-tune the ViLT model on the VQA-AnswerTherapy dataset (see the sketch after this list)
- Download the pretrained PolyFormer-L checkpoint
- Run inference on the VizWiz Grand Challenge test set using inference_VizWiz
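The snippet below is a minimal sketch of how the ViLT step could be set up with the Hugging Face `transformers` library: a `ViltModel` backbone with a small classification head that predicts the number of unique answers for a question-image pair. The checkpoint name, the maximum answer count, and the head design are assumptions and may differ from the configuration used in our submission.

```python
import torch.nn as nn
from transformers import ViltProcessor, ViltModel

class AnswerCountViLT(nn.Module):
    """ViLT backbone + linear head predicting the number of unique answers.

    Hypothetical setup for illustration; the actual fine-tuning
    configuration in this repository may differ.
    """
    def __init__(self, backbone="dandelin/vilt-b32-mlm", max_answers=10):
        super().__init__()
        self.vilt = ViltModel.from_pretrained(backbone)
        self.classifier = nn.Linear(self.vilt.config.hidden_size, max_answers)

    def forward(self, **inputs):
        pooled = self.vilt(**inputs).pooler_output  # [batch, hidden_size]
        return self.classifier(pooled)              # logits over answer counts

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = AnswerCountViLT()

# Example forward pass on one (image, question) pair from VQA-AnswerTherapy,
# where `image` is a PIL.Image and `question` is the question string:
# inputs = processor(images=image, text=question, return_tensors="pt")
# logits = model(**inputs)
# num_answers = logits.argmax(dim=-1).item() + 1
```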
If you have any questions, please feel free to contact Dai Tran ([email protected]).