Answer Grounding for VQA

Grounding Answer for Visual Questions


Visual Question Answering (VQA) is the task of returning the answer to a question about an image. While most VQA services only return a natural language answer, we believe it is also valuable for a VQA service to return the region in the image used to arrive at the answer. We call this task of locating the relevant visual evidence answer grounding. We publicly share the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments, to encourage community progress in developing algorithmic frameworks for this task.

Numerous applications would be possible if answer groundings were provided in response to visual questions. First, they enable assessment of whether a VQA model reasons based on the correct visual evidence. This is valuable as an explanation as well as to support developers in debugging models. Second, answer groundings enable segmenting the relevant content from the background. This is a valuable precursor for obfuscating the background to preserve privacy, given that photographers can inadvertently capture private information in the background of their images. Third, users could more quickly find the desired information if a service instead magnified the relevant visual evidence. This is valuable in part because answers from VQA services can be insufficient. One reason is that humans suffer from “reporting bias”, meaning they describe what they find interesting without understanding what a person or population is seeking.

VizWiz-VQA-Grounding Dataset

The VizWiz-VQA-Grounding dataset includes:

  • Images
  • Annotations
    • Answer grounding boundary points for the most common answer. [JSON files]
    • Answer Grounding area (binary masks) for the most common answer. All binary masks are named using the following scheme: VizWiz_<dataset-split>_<image_id>.png. [PNG files]
    • Metadata indicating all responses (contains_multiple_questions, contains_multiple_foci, reasons for not drawing, and answer_grounding) from all crowd workers for each visual question. [Metadata]

Training, Validation, and Test Sets

  • Training Set
    • 6,494 examples
  • Validation Set
    • 1,131 examples
  • Test Set
    • 2,373 examples

The download files are organized as follows:

  • Each JSON annotation record has the following format:
{"question": "What is this?",
"answers": [
    "plastic spoons",
    "plastic spoons",
    "plastic spoons",
    "box spoons",
    "plastic spoons",
    "box plastic spoons",
    "plastic spoons",
    "everyday spoons"],
"most_common_answer": "plastic spoons",
"height": 1632,
"width": 1224,
"answer_grounding": [{"y": 1622.96961328125, "x": 111.38398132324218}, {"y": 463.5968232421875, "x": 310.2839813232422}, {"y": 388.5248232421875, "x": 389.8439813232422}, {"y": 440.7488232421875, "x": 986.5439813232422}, {"y": 1624.92793359375, "x": 818.2439813232422}, {"y": 1622.96961328125, "x": 111.38398132324218}]}
  • answer_grounding denotes all points along the boundary of the answer grounding.
  • You can use this [code] to transform answer_grounding into grounding masks.
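The linked code is not reproduced here. As a rough, dependency-free sketch of the same idea, the boundary points can be rasterized into a mask with an even-odd scanline fill. The function name `polygon_to_mask` and the choice to sample each pixel at its center are assumptions made for illustration, so masks produced this way may differ by a pixel along the boundary from the official ones. In practice a library polygon-fill routine would be faster on full-size images.

```python
import math
from typing import Dict, List


def polygon_to_mask(points: List[Dict[str, float]],
                    height: int, width: int) -> List[List[int]]:
    """Rasterize a polygon given as [{"x": ..., "y": ...}, ...] into a
    height x width mask with 0 for background and 255 for foreground.

    Hypothetical helper: even-odd rule, each pixel sampled at its center.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(points)
    for row in range(height):
        y = row + 0.5  # vertical center of this pixel row
        crossings = []  # x positions where polygon edges cross the scanline
        for i in range(n):
            p, q = points[i], points[(i + 1) % n]
            y0, y1 = p["y"], q["y"]
            # An edge crosses the scanline only if its endpoints lie on
            # opposite sides; horizontal and zero-length edges are skipped.
            if (y0 <= y) != (y1 <= y):
                t = (y - y0) / (y1 - y0)
                crossings.append(p["x"] + t * (q["x"] - p["x"]))
        crossings.sort()
        # Fill between successive pairs of crossings (inside the polygon).
        for left, right in zip(crossings[0::2], crossings[1::2]):
            start = max(0, math.ceil(left - 0.5))
            end = min(width, math.ceil(right - 0.5))
            for col in range(start, end):
                mask[row][col] = 255
    return mask


# Small demonstration: a closed square (first point repeated at the end,
# as in the dataset records) on a 6 x 6 canvas.
square = [{"x": 1, "y": 1}, {"x": 4, "y": 1}, {"x": 4, "y": 4},
          {"x": 1, "y": 4}, {"x": 1, "y": 1}]
demo_mask = polygon_to_mask(square, height=6, width=6)
```

For a real record, the same call would take the record's `answer_grounding`, `height`, and `width` fields as its three arguments.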

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.


Our proposed challenge is designed around the aforementioned VizWiz-VQA-Grounding dataset.


Given a visual question (question-image pair) with answer, the task is to return the region in the image used to arrive at the answer. The submissions will be evaluated based on the mean Intersection over Union (IoU) score across all test images. The team which achieves the maximum IoU score wins this challenge.
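To make the metric concrete, here is a minimal sketch of mean IoU over pairs of binary masks represented as nested lists. It is not the official evaluation code: the function names are hypothetical, any nonzero pixel is treated as foreground, and scoring an empty-versus-empty pair as 1.0 is an assumption that the evaluation server may handle differently.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over Union of two same-sized binary masks."""
    if len(pred) != len(truth) or any(
            len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have identical dimensions")
    inter = 0
    union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            p_on, t_on = p > 0, t > 0
            inter += p_on and t_on   # pixel is foreground in both masks
            union += p_on or t_on    # pixel is foreground in either mask
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs."""
    scores = [iou(pred, truth) for pred, truth in pairs]
    if not scores:
        raise ValueError("no mask pairs given")
    return sum(scores) / len(scores)


# Two toy 2 x 2 examples: one partial overlap, one exact match.
a_pred, a_true = [[255, 255], [0, 0]], [[255, 0], [255, 0]]
b_pred, b_true = [[0, 255], [0, 255]], [[0, 255], [0, 255]]
score = mean_iou([(a_pred, a_true), (b_pred, b_true)])
```

In the first toy pair one pixel is shared and three are covered by either mask, so its IoU is 1/3; the second pair matches exactly, so the mean is 2/3.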

Submission Instructions

Evaluation Servers

Teams participating in the challenge must submit results for the test portion of the dataset to our evaluation servers, which are hosted on EvalAI. We created different partitions of the test dataset to support different evaluation purposes:

  • Test-dev: this partition consists of 1,00 test visual questions and is available year-round. Each team can upload at most 10 submissions per day to receive evaluation results.
  • Test-challenge: this partition is available for a limited duration before the Computer Vision and Pattern Recognition (CVPR) conference in June 2022 to support the challenge for the VQA task, and contains all 2,373 visual questions in the test dataset. Results on this partition will determine the challenge winners, which will be announced during the VizWiz Grand Challenge workshop hosted at CVPR. Each team can submit at most five results files over the length of the challenge and at most one result per day. The best scoring submitted result for each team will be selected as the team’s final entry for the competition.
  • Test-standard: this partition is available to support algorithm evaluation year-round, and contains all 2,373 visual questions in the test dataset. Each team can submit at most five results files and at most one result per day. Each team can choose to share their results publicly or keep them private. When shared publicly, the best scoring submitted result will be published on the public leaderboard and will be selected as the team’s final entry for the competition.

Uploading Submissions to Evaluation Servers

To submit results, each team will first need to create a single account on EvalAI. Then, on the platform, click on the “Submit” tab, select the submission phase (“test”), select the results file (i.e., zip file) to upload, fill in required metadata about the method, and click “Submit”. The evaluation server may take several minutes to process the results. To have the submission results appear on the public leaderboard, check the box under “Show on Leaderboard”.

To view the status of a submission, navigate on the EvalAI platform to the “My Submissions” tab and choose the phase to which the results file was uploaded (i.e., “test”). One of the following statuses should be shown: “Failed” or “Finished”. If the status is “Failed”, please check the “Stderr File” for the submission to troubleshoot. If the status is “Finished”, the evaluation successfully completed and the evaluation results can be downloaded. To do so, select “Result File” to retrieve the aggregated score for the submission phase used (i.e., “test”).

Submission Results Formats

Please submit a ZIP file containing all test results. Each result must be a binary mask in the format of a PNG file.

Mask images are named using the scheme VizWiz_test_<image_id>.png, where <image_id> matches the corresponding VizWiz image for the mask.

Mask images must have the exact same dimensions as their corresponding VizWiz image.

Mask images are encoded as single-channel (grayscale) 8-bit PNGs (to provide lossless compression), where each pixel is either:

  • 0: representing the background of the image, or areas outside the answer grounding
  • 255: representing the foreground of the image, or areas inside the answer grounding

The foreground region in a mask image may be of any size (including the entire image).
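The format requirements above can be sketched with the Python standard library alone. The example below encodes a mask of 0/255 values as a single-channel 8-bit PNG and packs the files into a ZIP using the VizWiz_test_<image_id>.png naming scheme. The helper names, the placeholder image id, and the choice to store files at the root of the archive are assumptions for illustration; an imaging library would normally be used to write the PNGs.

```python
import struct
import zipfile
import zlib
from typing import Dict, List


def _png_chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def encode_mask_png(mask: List[List[int]]) -> bytes:
    """Encode a 0/255 mask as an 8-bit grayscale (color type 0) PNG."""
    height, width = len(mask), len(mask[0])
    if any(len(row) != width for row in mask):
        raise ValueError("all mask rows must have the same width")
    if any(v not in (0, 255) for row in mask for v in row):
        raise ValueError("mask pixels must be 0 or 255")
    # IHDR: width, height, bit depth 8, grayscale, default compression,
    # default filter method, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in mask)
    return (b"\x89PNG\r\n\x1a\n"
            + _png_chunk(b"IHDR", ihdr)
            + _png_chunk(b"IDAT", zlib.compress(raw))
            + _png_chunk(b"IEND", b""))


def write_submission(masks: Dict[str, List[List[int]]], zip_target) -> List[str]:
    """Write one PNG per image id into a ZIP (path or file-like object)."""
    names = []
    with zipfile.ZipFile(zip_target, "w") as archive:
        for image_id, mask in sorted(masks.items()):
            name = f"VizWiz_test_{image_id}.png"
            archive.writestr(name, encode_mask_png(mask))
            names.append(name)
    return names


# "00000000" is a placeholder id, not a real test image.
example_png = encode_mask_png([[0, 255], [255, 0]])
```

Each mask passed to `write_submission` should already have the same height and width as its corresponding VizWiz image, since this sketch does not check dimensions against the source images.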


The Leaderboard page for the challenge can be found here.


  • Teams are allowed to use external data to train their algorithms. The only exception is that teams are not allowed to use any annotations of the test dataset.
  • Members of the same team are not allowed to create multiple accounts for a single project to submit more than the maximum number of submissions permitted per team on the test-challenge and test-standard datasets. The only exception is if the person is part of a team that is publishing more than one paper describing unrelated methods.


Contact Us

For any questions, comments, or feedback, please send them to Chongyan Chen at