Visual Question Answer Differences

Recognizing Why Answers to Visual Questions Differ

Three rows with nine visual questions and ten answers each, along with the corresponding reason why the ten answers differ.
Examples of visual questions (VQs) asked by people who are blind and sighted, and answers from 10 different people. As shown, the answers can differ for a variety of reasons, including because of the VQ (first and second columns) or the answers (third column).


Visual Question Answering is the task of returning the answer to a question about an image. A challenge is that different people often provide different answers to the same visual question. We present a taxonomy of nine plausible reasons why the answers can differ, and create two labelled datasets consisting of ∼45,000 visual questions, each annotated with the reasons that led to answer differences. We then propose the novel problem of predicting, directly from a visual question, which reasons will cause answer differences, along with a novel algorithm for this purpose.

To our knowledge, this is the first work in the computer vision community to characterize, quantify, and model reasons why annotations differ. We believe it will motivate and facilitate future work on related problems that similarly face annotator differences, including image captioning, visual storytelling, and visual dialog. We publicly share the datasets and code to encourage community progress in developing algorithmic frameworks that can account for the diversity of perspectives in a crowd.

Answer-Difference Dataset

The Answer-Difference dataset includes:

  • Images
    • 29,921 from VizWiz dataset
    • 12,816 from VQA 2.0 dataset *
  • Annotations
    • Visual questions (question-image pairs), each with 10 answers
    • Reasons for answer difference

* Some visual questions in VQA 2.0 ask about the same image, so the number of images is smaller than the number of visual questions.

Training, Validation, and Test Sets

  • Training Set
    • 19,196 from VizWiz
    • 9,772 from VQA 2.0
  • Validation Set
    • 3,048 from VizWiz
    • 1,504 from VQA 2.0
  • Test Set
    • 7,677 from VizWiz *
    • 3,758 from VQA 2.0

* Test set annotations for the VizWiz dataset are not publicly shared, since they could provide clues for the ongoing VQA challenge.
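
The split sizes above can be cross-checked with a little arithmetic. The sketch below uses only the counts listed on this page and assumes they refer to visual questions rather than images; the variable names are illustrative.

```python
# Split sizes as listed above (assumed to count visual questions).
SPLITS = {
    "VizWiz": {"train": 19_196, "val": 3_048, "test": 7_677},
    "VQA 2.0": {"train": 9_772, "val": 1_504, "test": 3_758},
}

# Sum the three splits for each source dataset, then across sources.
totals = {source: sum(sizes.values()) for source, sizes in SPLITS.items()}
overall = sum(totals.values())

for source, total in totals.items():
    print(f"{source}: {total:,}")
print(f"Overall: {overall:,}")
```

Under that assumption, the VizWiz splits sum to 29,921, matching the image count, while the VQA 2.0 splits sum to 15,034, which exceeds the 12,816 images because some questions share an image. The overall total of 44,955 agrees with the ∼45,000 visual questions mentioned above.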

Example Code to get started

  • coming soon

The download files are organized as follows:

  • Annotations are split into three JSON files: train, validation, and test.
  • Each annotation record has the following format:
{
    "ans_type": "other",
    "qid": "VizWiz_train_000000012255.jpg",
    "image": "VizWiz_train_000000012255.jpg",
    "question": "What is this please?",
    "answers": [
        "iphone screensaver",
        "cell phone",
        "time date",
        "time date cell phone scene",
        "cell phone screen"
    ],
    "src_dataset": "VIZWIZ",
    "ans_dis_labels": [0, 0, 0, 0, 5, 0, 5, 4, 0, 0]
}
  • src_dataset can take 2 values: VIZWIZ or VQA
  • ans_dis_labels gives, for each reason for answer difference in a fixed order, how many crowd workers (out of 5) selected that reason. Details about the meaning of each label are shared in the paper.
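
As a starting point for working with these records, here is a minimal Python sketch. It is not the official example code. It assumes each split file contains a JSON list of records shaped like the one above; the helper names (load_annotations, by_source, selected_reasons, vote_fractions) and the min_votes threshold are hypothetical choices made for illustration.

```python
import json
from pathlib import Path

NUM_WORKERS = 5  # each reason is judged by 5 crowd workers (stated above)


def load_annotations(path):
    """Load one split file. ASSUMPTION: the file is a JSON list of records."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def by_source(records, src_dataset):
    """Keep only records whose src_dataset is "VIZWIZ" or "VQA"."""
    return [r for r in records if r["src_dataset"] == src_dataset]


def selected_reasons(record, min_votes=2):
    """Indices of reasons chosen by at least min_votes workers.

    min_votes is an illustrative threshold, not one prescribed by the dataset.
    """
    return [i for i, votes in enumerate(record["ans_dis_labels"]) if votes >= min_votes]


def vote_fractions(record):
    """Fraction of the 5 workers who selected each reason."""
    return [votes / NUM_WORKERS for votes in record["ans_dis_labels"]]


# The sample record shown above, used here instead of a real file.
example = {
    "ans_type": "other",
    "qid": "VizWiz_train_000000012255.jpg",
    "image": "VizWiz_train_000000012255.jpg",
    "question": "What is this please?",
    "answers": [
        "iphone screensaver",
        "cell phone",
        "time date",
        "time date cell phone scene",
        "cell phone screen",
    ],
    "src_dataset": "VIZWIZ",
    "ans_dis_labels": [0, 0, 0, 0, 5, 0, 5, 4, 0, 0],
}

print(selected_reasons(example))  # [4, 6, 7]
print(vote_fractions(example))
```

The indices returned by selected_reasons are positions in ans_dis_labels; mapping them to reason names requires the label order described in the paper.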

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.


Contact Us

For questions about code, please contact Qing Li.

For other questions, comments, or feedback, please contact Danna Gurari.