We propose an artificial intelligence challenge to design algorithms that answer visual questions asked by people who are blind. For this purpose, we introduce a visual question answering (VQA) dataset originating from this population, which we call VizWiz-VQA. It comes from a natural visual question answering setting in which blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. Our proposed challenge addresses the following two tasks for this dataset: (1) predict the answer to a visual question and (2) predict whether a visual question cannot be answered. Ultimately, we hope this work will educate more people about the technological needs of blind people while providing an exciting new opportunity for researchers to develop assistive technologies that eliminate accessibility barriers for blind people.
New, larger version as of January 1, 2020:
20,523 training image/question pairs
205,230 training answer/answer confidence pairs
4,319 validation image/question pairs
43,190 validation answer/answer confidence pairs
8,000 test image/question pairs
Deprecated version as of December 31, 2019:
20,000 training image/question pairs
200,000 training answer/answer confidence pairs
3,173 validation image/question pairs
31,730 validation answer/answer confidence pairs
8,000 test image/question pairs
New versus deprecated: the deprecated version uses 12-digit indices in filenames (e.g., “VizWiz_val_000000028000.jpg”), while the new version uses 8-digit indices (e.g., “VizWiz_val_00028000.jpg”).
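The padding difference amounts to a zero-fill width when building filenames from an integer index. A minimal sketch (the helper name `vizwiz_filename` is illustrative, not part of the dataset tooling):

```python
def vizwiz_filename(split, index, deprecated=False):
    """Build a VizWiz image filename for a given split and integer index.

    The deprecated dataset zero-pads indices to 12 digits;
    the new dataset zero-pads to 8 digits.
    """
    width = 12 if deprecated else 8
    return f"VizWiz_{split}_{index:0{width}d}.jpg"
```

For example, `vizwiz_filename("val", 28000)` yields the new-style name, and passing `deprecated=True` yields the old-style 12-digit name.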
New dataset only: VizWiz_train_00023431.jpg through VizWiz_train_00023958.jpg are images that contain private information, which is obfuscated using the ImageNet mean. The questions do not ask about the obfuscated private information in these images.
The files train.json and val.json illustrate two ways of assigning “answer_type”: in the deprecated version, it is the answer type of the most popular answer; in the new version, it is the most popular answer type across all answers’ answer types.
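The two assignment rules can be sketched as follows, assuming each crowdsourced answer has already been labeled with its own type (the function names are illustrative):

```python
from collections import Counter

def answer_type_deprecated(answers):
    """Deprecated rule: the type of the single most popular answer."""
    top_answer, _ = Counter(a["answer"] for a in answers).most_common(1)[0]
    return next(a["answer_type"] for a in answers if a["answer"] == top_answer)

def answer_type_new(answers):
    """New rule: the most popular type across all answers' types."""
    return Counter(a["answer_type"] for a in answers).most_common(1)[0][0]
```

The two rules can disagree: if two of ten workers answer “2” (a number) while the remaining answers are split among several different yes/no responses, the deprecated rule yields “number” while the new rule yields “yes/no”.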
Our proposed challenge is designed around the VizWiz-VQA dataset and addresses the following two tasks:
Task 1: Predict Answer to a Visual Question
Given an image and a question about it, the task is to predict an accurate answer. Inspired by the VQA challenge, we use the VQA accuracy evaluation metric, which credits a predicted answer as min(# humans who provided that answer / 3, 1).
The team which achieves the maximum average accuracy for all test visual questions wins this challenge.
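A minimal sketch of this accuracy metric, assuming the standard VQA scoring rule in which a prediction matching at least 3 of the crowdsourced answers receives full credit (answer normalization, such as lowercasing and punctuation stripping, is omitted for brevity):

```python
def vqa_accuracy(predicted, crowd_answers):
    """Score a predicted answer against the 10 crowdsourced answers.

    Full credit (1.0) if at least 3 humans gave the same answer,
    partial credit (matches / 3) otherwise.
    """
    matches = sum(answer == predicted for answer in crowd_answers)
    return min(matches / 3.0, 1.0)
```

A prediction matching 4 of 10 answers thus scores 1.0, while one matching 2 of 10 scores 2/3.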
Task 2: Predict Answerability of a Visual Question
Given an image and a question about it, the task is to predict whether the visual question can be answered, with a confidence score in that prediction. The confidence score provided by a prediction model is for ‘answerable’ and should be in [0, 1]. We use the average precision evaluation metric (as implemented in Python libraries such as scikit-learn), which computes the weighted mean of precisions along the precision-recall curve. The team that achieves the largest average precision score over all test visual questions wins this challenge.
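For reference, average precision can be computed with the following pure-Python sketch, which mirrors the standard definition used by scikit-learn’s `average_precision_score` (ties in scores are broken arbitrarily here; the challenge’s own evaluation server performs the official computation):

```python
def average_precision(labels, scores):
    """Weighted mean of precisions along the precision-recall curve.

    labels: 1 if the visual question is answerable, else 0.
    scores: the model's confidence in 'answerable', in [0, 1].
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positives = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_positives
```

For example, with labels [1, 0, 1, 1] ranked by descending score, the precisions at the positive ranks are 1/1, 2/3, and 3/4, giving an average precision of about 0.806.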
Teams participating in the challenge must submit results for the full 2018 VizWiz-VQA test dataset (i.e., 8,000 visual questions) to our evaluation servers, which are hosted on EvalAI. As done for prior challenges (e.g., VQA, COCO), we created different partitions of the test dataset to support different evaluation purposes:
Test-dev: this partition consists of 4,000 test visual questions and is available year-round. Each team can upload at most 10 submissions per day to receive evaluation results.
Test-challenge: this partition is available for a limited duration before the Computer Vision and Pattern Recognition (CVPR) conference in June 2020 to support the challenge for the VQA task, and contains all 8,000 visual questions in the test dataset. Results on this partition will determine the challenge winners, which will be announced during the Visual Question Answering and Dialog workshop hosted at CVPR. Each team can submit at most five results files over the length of the challenge and at most one result per day. The best scoring submitted result for each team will be selected as the team’s final entry for the competition.
Test-standard: this partition is available to support algorithm evaluation year-round, and contains all 8,000 visual questions in the test dataset. Each team can submit at most five results files and at most one result per day. Each team can choose to share their results publicly or keep them private. When shared publicly, the top submission result will be published to the public leaderboard.
Uploading Submissions to Evaluation Servers
To submit results, each team first needs to create a single account on EvalAI. On the platform, click the “Submit” tab, select the submission phase (“test-dev”, “test-challenge”, or “test-standard”), select the results file (i.e., a JSON file) to upload, fill in the required metadata about the method, and click “Submit”. The evaluation server may take several minutes to process the results. To have submission results appear on the public leaderboard when submitting to “test-standard”, check the box under “Show on Leaderboard”.
To view the status of a submission, navigate to the “My Submissions” tab on the EvalAI platform and choose the phase to which the results file was uploaded (i.e., “test-dev”, “test-challenge”, or “test-standard”). One of the following statuses should be shown: “Failed” or “Finished”. If the status is “Failed”, please check the “Stderr File” for the submission to troubleshoot. If the status is “Finished”, the evaluation completed successfully and the evaluation results can be downloaded: select “Result File” to retrieve the aggregated accuracy score for the submission phase used.
The submission process is identical when submitting results to the “test-dev”, “test-challenge”, and “test-standard” evaluation servers. Therefore, we strongly recommend submitting your results first to “test-dev” to verify you understand the submission process.
Submission Results Formats
Use the following JSON formats to submit results for both challenge tasks:
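A hedged sketch of building and saving the two results files (the field names “image”, “answer”, and “answerable” are assumptions based on the challenge’s published examples; verify them against the format specification for the phase you submit to):

```python
import json

# Task 1: one predicted answer string per test image.
task1_results = [
    {"image": "VizWiz_test_00000000.jpg", "answer": "basil leaves"},
]

# Task 2: one 'answerable' confidence in [0, 1] per test image.
task2_results = [
    {"image": "VizWiz_test_00000000.jpg", "answerable": 0.3},
]

with open("task1_results.json", "w") as f:
    json.dump(task1_results, f)
with open("task2_results.json", "w") as f:
    json.dump(task2_results, f)
```

Each file should contain exactly one entry per visual question in the test split being submitted to.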
New dataset (as of January 1, 2020): the Leaderboard pages for both tasks can be found here.
Deprecated dataset (as of December 31, 2019): the Leaderboard pages for both tasks can be found here.
Teams are allowed to use external data to train their algorithms. The only exception is that teams may not use any annotations of the test dataset.
Members of the same team are not allowed to create multiple accounts to submit more than the maximum number of submissions permitted per team on the test-challenge and test-standard datasets. The only exception is when a person belongs to a team that is publishing more than one paper describing unrelated methods.
The code for algorithm results reported in our CVPR 2018 publication is located here.
The new dataset is described in the following publications, while the deprecated version is based only on the first publication: