Image Captioning

Describe Images Taken by People Who Are Blind

A panel of eight images paired with captions.  The first row contains four images with the following captions: "A computer screen with a Windows message about Microsoft license terms", "A photo taken from a residential street in front of some homes with a stormy sky above", and "A blue sky with fluffy clouds, taken from a car while driving on the highway".  The second row contains four images with the following captions: "A hand holds up a can of Coors Light in front of an outdoor scene with a dog on a porch", "A digital thermometer resting on a wooden table, showing 38.5 degrees Celsius", "A Winnie The Pooh character high chair with a can of Yoohoo sitting on it in front of a white wall", and "A cup holder in a car holding loose change from Canada".

Overview

Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of 39,181 images originating from people who are blind that are each paired with 5 captions. Our proposed challenge addresses the task of predicting a suitable caption given an image. Ultimately, we hope this work will educate more people about the technological needs of blind people while providing an exciting new opportunity for researchers to develop assistive technologies that eliminate accessibility barriers for blind people.

Dataset

The VizWiz-Captions dataset includes:

  • 23,431 training images
  • 117,155 training captions
  • 7,750 validation images
  • 38,750 validation captions
  • 8,000 test images
  • 40,000 test captions

The download files are organized as follows:

  • Images: training, validation, and test sets
  • Annotations and APIs:
    • Images are split into two JSON files: train and validation. Captions are publicly shared for the train and validation splits and hidden for the test split. There is also a “text_detected” flag, which is set to true for an image if it is set to true for at least three of the five crowdsourced results and to false otherwise.
    • APIs are provided to demonstrate how to parse the JSON files and evaluate methods against the ground truth.
    • Details about each image are in the following format:
images = [image]
image = {
    "file_name": "VizWiz_train_00023410.jpg",
    "id": 23410,
    "text_detected": true
}
annotations = [annotation]
annotation = {
    "image_id": 23410,
    "id": 117050,
    "caption": "A plastic rewards card lying face down on the floor.",
    "is_rejected": false,
    "is_precanned": false,
    "text_detected": true
}
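
As a minimal sketch of working with this format, the Python below groups annotations by image and recomputes the image-level “text_detected” flag using the at-least-three-of-five rule described above. It assumes the annotation file has top-level "images" and "annotations" keys, as the format suggests. The helper names and the inline sample data are illustrative only and are not part of the provided APIs.

```python
from collections import defaultdict


def group_by_image(dataset):
    """Map each image id to the list of its annotation dicts."""
    grouped = defaultdict(list)
    for ann in dataset["annotations"]:
        grouped[ann["image_id"]].append(ann)
    return dict(grouped)


def image_text_detected(annotations, min_votes=3):
    """Return True when at least `min_votes` annotations flag text."""
    votes = sum(1 for ann in annotations if ann["text_detected"])
    return votes >= min_votes


# Illustrative stand-in for a parsed annotation file (made-up captions).
flags = [True, True, True, False, False]
dataset = {
    "images": [
        {"file_name": "VizWiz_train_00023410.jpg", "id": 23410,
         "text_detected": True}
    ],
    "annotations": [
        {"image_id": 23410, "id": 117050 + i,
         "caption": f"Example caption {i}.",
         "is_rejected": False, "is_precanned": False,
         "text_detected": flag}
        for i, flag in enumerate(flags)
    ],
}

grouped = group_by_image(dataset)
for image in dataset["images"]:
    anns = grouped.get(image["id"], [])
    print(image["file_name"], len(anns), image_text_detected(anns))
```

With a downloaded annotation file, the `dataset` dictionary would instead come from `json.load` on that file.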


Note: If you downloaded the annotation file before April 17, 2020, that version contained a special character ‘\r’, which can cause caption misalignment when tokenizing each caption. If you have that older version, please make sure you either download the current annotation file or remove ‘\r’ from all captions.
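
For anyone keeping the older file, one possible cleanup is sketched below in Python. It assumes the parsed file exposes an "annotations" list whose entries have a "caption" string, as in the format above. The function name and sample captions are hypothetical.

```python
def strip_carriage_returns(dataset):
    """Remove every carriage return from each caption, in place.

    Returns the number of captions that were modified.
    """
    changed = 0
    for ann in dataset["annotations"]:
        cleaned = ann["caption"].replace("\r", "")
        if cleaned != ann["caption"]:
            ann["caption"] = cleaned
            changed += 1
    return changed


# Illustrative data: the first caption carries a stray '\r'.
dataset = {
    "annotations": [
        {"image_id": 1, "caption": "A red mug on a desk.\r"},
        {"image_id": 1, "caption": "A coffee cup."},
    ]
}
num_changed = strip_carriage_returns(dataset)
print(num_changed, repr(dataset["annotations"][0]["caption"]))
```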

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Challenge

Our proposed challenge is designed around the VizWiz-Captions dataset and addresses the following task:

Task: Image Captioning

Given an image, the task is to predict an accurate caption. Inspired by the COCO-Caption and nocaps challenges, we use the following evaluation metrics: BLEU1-4, METEOR, ROUGE-L, CIDEr-D, and SPICE. In evaluation, we exclude the pre-canned (i.e., “is_precanned”: true) and spam captions (i.e., “is_rejected”: true) from the ground truth caption pool so that methods are not rewarded for predicting such captions (pre-canned text is the following: “Quality issues are too severe to recognize visual content.”). The winner of the challenge will be the team which achieves the maximum CIDEr-D score for all test images.
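
The exclusion of pre-canned and spam captions can be reproduced locally before scoring on the validation split. The Python below is a sketch of that filtering step only, not the evaluation server's code. The function name and the sample annotations are illustrative.

```python
def reference_captions(annotations):
    """Keep captions that are neither rejected (spam) nor pre-canned."""
    return [
        ann["caption"]
        for ann in annotations
        if not ann["is_rejected"] and not ann["is_precanned"]
    ]


# Illustrative annotations for a single image.
annotations = [
    {"caption": "A can of soda on a kitchen counter.",
     "is_rejected": False, "is_precanned": False},
    {"caption": "Quality issues are too severe to recognize visual content.",
     "is_rejected": False, "is_precanned": True},
    {"caption": "asdf",
     "is_rejected": True, "is_precanned": False},
]
references = reference_captions(annotations)
print(references)
```

Note that an image whose captions are all excluded ends up with an empty reference list, so a local evaluation script would need to decide how to handle that case.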

Submission Instructions

Evaluation Servers

Teams participating in the challenge must submit results for the full VizWiz-Captions test dataset (i.e., 8,000 images) to our evaluation servers, which are hosted on EvalAI. Similar to prior challenges (e.g., VQA, COCO), we use this same test dataset to support three evaluation purposes:

  • Test-dev: this partition consists of 4,000 of the test images and is available year-round. Each team can upload at most 10 submissions per day to receive evaluation results. This partition enables teams to test working with the evaluation server and submit as many results as they want (within the daily limit), while preventing teams from knowing how their methods will fare on the full test set.
  • Test-challenge: this partition is available for a limited duration before the Computer Vision and Pattern Recognition (CVPR) conference in June 2020 to support the challenge, and contains all 8,000 images in the test dataset. Results on this partition will determine the challenge winners, which will be announced during the VizWiz Grand Challenge workshop hosted at CVPR. Each team can submit at most five results files over the length of the challenge and at most one result per day. The best scoring submitted result for each team will be selected as the team’s final entry for the competition.
  • Test-standard: this partition is the same as the test-challenge partition (i.e., it contains all 8,000 images in the test dataset), with the only difference being that it is available to support algorithm evaluation year-round. Each team can submit at most five results files and at most one result per day. Each team can choose to share their results publicly or keep them private. When shared publicly, the top submission result will be published to the public leaderboard.

Uploading Submissions to Evaluation Servers

To submit results, each team will first need to create a single account on EvalAI. On the platform, click on the “Submit” tab, select the submission phase (“test-dev”, “test-challenge”, or “test-standard”), select the results file (i.e., JSON file) to upload, fill in the required metadata about the method, and then click “Submit”. The evaluation server may take several minutes to process the results. To have the submission results appear on the public leaderboard when submitting to “test-standard”, check the box under “Show on Leaderboard”.

To view the status of a submission, navigate on the EvalAI platform to the “My Submissions” tab and choose the phase to which the results file was uploaded (i.e., “test-dev”, “test-challenge”, or “test-standard”). One of the following statuses should be shown: “Failed” or “Finished”. If the status is “Failed”, please check the “Stderr File” for the submission to troubleshoot. If the status is “Finished”, the evaluation successfully completed and the evaluation results can be downloaded. To do so, select “Result File” to retrieve the aggregated accuracy score for the submission phase used (i.e., “test-dev”, “test-challenge”, or “test-standard”).

The submission process is identical when submitting results to the “test-dev”, “test-challenge”, and “test-standard” evaluation servers. Therefore, we strongly recommend submitting your results first to “test-dev” to verify you understand the submission process.

Submission Results Formats

Use the following JSON format to submit results for the task:

results = [result]
result = {
    "image_id": 23431,
    "caption": "A computer screen shows a repair prompt on the screen."
}
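
A results file in this format could be assembled as in the Python sketch below. It assumes predictions are held in a dictionary keyed by integer image id. The function name and the basic validation checks are illustrative and do not reflect what the evaluation server enforces.

```python
import json


def build_results(predictions):
    """Convert {image_id: caption} into the list-of-dicts results format."""
    results = []
    for image_id, caption in sorted(predictions.items()):
        if not isinstance(image_id, int):
            raise TypeError(f"image_id must be an int, got {image_id!r}")
        if not isinstance(caption, str) or not caption.strip():
            raise ValueError(f"empty caption for image {image_id}")
        results.append({"image_id": image_id, "caption": caption.strip()})
    return results


# Illustrative prediction, reusing the example from the format above.
predictions = {
    23431: "A computer screen shows a repair prompt on the screen.",
}
results = build_results(predictions)
payload = json.dumps(results)  # JSON text to save as the results file
print(payload)
```

Writing `payload` to a file (for example with `open(path, "w")`, where `path` is any file name you choose) yields the JSON file to upload.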

Leaderboards

The Leaderboard page can be found here.

Rules

  • Teams are allowed to use external data to train their algorithms. The only exception is that teams are not allowed to use any annotations of the test dataset.
  • Members of the same team are not allowed to create multiple accounts for a single project to submit more than the maximum number of submissions permitted per team on the test-challenge and test-standard datasets. The only exception is if the person is part of a team that is publishing more than one paper describing unrelated methods.

Code

Code to crowdsource the captions: link.

Code to analyze our dataset: link.

Code for AOA, the top baseline method for our dataset challenge: link.

Publications

The new dataset is described in the following publications:

Captioning Images Taken by People Who Are Blind
Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya.
European Conference on Computer Vision (ECCV), 2020.

“I Hope This Is Helpful”: Understanding Crowdworkers’ Challenges and Motivations for an Image Description Task
Rachel Simons, Danna Gurari, Kenneth R. Fleischmann.
Proceedings of the ACM on Human-Computer Interaction (PACM HCI), 2020.

Contact Us

Please send questions about code to Yinan Zhao at yinanzhao@utexas.edu.

Please send other questions, comments, or feedback to Danna Gurari at danna.gurari@ischool.utexas.edu.