Image Captioning

Describe Images Taken by People Who Are Blind

A panel of eight example images paired with crowdsourced captions, including: "A computer screen with a Windows message about Microsoft license terms"; "A photo taken from a residential street in front of some homes with a stormy sky above"; "A blue sky with fluffy clouds, taken from a car while driving on the highway"; "A hand holds up a can of Coors Light in front of an outdoor scene with a dog on a porch"; "A digital thermometer resting on a wooden table, showing 38.5 degrees Celsius"; "A Winnie The Pooh character high chair with a can of Yoohoo sitting on it in front of a white wall"; and "A cup holder in a car holding loose change from Canada".

Overview

Observing that people who are blind have relied on (human-based) image captioning services to learn about the images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of 39,181 images taken by people who are blind, each paired with 5 captions. Our proposed challenge addresses the task of predicting a suitable caption for a given image. Ultimately, we hope this work will educate more people about the technological needs of blind people while providing an exciting new opportunity for researchers to develop assistive technologies that eliminate accessibility barriers for blind people.

Dataset

The VizWiz-Captions dataset includes:

  • 23,431 training images
  • 117,155 training captions
  • 7,750 validation images
  • 38,750 validation captions
  • 8,000 test images
  • 40,000 test captions

The download files are organized as follows:

  • Images: training, validation, and test sets
  • Annotations and APIs:
    • Annotations are split into two JSON files: train and validation. Captions are publicly shared for the train and validation splits and hidden for the test split. Each image also has a “text_detected” flag, which is set to true if at least three of the five crowdsourced results indicate that text is present in the image and false otherwise.
    • APIs are provided to demonstrate how to parse the JSON files and evaluate methods against the ground truth.
    • Details about each image are in the following format:
images = [image]
image = {
    "file_name": "VizWiz_train_00023410.jpg",
    "id": 23410
    "text_detected": true
}
annotations = [annotation]
annotation = {
    "image_id": 23410,
    "id": 117050,
    "caption": "A plastic rewards card lying face down on the floor."
    "is_rejected": false,
    "is_precanned": false,
    "text_detected": true
}
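
For reference, below is a minimal Python sketch of loading one of these annotation files and grouping captions by image; the file path is a placeholder for wherever the downloaded annotations are unpacked.

import json
from collections import defaultdict

# Placeholder path; point this at the downloaded train or validation JSON file.
with open("annotations/train.json") as f:
    data = json.load(f)

# Index image metadata by id and collect the five crowdsourced captions per image.
images = {image["id"]: image for image in data["images"]}
captions_by_image = defaultdict(list)
for annotation in data["annotations"]:
    captions_by_image[annotation["image_id"]].append(annotation["caption"])

example_id = data["images"][0]["id"]
print(images[example_id]["file_name"], images[example_id]["text_detected"])
for caption in captions_by_image[example_id]:
    print("  -", caption)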

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Challenge

Our proposed challenge is designed around the VizWiz-Captions dataset and addresses the following task:

Task: Image Captioning

Given an image, the task is to predict an accurate caption. Inspired by the COCO-Caption and nocaps challenges, we use the following evaluation metrics: BLEU-1 through BLEU-4, METEOR, ROUGE-L, CIDEr-D, and SPICE. In evaluation, we exclude pre-canned captions (i.e., “is_precanned”: true; the pre-canned text is “Quality issues are too severe to recognize visual content.”) and spam captions (i.e., “is_rejected”: true) from the ground-truth caption pool so that methods are not rewarded for predicting such captions. The winner of the challenge will be the team that achieves the highest CIDEr-D score across all test images.
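
As a rough illustration (not the official evaluation code), the sketch below filters the ground-truth pool as described above and computes a CIDEr score, assuming the pycocoevalcap package is installed; tokenization here is a simple lowercase split rather than the PTB tokenizer used by the COCO caption toolkit, so scores will differ slightly from the evaluation server's.

import json
from collections import defaultdict

from pycocoevalcap.cider.cider import Cider  # pip install pycocoevalcap

def load_ground_truth(annotation_file):
    """Map image_id -> captions, dropping rejected and pre-canned captions."""
    with open(annotation_file) as f:
        data = json.load(f)
    ground_truth = defaultdict(list)
    for annotation in data["annotations"]:
        if annotation["is_rejected"] or annotation["is_precanned"]:
            continue  # excluded from the ground-truth caption pool
        ground_truth[annotation["image_id"]].append(annotation["caption"])
    return ground_truth

def tokenize(caption):
    # Simplified tokenization; the official pipeline uses the PTB tokenizer.
    return " ".join(caption.lower().replace(".", " ").split())

def cider(ground_truth, predictions):
    """predictions: image_id -> single predicted caption string."""
    ids = [i for i in ground_truth if i in predictions]
    gts = {i: [tokenize(c) for c in ground_truth[i]] for i in ids}
    res = {i: [tokenize(predictions[i])] for i in ids}
    score, _ = Cider().compute_score(gts, res)
    return score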

Submission Instructions

Evaluation Servers

Teams participating in the challenge must submit results for the full VizWiz-Captions test dataset (i.e., 8,000 images) to our evaluation servers, which are hosted on EvalAI. As done for prior challenges (e.g., VQA, COCO), we created different partitions of the test dataset to support different evaluation purposes:

  • Test-dev: this partition consists of 4,000 test images and is available year-round. Each team can upload at most 10 submissions per day to receive evaluation results.
  • Test-challenge: this partition is available for a limited duration before the Computer Vision and Pattern Recognition (CVPR) conference in June 2020 to support the challenge, and contains all 8,000 images in the test dataset. Results on this partition will determine the challenge winners, which will be announced during the VizWiz Grand Challenge workshop hosted at CVPR. Each team can submit at most five results files over the length of the challenge and at most one result per day. The best scoring submitted result for each team will be selected as the team’s final entry for the competition.
  • Test-standard: this partition is available to support algorithm evaluation year-round, and contains all 8,000 images in the test dataset. Each team can submit at most five results files and at most one result per day. Each team can choose to share their results publicly or keep them private. When shared publicly, the top submission result will be published to the public leaderboard.

Uploading Submissions to Evaluation Servers

To submit results, each team will first need to create a single account on EvalAI. Then, on the platform, click the “Submit” tab, select the submission phase (“test-dev”, “test-challenge”, or “test-standard”), select the results file (i.e., a JSON file) to upload, fill in the required metadata about the method, and click “Submit”. The evaluation server may take several minutes to process the results. To have submission results appear on the public leaderboard when submitting to “test-standard”, check the box under “Show on Leaderboard”.

To view the status of a submission, navigate on the EvalAI platform to the “My Submissions” tab and choose the phase to which the results file was uploaded (i.e., “test-dev”, “test-challenge”, or “test-standard”). One of the following statuses should be shown: “Failed” or “Finished”. If the status is “Failed”, please check the “Stderr File” for the submission to troubleshoot. If the status is “Finished”, the evaluation successfully completed and the evaluation results can be downloaded. To do so, select “Result File” to retrieve the aggregated accuracy score for the submission phase used (i.e., “test-dev”, “test-challenge”, or “test-standard”).

The submission process is identical when submitting results to the “test-dev”, “test-challenge”, and “test-standard” evaluation servers. Therefore, we strongly recommend submitting your results first to “test-dev” to verify you understand the submission process.
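
Before uploading, it can also help to sanity-check the results file locally against the format described in the next section. Below is a minimal, unofficial sketch; the results file name is a placeholder.

import json

with open("results.json") as f:  # placeholder name for your submission file
    results = json.load(f)

assert isinstance(results, list), "Results must be a JSON list."
seen_ids = set()
for entry in results:
    assert "image_id" in entry and "caption" in entry, f"Malformed entry: {entry}"
    assert isinstance(entry["caption"], str) and entry["caption"].strip(), "Empty caption."
    seen_ids.add(entry["image_id"])
assert len(seen_ids) == len(results), "Duplicate image_id entries found."
print(f"{len(results)} predictions; expecting one per test image (8,000 in total).")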

Submission Results Formats

Use the following JSON format to submit results for the task:

results = [result]
result = {
    "image_id": 23431,
    "caption": "A computer screen shows a repair prompt on the screen."
}
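
For example, a results file in this format could be produced with a short script like the sketch below, where predict is a hypothetical stand-in for your captioning method and the file paths are placeholders.

import json

def write_results(test_image_list_file, output_file, predict):
    """Write predictions for every test image in the required JSON format."""
    with open(test_image_list_file) as f:
        test_images = json.load(f)["images"]
    results = [
        {"image_id": image["id"], "caption": predict(image["file_name"])}
        for image in test_images
    ]
    with open(output_file, "w") as f:
        json.dump(results, f)

# Example usage (paths and predict() are placeholders):
# write_results("annotations/test.json", "results.json", my_captioner.predict)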

Leaderboards

The Leaderboard page can be found here.

Rules

  • Teams are allowed to use external data to train their algorithms. The only exception is teams are not allowed to use any annotations of the test dataset.
  • Members of the same team are not allowed to create multiple accounts for a single project to submit more than the maximum number of submissions permitted per team on the test-challenge and test-standard datasets. The only exception is if the person is part of a team that is publishing more than one paper describing unrelated methods.

Code

The code used to crowdsource the captions is at this link.

Publications

The new dataset is described in the following publication (to appear soon):

Captioning Images Taken by People Who Are Blind
Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. To appear soon, 2020.

Contact Us

For questions about code, please send them to Yinan Zhao at yzhao@austin.utexas.edu.

For other questions, comments, or feedback, please send them to Danna Gurari at danna.gurari@ischool.utexas.edu.