2021 VizWiz Grand Challenge Workshop

A panel of example images for two tasks in the challenge: image captioning and visual question answering. On the left are shown nine images paired with captions for image captioning. The first row contains three images with the following captions: "A computer screen with a Windows message about Microsoft license terms", "A can of green beans is sitting on a counter in a kitchen", "A photo taken from a residential street in front of some homes with a stormy sky above". The second row contains three images with the following captions: "A hand holds up a can of Coors Light in front of an outdoor scene with a dog on a porch", "A digital thermometer resting on a wooden table, showing 38.5 degrees Celsius", "A close up of a Black and silver pocket Kershaw knife sits in a white persons open palm". The third row contains three images with the following captions: "A blue sky with fluffy clouds, taken from a car while driving on the highway", "A baby chair that has cartoon characters on it with a can of yahoo on the table", and "A cup holder in a car holding loose change from Canada".

On the right are shown eight images paired with questions and answers for visual question answering. The first row contains four images with the following question and answer pairs: 1. Question: Does this foundation have any sunscreen? Answer: yes; 2. Question: what is this? Answer: 10 euros; 3. Question: what color is this? Answer: green; 4. Question: Please can you tell me what this item is? Answer: butternut squash red pepper soup. The second row contains four images with the following question and answer pairs: 1. Question: What type of pills are these? Answer: unsuitable image; 2. Question: What type of soup is this? Answer: unsuitable image; 3. Question: who is this mail for? Answer: unanswerable; 4. Question: when is the expiration date? Answer: unanswerable.

Overview

Our goal for this workshop is to educate researchers about the technological needs of people with vision impairments while empowering researchers to improve algorithms to meet these needs. A key component of this event will be to track progress on two dataset challenges, where the tasks are to answer visual questions and caption images taken by people who are blind. Winners of these challenges will receive awards sponsored by Microsoft. The second key component of this event will be a discussion of current research and application issues, including talks by invited speakers from both academia and industry who will share their experiences in building today’s state-of-the-art assistive technologies as well as designing next-generation tools.

Important Dates

Monday, February 1: challenge submissions announced
Friday, May 21 [5:59pm Central Standard Time]: challenge submissions due
Friday, May 21 [5:59pm Central Standard Time]: extended abstracts due
Friday, May 28 [5:59pm Central Standard Time]: notification to authors about decisions for extended abstracts
Saturday, June 19: all-day workshop

Submissions

We invite two types of submissions:

Challenge Submissions

We invite submissions of results from algorithms for both the image captioning challenge task and the visual question answering challenge task. We accept submissions for algorithms that are not published, currently under review, or already published. The teams with the top-performing submissions will be invited to give short talks during the workshop. The top two teams for each challenge will receive financial awards sponsored by Microsoft:

- 1st place: $10,000 Microsoft Azure credit
- 2nd place: $5,000 Microsoft Azure credit

Extended Abstracts

We invite submissions of extended abstracts on topics related to image captioning, visual question answering, and assistive technologies for people with visual impairments. Papers must be at most two pages (with references) and follow the CVPR formatting guidelines using the provided author kit. Reviewing will be single-blind, and accepted papers will be presented as posters. We will accept submissions on work that is not published, currently under review, or already published. There will be no proceedings. Please send your extended abstracts to workshop@vizwiz.org.

Please note that we will require all camera-ready content to be accessible via a screen reader. Given that making accessible PDFs and presentations may be a new process for some authors, we will host training sessions beforehand to educate authors and help them make their content accessible. More details to come soon.

Program

Location:

The event is being held virtually.

Schedule:

All times below are in Central Time (CT).

9:00-9:10am: Opening remarks
9:10-9:30am: Invited talk by Dhruv Batra
9:30-9:50am: Invited talk by Anna Rohrbach
9:50-10:10am: Invited talk by Cole Gleason
10:10-10:30am: Break
10:30-11:30am: Panel with blind technology advocates
11:30am-12:30pm: Lunch break
12:30-12:40pm: Overview of challenge, winner announcements, and analysis of results
12:40-12:50pm: Talks by top-2 teams for the VizWiz-Captions Challenge 2021
- 1st place: runner (Alibaba Group, Beihang University)
- 2nd place: Sparta117(SRC-B) (Samsung)
12:50-1:00pm: Talks by top-2 teams for the VizWiz-VQA Challenge 2021
- 1st place: DA_Team (Alibaba Group)
- 2nd place: HSSLAB_INSPUR (Inspur)
1:00-1:15pm: Poster spotlights
1:15-2:00pm: Poster session (Q&A in CVPR virtual platform, registration required)
2:00-2:30pm: Break
2:30-2:50pm: Invited talk by Yue-Ting Siu
2:50-3:10pm: Invited talk by Daniela Massiceti
3:10-3:30pm: Invited talk by Joshua Miele
3:30-3:45pm: Break
3:45-4:45pm: Panel with invited speakers
4:45-4:55pm: Open discussion
4:55-5:00pm: Closing remarks

Invited Speakers and Panelists:

Dhruv Batra
Georgia Tech, Facebook AI Research

Anna Rohrbach
UC Berkeley

Cole Gleason
Apple

Yue-Ting Siu
San Francisco State University

Daniela Massiceti
Microsoft Research

Joshua Miele
Amazon

Peter Slatin
The Slatin Group

Ed Summers
SAS

Nefertiti Matos
New York Public Library

Poster List

An Improved Feature Extraction Approach to Image Captioning for Visually Impaired People
Dong Wook Kim, Joon gwon Hwang, Sang Hyeok Lim, Sang Hun Lee

Cross-Attention with Self-Attention for VizWiz VQA
Rachana Jayaram, Shreya Maheshwari, Hemanth C, Sathvik N Jois, Dr. Mamatha H.R.

Data augmentation to improve robustness of image captioning solutions
Shashank Bujimalla, Mahesh Subedar, Omesh Tickoo

Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation
Jae Won Cho, Dong-Jin Kim, Jinsoo Choi, Yunjae Jung, In So Kweon

Enhancing Textual Cues in Multi-modal Transformers for VQA
Yu Liu, Lianghua Huang, Liuyihang Song, Bin Wang, Yingya Zhang, Pan Pan

Live Photos: Mitigating the Impacts of Low-Quality Images in VQA
Lauren Olson, Chandra Kambhamettu, Kathleen McCoy

Multiple Transformer Mining for VizWiz Image Caption
Xuchao Gong, Hongji Zhu, Yongliang Wang, Biaolong Chen, Aixi Zhang, Fangxun Shu, Si Liu

Deep Co-Attention Model for Challenging Visual Question Answering on VizWiz
Wentao Mo, Yang Liu

Two-stage Refinements for Vizwiz-VQA
Runze Zhang, Xiaochuan Li, Baoyu Fan, Zhenhua Guo, Yaqian Zhao, Rengang Li

Organizers

Danna Gurari
University of Texas at Austin

Jeffrey Bigham
Carnegie Mellon University, Apple

Meredith Morris
Google

Ed Cutrell
Microsoft

Abigale Stangl
University of Washington

Yinan Zhao
University of Texas at Austin

Samreen Anjum
University of Texas at Austin

Contact Us

For questions, comments, or feedback, please contact Danna Gurari at danna.gurari@colorado.edu.