Hierarchical Instance Tracking

Locate and Track Instances of Objects and Their Parts, Maintaining Their Hierarchical Relationships

Overview

We introduce a task that unifies two problems historically examined independently: video instance segmentation (i.e., tracking all instances of predefined categories of objects in videos) and part segmentation (i.e., locating all instances of predefined categories of parts of objects in images). Called hierarchical instance tracking, it entails identifying and tracking all instances of predefined categories of objects and their parts while maintaining their hierarchical relationships. We introduce the first benchmark dataset for this task, consisting of 32,202 instance segmentations of 2,765 unique entities that are tracked in 552 videos and span 40 categories. We expect this new dataset and challenge will inspire new algorithmic designs that handle a greater diversity of real-world challenges within a single model. Success can benefit many applications, including robotic manipulation, human-computer interaction, augmented reality, medical diagnostics, video retrieval, and video editing.

BIV-Priv-HIT Dataset

The BIV-Priv-HIT content can be downloaded below:

  • JSON annotation files (follow the COCO format)
  • Sample code

Each video annotation in the JSON files has the following format:

{
  "area": 1761.5,
  "bbox": [537.0, 1031.0, 128.0, 21.0],
  "category_id": 30,
  "id": 9270,
  "image_id": 2090,
  "iscrowd": 0,
  "segmentation": [[537, 1031, 541, 1046, 665, 1052, 660, 1038]],
  "instance_id": 2
}
  • instance_id is an id assigned to each object or part instance and is shared across all frames of the video
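The annotation format above can be consumed with standard COCO-style tooling. As a minimal sketch (the function name `load_tracks` is ours, not part of any released code), the snippet below loads a JSON annotation file and groups annotations by instance_id so that each instance can be followed across frames:

```python
import json
from collections import defaultdict

def load_tracks(annotation_path):
    """Group COCO-style annotations by instance_id to form per-instance tracks.

    Assumes the file follows the COCO layout, with an "annotations" list
    whose entries carry the extra "instance_id" field shown above.
    """
    with open(annotation_path) as f:
        data = json.load(f)

    tracks = defaultdict(list)  # instance_id -> annotations across frames
    for ann in data["annotations"]:
        tracks[ann["instance_id"]].append(ann)

    # Order each track by frame (image_id) so it can be replayed in time.
    for anns in tracks.values():
        anns.sort(key=lambda a: a["image_id"])
    return dict(tracks)
```

Each returned track is a time-ordered list of annotations for one object or part, which is the unit the challenge asks participants to predict.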

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Challenge

Our proposed challenge is designed around the BIV-Priv-HIT dataset.

Task

Given a video featuring a single object, the task is to locate, segment, and track the object and its part instances across the duration of the video.

We provide full ground truth annotations for the training and validation portions of the dataset only.

Evaluation Metrics

Submissions will be evaluated based on the mean J&F value of each object and its parts across all frames. J&F is the standard metric for video object segmentation (VOS): the mean of the region similarity J, the Jaccard index (i.e., intersection over union), and the boundary F-measure F, the harmonic mean of boundary precision and recall, as shown below.

To calculate J (Jaccard index):

J = |M ∩ G| / |M ∪ G|

To calculate the boundary F-measure (F):

F = (2 · Pc · Rc) / (Pc + Rc)

Where M is the predicted segmentation mask, G is the ground-truth mask, and Pc and Rc are the precision and recall of the predicted boundary with respect to the ground-truth boundary.
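As an illustration of these metrics, the sketch below computes J, F, and their mean on binary NumPy masks. It is a simplification, not the official evaluation code: the boundary match here is exact per-pixel, whereas standard VOS evaluation (e.g., the DAVIS benchmark) matches boundaries within a small distance tolerance.

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

def boundary(mask):
    """Boundary pixels: mask pixels with at least one background 4-neighbour."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def boundary_f(pred, gt):
    """Boundary F-measure: harmonic mean of boundary precision and recall.

    Simplified: boundaries are matched pixel-for-pixel, with no distance
    tolerance as used by the official DAVIS evaluation.
    """
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    if bp.sum() == 0 or bg.sum() == 0:
        return 0.0
    precision = (bp & bg).sum() / bp.sum()
    recall = (bp & bg).sum() / bg.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt):
    """Mean of region similarity J and boundary F-measure F."""
    return (jaccard(pred, gt) + boundary_f(pred, gt)) / 2
```

In the challenge setting, J&F would be computed per frame for each object and part track and then averaged, as described above.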

Submission Instructions

Your submissions will be evaluated on the testing portion of the dataset.

Leaderboard

Coming soon.

Rules

  • Teams are allowed to use external data to train their algorithms. The only exception is that teams are not allowed to use any annotations of the test dataset.
  • Members of the same team may not create multiple accounts for a single project in order to exceed the maximum number of submissions permitted per team on the test-challenge and test-standard datasets. The only exception is when a person belongs to teams publishing more than one paper describing unrelated methods.

Publication

Contact Us

For any questions, please contact Neelima Prasad (nepr1244@colorado.edu).