Hierarchical Instance Tracking

Locate and Track Instances of Objects and their Parts, Maintaining their Hierarchical Relationships

Overview

We introduce a task that unifies two problems historically examined independently: video instance segmentation (i.e., tracking all instances of predefined categories of objects in videos) and part segmentation (i.e., locating all instances of predefined categories of parts of objects in images). Called hierarchical instance tracking, it entails identifying and tracking all instances of predefined categories of objects and their parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset for this task, consisting of 32,202 instance segmentations of 2,765 unique entities that are tracked in 552 videos and span 40 categories. We expect this new dataset challenge will inspire new algorithmic designs for handling a greater diversity of real-world challenges within a single model. Success could benefit many applications, including robotic manipulation, human-computer interaction, augmented reality, medical diagnostics, video retrieval, and video editing.

BIV-Priv-HIT Dataset

The BIV-Priv-HIT content can be downloaded below:

  • JSON annotation files (following the COCO format)
  • Sample code

Each annotation in the JSON files has the following format:

 {
   "area": 1761.5,
   "bbox": [537.0, 1030.0, 128.0, 21.0],
   "category_id": 30,
   "id": 9270,
   "image_id": 2090,
   "iscrowd": 0,
   "segmentation": [[537, 1031, 541, 1046, 665, 1052, 660, 1038]],
   "instance_id": 2
 }
  • instance_id is an identifier assigned to each object-part pair; it stays the same across all frames of the video
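As a rough illustration of how these fields might be used, the sketch below groups annotations into per-instance tracks by instance_id and recomputes a polygon's area with the shoelace formula. It is not the dataset's official sample code. The helper names (polygon_area, group_tracks) are hypothetical, the second annotation is made up for illustration, and the code assumes that all annotations passed in belong to one video and that image_id increases with frame order.

```python
from collections import defaultdict


def polygon_area(flat_xy):
    """Shoelace area of one COCO-style polygon given as [x1, y1, x2, y2, ...]."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


def group_tracks(annotations):
    """Map instance_id -> list of annotations sorted by image_id.

    Assumption: `annotations` come from a single video and image_id
    follows frame order; neither is confirmed by the dataset description.
    """
    tracks = defaultdict(list)
    for ann in annotations:
        tracks[ann["instance_id"]].append(ann)
    for anns in tracks.values():
        anns.sort(key=lambda a: a["image_id"])
    return dict(tracks)


# The annotation shown above, as a Python dict.
sample = {
    "area": 1761.5,
    "bbox": [537.0, 1030.0, 128.0, 21.0],
    "category_id": 30,
    "id": 9270,
    "image_id": 2090,
    "iscrowd": 0,
    "segmentation": [[537, 1031, 541, 1046, 665, 1052, 660, 1038]],
    "instance_id": 2,
}
# Hypothetical annotation of the same instance in a later frame.
later = dict(sample, id=9271, image_id=2091)

tracks = group_tracks([later, sample])
print({k: [a["image_id"] for a in v] for k, v in tracks.items()})
print(polygon_area(sample["segmentation"][0]))  # matches the "area" field: 1761.5
```

Grouping by instance_id alone is only safe within one video, since the description says the id is shared across the frames of a video rather than across the whole dataset.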

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Challenge

Our proposed challenge is designed around the BIV-Priv-HIT dataset.

Task

Given a video featuring a single object, the task is to locate, segment, and track the object and its part instances across the duration of the video.

We provide full ground truth annotations for the training and validation portions of the dataset only.

Evaluation Metrics

Submissions will be evaluated by the mean J&F value of each object and its parts across all frames. J&F is the standard metric for video object segmentation (VOS): it is the mean of the Jaccard Index (J), also known as intersection over union, and the boundary F-measure (F), the harmonic mean of boundary precision and recall, as shown below.

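To calculate J (Jaccard Index):

$$J = \frac{|M \cap G|}{|M \cup G|}$$

To calculate boundary F-measure (F):

$$F = \frac{2 \cdot P \cdot R}{P + R}$$

Where M is the predicted mask, G is the ground-truth mask, and P and R are the precision and recall of the predicted mask's boundary pixels measured against the ground-truth boundary.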
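For intuition, here is a minimal pure-Python sketch of a per-frame J&F score. It is a simplified illustration, not the challenge's evaluation code. Masks are represented as sets of (row, col) pixels, boundaries are taken as mask pixels with a 4-neighbour outside the mask, and boundary pixels are matched within a pixel tolerance tol. All function names and the tol parameter are assumptions made for this example, and official VOS toolkits extract and match boundaries differently.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two pixel sets."""
    union = pred | gt
    if not union:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return len(pred & gt) / len(union)


def boundary(mask):
    """Mask pixels that have at least one 4-neighbour outside the mask."""
    return {
        (r, c)
        for (r, c) in mask
        if any(
            n not in mask
            for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
        )
    }


def matched_fraction(src, dst, tol):
    """Fraction of src pixels within `tol` pixels (Chebyshev) of a dst pixel."""
    if not src:
        return 1.0  # nothing to match, so nothing is wrong
    offsets = [(dr, dc) for dr in range(-tol, tol + 1) for dc in range(-tol, tol + 1)]
    hits = sum(
        1 for (r, c) in src if any((r + dr, c + dc) in dst for dr, dc in offsets)
    )
    return hits / len(src)


def boundary_f(pred, gt, tol=1):
    """Boundary F-measure: harmonic mean of boundary precision and recall."""
    pb, gb = boundary(pred), boundary(gt)
    precision = matched_fraction(pb, gb, tol)
    recall = matched_fraction(gb, pb, tol)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred, gt, tol=1):
    """Mean of region similarity J and boundary F-measure for one frame."""
    return (jaccard(pred, gt) + boundary_f(pred, gt, tol)) / 2


def rect(r0, c0, r1, c1):
    """Helper: filled rectangle of pixels, rows r0..r1-1 and cols c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


gt = rect(0, 0, 4, 4)
print(j_and_f(gt, gt))                  # identical masks score 1.0
print(jaccard(rect(0, 2, 4, 6), gt))    # shifted by two columns: 8 / 24
print(j_and_f(rect(10, 10, 12, 12), gt))  # far-apart masks score 0.0
```

A sequence-level score would then average j_and_f over every frame of every object and part instance, in line with the description above.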

Publication

Contact Us

For any questions, please contact Neelima Prasad (nepr1244@colorado.edu).