Hierarchical Instance Tracking

Locate and Track Instances of Objects and their Parts, Maintaining their Hierarchical Relationships

Overview

We introduce a task that unifies two problems historically examined independently: video instance segmentation (i.e., tracking all instances of predefined categories of objects in videos) and part segmentation (i.e., locating all instances of predefined categories of parts of objects in images). Called hierarchical instance tracking, it entails identifying and tracking all instances of predefined categories of objects and their parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset for this task, consisting of 32,202 instance segmentations of 2,765 unique entities that are tracked in 552 videos and span 40 categories. We expect this dataset challenge to inspire new algorithmic designs for handling a greater diversity of real-world challenges within a single model. Success can benefit many applications, including robotic manipulation, human-computer interaction, augmented reality, medical diagnostics, video retrieval, and video editing.

BIV-Priv-HIT Dataset

The BIV-Priv-HIT content can be downloaded below:

Sampled Video Frames
- training (from 327 videos)
- validation (from 87 videos)
- testing (from 138 videos)

JSON annotation files (following COCO formatting):
- training – 6,690 annotated frames
- validation – 1,680 annotated frames
- test – 2,795 annotated frames

Sample code
- evaluation – includes sample code and file structure
- visualization – code to visualize the annotations

Each annotation in the JSON files has the following format:

 {
  "area": 1761.5,
  "bbox": [537.0, 1030.0, 128.0, 21.0],
  "category_id": 30,
  "id": 9270,
  "image_id": 2090,
  "iscrowd": 0,
  "segmentation": [[537, 1031, 541, 1046, 665, 1052, 660, 1038]],
  "instance_id": 2
 }

instance_id is an identifier associated with each object-part pair; it is shared across all frames of the video.
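As a rough illustration of how these fields might be used, the Python sketch below groups annotations into per-instance tracks via instance_id and recomputes a polygon's area from its segmentation field. The helper names (polygon_area, group_by_instance) and the second annotation are hypothetical and are not part of the released sample code. The sketch also assumes that all annotations passed in come from a single video and that image_id increases with frame order, neither of which is confirmed here.

```python
from collections import defaultdict


def polygon_area(flat_xy):
    """Shoelace area of a COCO-style polygon given as [x0, y0, x1, y1, ...]."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


def group_by_instance(annotations):
    """Collect the annotations of one video into tracks keyed by instance_id.

    Assumption: image_id increases with frame order, so sorting by it
    yields each track in temporal order.
    """
    tracks = defaultdict(list)
    for ann in annotations:
        tracks[ann["instance_id"]].append(ann)
    for track in tracks.values():
        track.sort(key=lambda ann: ann["image_id"])
    return dict(tracks)


# The sample annotation shown above.
sample = {
    "area": 1761.5,
    "bbox": [537.0, 1030.0, 128.0, 21.0],
    "category_id": 30,
    "id": 9270,
    "image_id": 2090,
    "iscrowd": 0,
    "segmentation": [[537, 1031, 541, 1046, 665, 1052, 660, 1038]],
    "instance_id": 2,
}

# Hypothetical annotation of the same instance in a later frame.
later = dict(sample, id=9271, image_id=2091)

tracks = group_by_instance([later, sample])
recomputed_area = polygon_area(sample["segmentation"][0])
```

For the sample annotation, the recomputed polygon area agrees with the stored "area" value of 1761.5.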

This work is licensed under a Creative Commons Attribution 4.0 International License.

Challenge

Our proposed challenge is designed around the BIV-Priv-HIT dataset.

Task

Given a video featuring a single object, the task is to locate, segment, and track the object and its part instances across the duration of the video.

We provide full ground truth annotations for the training and validation portions of the dataset only.

Evaluation Metrics

Submissions will be evaluated based on the mean J&F value of each object and its parts across all frames. J&F is the standard metric for video object segmentation (VOS); it is the mean of the Jaccard Index (J), also known as intersection over union, and the boundary F-measure (F), the harmonic mean of precision and recall, as shown below.

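To calculate J (Jaccard Index):

$$J = \frac{|M \cap G|}{|M \cup G|}$$

To calculate boundary F-measure (F):

$$F = \frac{2 \cdot P \cdot R}{P + R}$$

Where M is the predicted segmentation mask, G is the ground-truth mask, and P and R are the precision and recall of the predicted boundary pixels with respect to the ground-truth boundary.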
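The metric described above can be sketched in a few lines of Python. This is a simplified illustration, not the challenge's evaluation code. It assumes masks are given as sets of (row, col) foreground pixels, defines a boundary pixel as a mask pixel with at least one 4-neighbour outside the mask, and matches boundaries with zero tolerance, whereas standard VOS implementations typically allow a small distance tolerance. All function names are hypothetical.

```python
def jaccard(pred, gt):
    """Intersection over union of two masks given as sets of (row, col)."""
    union = pred | gt
    if not union:
        return 1.0  # assumption: two empty masks count as a perfect match
    return len(pred & gt) / len(union)


def boundary(mask):
    """Mask pixels that have at least one 4-neighbour outside the mask."""
    return {
        (r, c)
        for (r, c) in mask
        if any(
            nb not in mask
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
        )
    }


def boundary_f(pred, gt):
    """Harmonic mean of boundary precision and recall (zero tolerance)."""
    pred_b, gt_b = boundary(pred), boundary(gt)
    if not pred_b and not gt_b:
        return 1.0  # assumption: no boundary on either side is a match
    if not pred_b or not gt_b:
        return 0.0
    matched = len(pred_b & gt_b)
    precision = matched / len(pred_b)
    recall = matched / len(gt_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred, gt):
    """Mean of the Jaccard Index and the boundary F-measure for one mask."""
    return (jaccard(pred, gt) + boundary_f(pred, gt)) / 2


# Toy example: a 4x4 ground-truth square and a prediction missing one column.
gt = {(r, c) for r in range(4) for c in range(4)}
pred = {(r, c) for r in range(4) for c in range(3)}
score = j_and_f(pred, gt)
```

In this toy case the prediction covers 12 of the 16 ground-truth pixels, so J is 0.75. The final score would then be obtained by averaging such per-mask values over every object, part, and frame.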

Publication

Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information
Neelima Prasad, Jarek Tyler Reynolds, Neel Karsanbhai, Tanusree Sharma, Lotus Zhang, Abigale Stangl, Yang Wang, Leah Findlater, and Danna Gurari. IEEE Winter Conference on Applications of Computer Vision (WACV), 2026.

Contact Us

For any questions, please contact Neelima Prasad (nepr1244@colorado.edu).