Locate and Track Instances of Objects and their Parts, Maintaining their Hierarchical Relationships

Overview
We introduce a task that unifies two problems historically examined independently: video instance segmentation (i.e., tracking all instances of predefined categories of objects in videos) and part segmentation (i.e., locating all instances of predefined categories of parts of objects in images). Called hierarchical instance tracking, it entails identifying and tracking all instances of predefined categories of objects and their parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset for this task, consisting of 32,202 instance segmentations of 2,765 unique entities that are tracked in 552 videos and span 40 categories. We expect this new dataset challenge to inspire new algorithmic designs for handling a greater diversity of real-world challenges within a single model. Success can benefit many applications, including robotic manipulation, human-computer interaction, augmented reality, medical diagnostics, video retrieval, and video editing.
BIV-Priv-HIT Dataset
The BIV-Priv-HIT content can be downloaded below:
- Sampled Video Frames
  - training (from 327 videos)
  - validation (from 87 videos)
  - testing (from 138 videos)
- JSON annotation files (follow COCO formatting):
  - training – 6,690 annotated frames
  - validation – 1,680 annotated frames
- Sample code
  - evaluation – includes sample code and file structure
  - visualization – code to visualize the annotations
Each annotation in the JSON files follows this format:
{
  "area": 1761.5,
  "bbox": [537.0, 1031.0, 128.0, 21.0],
  "category_id": 30,
  "id": 9270,
  "image_id": 2090,
  "iscrowd": 0,
  "segmentation": [[537, 1031, 541, 1046, 665, 1052, 660, 1038]],
  "instance_id": 2
}
- instance_id is a unique identifier associated with each object-part pair that is shared across all frames of the video.
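Because the annotations follow the COCO layout, recovering each entity's track is a matter of grouping the "annotations" entries by instance_id and ordering them by frame. A minimal sketch (the file path is hypothetical; substitute the downloaded annotation file):

```python
import json
from collections import defaultdict

def group_tracks(coco):
    """Group COCO-style annotations by instance_id so that each
    object-part entity's detections are ordered across the frames
    of its video (ordered here by image_id)."""
    tracks = defaultdict(list)
    for ann in coco["annotations"]:
        tracks[ann["instance_id"]].append(ann)
    for anns in tracks.values():
        anns.sort(key=lambda a: a["image_id"])
    return dict(tracks)

# Hypothetical usage with a downloaded annotation file:
# with open("train.json") as f:
#     tracks = group_tracks(json.load(f))
```

This assumes image_id ordering matches temporal ordering within a video, which holds for COCO-style video exports where frames are enumerated in sequence.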

This work is licensed under a Creative Commons Attribution 4.0 International License.
Challenge
Our proposed challenge is designed around the BIV-Priv-HIT dataset.
Task
Given a video featuring a single object, the task is to locate, segment, and track the object and its part instances across the duration of the video.
We provide full ground truth annotations for the training and validation portions of the dataset only.
Evaluation Metrics
Submissions will be evaluated based on the mean J&F score of each object and its parts across all frames. J&F is the standard metric for video object segmentation (VOS): it computes the mean of the Jaccard index (J), i.e., intersection over union, and the boundary F-measure (F), the harmonic mean of boundary precision and recall, as shown below.

To calculate J (Jaccard index), given a predicted mask $M$ and ground-truth mask $G$:

$$J = \frac{|M \cap G|}{|M \cup G|}$$

To calculate the boundary F-measure (F):

$$F = \frac{2 \, P_c \, R_c}{P_c + R_c}$$

where $P_c$ and $R_c$ are the precision and recall of the predicted mask's boundary with respect to the ground-truth boundary.

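The two per-frame scores can be sketched as follows. This is a simplified illustration, not the official evaluation code: it assumes NumPy and SciPy, and matches boundary pixels within a fixed pixel tolerance (the `tol` parameter is an assumption of this sketch):

```python
import numpy as np
from scipy import ndimage

def jaccard(pred, gt):
    """Jaccard index J = |pred ∩ gt| / |pred ∪ gt| for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement
    return np.logical_and(pred, gt).sum() / union

def boundary_f(pred, gt, tol=2):
    """Boundary F-measure: harmonic mean of boundary precision and
    recall, matching boundary pixels within `tol` pixels."""
    def boundary(m):
        # Boundary = mask minus its erosion (one-pixel-wide contour).
        m = m.astype(bool)
        return m & ~ndimage.binary_erosion(m)

    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    if bp.sum() == 0 or bg.sum() == 0:
        return 0.0
    # Distance from every pixel to the nearest boundary pixel.
    dist_to_gt = ndimage.distance_transform_edt(~bg)
    dist_to_pred = ndimage.distance_transform_edt(~bp)
    precision = (dist_to_gt[bp] <= tol).mean()
    recall = (dist_to_pred[bg] <= tol).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The per-frame J&F is then the mean of the two values, averaged over all frames and all object/part instances.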
Submission Instructions
Your submissions will be evaluated on the testing portion of the dataset.
Leaderboard
Coming soon.
Rules
- Teams are allowed to use external data to train their algorithms. The only exception is that teams are not allowed to use any annotations of the test dataset.
- Members of the same team cannot create multiple accounts for a single project in order to exceed the maximum number of submissions permitted per team on the test-challenge and test-standard datasets. The only exception is if a person belongs to multiple teams that are publishing separate papers describing unrelated methods.
Publication
- Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information
Neelima Prasad, Jarek Tyler Reynolds, Neel Karsanbhai, Tanusree Sharma, Lotus Zhang, Abigale Stangl, Yang Wang, Leah Findlater, and Danna Gurari. IEEE Winter Conference on Applications of Computer Vision (WACV), 2026.
Contact Us
For any questions, please contact Neelima Prasad (nepr1244@colorado.edu).