Hierarchical Instance Tracking

Locate and Track Object and Part Instances

Overview

While a key focus of the segmentation literature until recently was on locating objects in images, two distinct, more challenging problems have since emerged: (1) locating parts of objects in images and (2) tracking objects in videos. Rather than studying these problems in isolation, we propose to study them jointly. We call this task of tracking every instance of an object and its parts hierarchical instance tracking.

Hierarchical instance tracking (HIT) unifies three distinct tasks into a single framework: object instance segmentation, part instance segmentation, and entity tracking. We introduce the first benchmark dataset for this task, consisting of 32,202 instance segmentations of 2,765 unique entities tracked across 552 videos spanning 40 categories. We found that our new dataset is challenging for modern models. Our fine-grained analysis shows that the best-performing model struggles most with tracking parts, especially those that occupy a small number of pixels.

We expect our new dataset challenge will inspire new algorithmic designs for handling a greater diversity of real-world challenges within a single model. Success in infusing such finer-grained, hierarchical segmentation representations when tracking content in videos can benefit many applications, including robotic manipulation, human-computer interaction, augmented reality, medical diagnostics, video retrieval, and video editing. Success can also benefit downstream video analysis tasks that rely on entity tracking, including action recognition, pose estimation, object re-identification, and video summarization.

BIVPriv-HIT Dataset

  • Annotated Video Frames: [training, validation, testing]
    • Training – 6,690 annotated training frames across 327 videos.
    • Validation – 1,680 annotated validation frames across 87 videos.
    • Testing – 2,795 annotated test frames across 138 videos.

JSON annotation files follow COCO formatting.
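For illustration, the following is a minimal Python sketch of how the COCO-style annotation files can be loaded with the pycocotools library. The file name "hit_train.json" is a placeholder, not part of the release; substitute the actual annotation file for the split you downloaded.

    # Minimal sketch: loading COCO-style annotations with pycocotools.
    # "hit_train.json" is a placeholder file name, not part of the release.
    from pycocotools.coco import COCO

    coco = COCO("hit_train.json")
    image_ids = coco.getImgIds()                    # ids of all annotated frames
    for image_id in image_ids[:3]:
        info = coco.loadImgs(image_id)[0]           # frame metadata (file name, height, width)
        anns = coco.loadAnns(coco.getAnnIds(imgIds=image_id))
        for ann in anns:                            # object/part instance annotations for this frame
            mask = coco.annToMask(ann)              # binary mask as an H x W numpy array
            print(info["file_name"], ann["category_id"], mask.sum())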

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Challenge

Our proposed challenge is designed around the aforementioned BIVPriv-HIT dataset.

Task

Given a video featuring a single privacy-centric object, the task is to locate, segment, and track the object and its part instances across the duration of the video. Submissions will be evaluated based on the mean Intersection over Union (IoU) score across all test frames. The team that achieves the highest mean IoU score wins the challenge.
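For reference, below is a minimal sketch of how mean IoU could be computed over binary masks, assuming both predictions and ground truth are stored as 0/255 grayscale PNGs with matching file names. It mirrors the evaluation criterion but is not the official evaluation script.

    # Minimal mean-IoU sketch; not the official evaluation code.
    import numpy as np
    from PIL import Image

    def iou(pred_path, gt_path):
        pred = np.array(Image.open(pred_path).convert("L")) > 127   # predicted binary mask
        gt = np.array(Image.open(gt_path).convert("L")) > 127       # ground-truth binary mask
        intersection = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return intersection / union if union > 0 else 1.0

    def mean_iou(pairs):
        # pairs: list of (prediction_path, ground_truth_path) tuples
        return float(np.mean([iou(p, g) for p, g in pairs]))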

Submission Instructions

Evaluation Servers

Teams participating in the challenge must submit results for the test portion of the dataset to our evaluation servers, which are hosted on EvalAI. We created different partitions of the test dataset to support different evaluation purposes:

  • Test-dev: this partition consists of 25% (715 annotated frames across 35 video examples) of the testing dataset and is available year-round. Each team can upload at most ten submissions per day to receive evaluation results.
  • Test-challenge: this partition is available for a limited duration before the Computer Vision and Pattern Recognition (CVPR) conference in June 2025 to support the challenge for the BIVPriv-HIT task and contains all 2,795 annotated test frames across 138 videos in the testing dataset split. Each team can submit at most five results files over the length of the challenge and at most one result per day. The best scoring submitted result for each team will be selected as the team’s final entry for the competition.
  • Test-standard: this partition is available to support algorithm evaluation year-round, and contains all 2,795 annotated test frames across 138 videos in the testing dataset split. Each team can submit at most five results files and at most one result per day. Each team can choose to share their results publicly or keep them private. When shared publicly, the best scoring submitted result will be published on the public leaderboard and will be selected as the team’s final entry for the competition.

Uploading Submissions to Evaluation Servers

To submit results, each team must create a single account on EvalAI. On the platform, click the “Submit” tab, select the submission phase (“test”), select the results file (i.e., ZIP file) to upload, fill in the required metadata about the method, and then click “Submit.” The evaluation server may take several minutes to process the results. To have the submission results appear on the public leaderboard, check the box under “Show on Leaderboard.”

To view the status of a submission, navigate on the EvalAI platform to the “My Submissions” tab and choose the phase to which the results file was uploaded (i.e., “test”). One of the following statuses should be shown: “Failed” or “Finished”. If the status is “Failed”, please check the “Stderr File” for the submission to troubleshoot. If the status is “Finished”, the evaluation successfully completed and the evaluation results can be downloaded. To do so, select “Result File” to retrieve the aggregated score for the submission phase used (i.e., “test”).

Submission Results Formats

Please submit a single ZIP file containing all test results. Each result must be a binary mask in PNG file format. Binary masks must be 720px by 720px to match the ground truth masks for proper evaluation.

Mask images are named using the scheme VizWiz_test_<image_id>.png, where <image_id> matches the corresponding VizWiz image for the mask.

Mask images are encoded as single-channel (grayscale) 8-bit PNGs (to provide lossless compression), where each pixel is either:

  • 0: representing areas that are not the salient object.
  • 255: representing areas that are the salient object.

The foreground region in a mask image may be any size (including the entire image).
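As a starting point, here is a minimal Python sketch that writes a predicted mask in the required format and packages all results into a ZIP file for upload. The directory name "results", the helper "save_mask", and the mask array are illustrative assumptions; the mask is assumed to be a 720 by 720 boolean or 0/1 numpy array.

    # Minimal sketch: saving masks in the required format and zipping them.
    # The directory "results" and helper "save_mask" are illustrative, not official.
    import glob
    import os
    import zipfile

    import numpy as np
    from PIL import Image

    def save_mask(mask, image_id, out_dir="results"):
        # Write a binary mask as a 720x720, single-channel, 8-bit PNG with 0/255 pixels.
        assert mask.shape == (720, 720), "masks must match the 720x720 ground truth"
        img = Image.fromarray(mask.astype(np.uint8) * 255, mode="L")
        img.save(os.path.join(out_dir, f"VizWiz_test_{image_id}.png"))

    # Package all result PNGs into a single ZIP file for upload to EvalAI.
    with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for path in glob.glob("results/*.png"):
            zf.write(path, arcname=os.path.basename(path))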

Leaderboards

The Leaderboard page for the challenge can be found here.

Rules

  • Teams are allowed to use external data to train their algorithms. The only exception is that teams are not allowed to use any annotations of the test dataset.
  • Members of the same team cannot create multiple accounts for a single project to submit more than the maximum number of submissions permitted per team on the test-challenge and test-standard datasets. The only exception is if the person is part of a team publishing more than one paper describing unrelated methods.

Publication

  • Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information
    Neelima Prasad, Jarek Tyler Reynolds, Neel Karsanbhai, Tanusree Sharma, Lotus Zhang, Abigale Stangl, Yang Wang, Leah Findlater, and Danna Gurari. Winter Conference on Applications of Computer Vision (WACV), 2026.

Contact Us

For any questions, comments, or feedback, please contact:

Jarek Reynolds at jare1686@colorado.edu.

Neelima Prasad at nepr1244@colorado.edu.