Our method consists of three steps: image-based instance segmentation,
multiple object tracking, and score reranking.

For image-based instance segmentation, we use Cascade R-CNN with
HRNetV2p-W40 as the backbone.
External datasets used for training: COCO 2017 (excluding the person
category and categories that do not appear in YouTube-VOS), OVIS, and
Open Images (excluding the person category and categories that do not
appear in YouTube-VOS).
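
The category filtering described above can be illustrated with a minimal sketch. The annotation layout (a list of dicts with a "category" key) and the function name filter_annotations are assumptions made for illustration; the actual dataset tooling is not described in the text, and in practice category names would first need to be mapped between datasets.

```python
def filter_annotations(annotations, target_categories, excluded=("person",)):
    """Keep annotations whose category is in the target set and not excluded.

    `annotations` is assumed to be a list of dicts with a "category" key;
    this layout is hypothetical and only serves the illustration.
    """
    allowed = set(target_categories) - set(excluded)
    return [ann for ann in annotations if ann["category"] in allowed]


# Toy example: "toaster" is dropped because it is not a target category,
# and "person" is dropped because it is explicitly excluded.
external = [
    {"id": 1, "category": "person"},
    {"id": 2, "category": "dog"},
    {"id": 3, "category": "toaster"},
]
kept = filter_annotations(external, target_categories={"person", "dog", "cat"})
```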

Multi-scale testing: four scales, [(2000, 1200), (1400, 1000),
(1400, 800), (1400, 600)].
Training strategies: multi-scale training, cosine annealing of the learning rate, etc.
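
The two settings above can be made concrete with a small sketch. It assumes that each test scale is read as (long edge, short edge) limits and that images are resized with their aspect ratio preserved, a convention common in detection toolkits but not stated in the text. The cosine annealing function is the standard schedule; the base and minimum learning rates are hypothetical parameters, since no values are given.

```python
import math

# The four test scales listed in the text.
TEST_SCALES = [(2000, 1200), (1400, 1000), (1400, 800), (1400, 600)]


def rescaled_size(width, height, scale):
    """Resize (width, height) to fit within `scale`, keeping the aspect ratio.

    Assumption: the larger value in `scale` bounds the long image edge and
    the smaller value bounds the short image edge.
    """
    long_edge, short_edge = max(scale), min(scale)
    factor = min(long_edge / max(width, height),
                 short_edge / min(width, height))
    return int(width * factor + 0.5), int(height * factor + 0.5)


def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Standard cosine annealing: decay from base_lr to min_lr over total_steps."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos_term


# Sizes a 1280x720 frame would take at each test scale under this convention.
sizes = [rescaled_size(1280, 720, s) for s in TEST_SCALES]
```

At test time, predictions from the resized copies would then be merged; how they are merged is not specified in the text.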

For multiple object tracking, we use mask IoU matching, box IoU
matching, and feature matching. We match the detections of the current
frame with those of the previous five frames.
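
As a rough illustration of the matching step, the sketch below links current-frame detections to tracks using box IoU only, looking back over the previous five frames. It is a simplification under stated assumptions: the greedy assignment, the IoU threshold of 0.5, and the Track structure are hypothetical, and the text does not say how the mask IoU, box IoU, and feature cues are combined.

```python
from dataclasses import dataclass, field

MAX_GAP = 5          # look back over the previous five frames (from the text)
IOU_THRESHOLD = 0.5  # hypothetical value; the text gives no threshold


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    track_id: int
    boxes: dict = field(default_factory=dict)  # frame index -> box


def update_tracks(tracks, detections, frame_idx):
    """Greedily assign current-frame boxes to tracks seen in the last MAX_GAP frames."""
    candidates = []
    for t_idx, track in enumerate(tracks):
        # Boxes of this track from the previous MAX_GAP frames.
        recent = [box for f, box in track.boxes.items()
                  if 1 <= frame_idx - f <= MAX_GAP]
        if not recent:
            continue
        for d_idx, det in enumerate(detections):
            best = max(box_iou(box, det) for box in recent)
            if best >= IOU_THRESHOLD:
                candidates.append((best, t_idx, d_idx))

    # Highest-IoU pairs first; each track and detection is used at most once.
    candidates.sort(reverse=True)
    used_tracks, used_dets = set(), set()
    for _, t_idx, d_idx in candidates:
        if t_idx in used_tracks or d_idx in used_dets:
            continue
        tracks[t_idx].boxes[frame_idx] = detections[d_idx]
        used_tracks.add(t_idx)
        used_dets.add(d_idx)

    # Unmatched detections start new tracks.
    for d_idx, det in enumerate(detections):
        if d_idx not in used_dets:
            tracks.append(Track(track_id=len(tracks), boxes={frame_idx: det}))
    return tracks
```

Mask IoU and appearance-feature similarity could be plugged in by replacing or augmenting box_iou with the corresponding affinity; an optimal assignment (for example, the Hungarian algorithm) could replace the greedy loop.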

For score reranking, we define the trajectory score as a function of the trajectory length.
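
The exact scoring formula is not given here. Purely as an illustration of a length-dependent trajectory score, the sketch below scales the mean per-frame confidence by the fraction of the video that the trajectory covers. Both the formula and the name rerank_score are hypothetical and should not be read as the method actually used.

```python
def rerank_score(frame_scores, video_length):
    """Hypothetical length-aware trajectory score.

    frame_scores: per-frame detection confidences of one trajectory.
    video_length: number of frames in the video.
    The mean confidence is scaled by the covered fraction of the video,
    so short, spurious trajectories are ranked lower.
    """
    if not frame_scores or video_length <= 0:
        return 0.0
    mean_conf = sum(frame_scores) / len(frame_scores)
    length_ratio = min(1.0, len(frame_scores) / video_length)
    return mean_conf * length_ratio


# Two trajectories with the same per-frame confidence: the longer one ranks higher.
short_track = rerank_score([0.8, 0.8], video_length=4)
long_track = rerank_score([0.8, 0.8, 0.8, 0.8], video_length=4)
```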