Proposal, Tracking & Segmentation (PTS): A cascaded network for video object segmentation

We propose a novel video object segmentation architecture that can capture long-term temporal information and handle scale variation of objects. The architecture consists of three components: a Region Proposal Network, an Object Tracking Network, and a Reference-Guided Segmentation Network. The Region Proposal Network is pre-trained on the COCO dataset and provides candidate object boxes. Inspired by MDNet, the Object Tracking Network scores the candidate boxes and is updated online to adapt to large and fast changes in object appearance. The highest-scoring box is then used to crop and resize the frame, which normalizes the scale variation of objects. Finally, the Reference-Guided Segmentation Network uses both the cropped region (together with the previous mask) and the reference frame to segment the target object. Besides the YouTube-VOS dataset, we used the COCO dataset to pre-train the Region Proposal Network in this competition.
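The box-selection and scale-normalization step can be illustrated with a minimal NumPy sketch. It assumes boxes in [x0, y0, x1, y1] pixel coordinates, a hypothetical context margin `context=1.5` around the chosen box, a hypothetical fixed crop size of 256×256, and nearest-neighbour resizing; none of these values come from the description above. The function names are hypothetical, the tracking and segmentation networks themselves are omitted, and `paste_back` is an assumed helper showing how a mask predicted on the crop could be mapped back to frame coordinates.

```python
import numpy as np


def select_best_box(boxes, scores):
    """Return the candidate box with the highest tracker score.

    boxes: (N, 4) array-like of [x0, y0, x1, y1] in pixels; scores: (N,).
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return boxes[int(np.argmax(scores))]


def expand_box(box, context, height, width):
    """Enlarge the box about its centre by `context` (hypothetical margin),
    clip it to the image, and return integer (x0, y0, x1, y1), x1/y1 exclusive.
    Assumes a valid, non-empty box that lies inside the frame."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * context / 2
    half_h = (y1 - y0) * context / 2
    nx0 = max(0, int(np.floor(cx - half_w)))
    ny0 = max(0, int(np.floor(cy - half_h)))
    nx1 = min(width, int(np.ceil(cx + half_w)))
    ny1 = min(height, int(np.ceil(cy + half_h)))
    return nx0, ny0, nx1, ny1


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, ...) array, sampling pixel centres."""
    h, w = img.shape[:2]
    rows = ((np.arange(out_h) + 0.5) * h / out_h).astype(int)
    cols = ((np.arange(out_w) + 0.5) * w / out_w).astype(int)
    return img[rows][:, cols]


def crop_and_normalize(frame, prev_mask, boxes, scores, size=(256, 256), context=1.5):
    """Crop the frame and previous mask around the best-scoring box and
    resize both to a fixed size, so the object appears at a similar scale."""
    h, w = frame.shape[:2]
    box = select_best_box(boxes, scores)
    x0, y0, x1, y1 = expand_box(box, context, h, w)
    crop = resize_nearest(frame[y0:y1, x0:x1], *size)
    mask_crop = resize_nearest(prev_mask[y0:y1, x0:x1], *size)
    return crop, mask_crop, (x0, y0, x1, y1)


def paste_back(mask_crop, window, height, width):
    """Hypothetical inverse step: resize a crop-space mask back into the
    crop window and place it in an otherwise empty full-frame mask."""
    x0, y0, x1, y1 = window
    full = np.zeros((height, width), dtype=mask_crop.dtype)
    full[y0:y1, x0:x1] = resize_nearest(mask_crop, y1 - y0, x1 - x0)
    return full
```

For example, with a 480×640 frame and a best-scoring box [300, 100, 400, 200], the default 1.5× margin gives the crop window (275, 75, 425, 225), which is upsampled to 256×256. Sampling at pixel centres means that an upsampled mask maps back to its original window exactly; a practical pipeline would more likely use bilinear interpolation for the image crop.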