We propose a semantic warping-based video object segmentation (SWVOS) method for the 1st Large-scale Video Object Segmentation Challenge 2018. The method first generates a fine bounding box for each object and then produces accurate pixel-level proposals with a mask-refinement network. More specifically: 1) The mask of an annotated object is warped into other frames. Because the accuracy of the warped masks varies with deformations, occlusions, and intensity changes in the video, we propose a Warping Confidence measure to distinguish reliable warped masks from unreliable ones. 2) Warped masks with high confidence, i.e., those within a fine bounding box, are fed directly into the mask-refinement network to obtain proposal masks. 3) The remaining masks are processed further to obtain fine bounding boxes: those with medium confidence are passed through a coarse-to-fine (CTF) bounding box pipeline, and those with low confidence are re-warped using a semantics-selection (SEMS) method. For the semantics, the Youtube-2018 and COCO datasets are used for training. Additionally, when occlusion occurs in CTF or SEMS, a few frames randomly chosen from the previous frames are used to detect and/or re-warp the object. 4) The multiple proposal masks in each video are merged by taking the warping confidence into account. Experimental results on the large-scale Youtube-2018 dataset demonstrate the effectiveness of the proposed approach.
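The confidence-based routing in steps 2) and 3) and the merging in step 4) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how the warping confidence is computed, what the thresholds are, or how the merge is performed. The sketch therefore assumes that the confidence is already available as a score in [0, 1], uses hypothetical thresholds `HIGH_CONF` and `LOW_CONF`, and substitutes a simple confidence-weighted vote for the merging step. The function names `route_warped_mask` and `merge_proposals` are illustrative only.

```python
import numpy as np

# Hypothetical thresholds; the abstract does not state the actual values.
HIGH_CONF = 0.8
LOW_CONF = 0.4


def route_warped_mask(confidence, high=HIGH_CONF, low=LOW_CONF):
    """Pick a processing branch for one warped mask from its confidence score."""
    if confidence >= high:
        return "refine"  # fed directly into the mask-refinement network
    if confidence >= low:
        return "ctf"     # coarse-to-fine (CTF) bounding box pipeline
    return "sems"        # re-warped with semantics selection (SEMS)


def merge_proposals(masks, confidences, threshold=0.5):
    """Confidence-weighted vote over proposal masks (an assumed merging rule)."""
    masks = np.asarray(masks, dtype=np.float64)           # shape (K, H, W)
    weights = np.asarray(confidences, dtype=np.float64)   # shape (K,)
    if masks.shape[0] != weights.shape[0]:
        raise ValueError("need exactly one confidence per mask")
    total = weights.sum()
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    # Weighted average of the K masks along the proposal axis.
    fused = np.tensordot(weights / total, masks, axes=1)  # shape (H, W)
    return fused >= threshold
```

Under these assumptions, a pixel is kept in the merged mask when the proposals that mark it as foreground carry at least half of the total confidence, so a single high-confidence proposal can outweigh several low-confidence ones.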