Moment Retrieval (MR) and Highlight Detection (HD) share the goal of quickly obtaining the required content from a video according to user needs, so several works have exploited this commonality to design transformer-based networks for joint MR and HD. Although these methods achieve impressive performance, they still face three problems: a) semantic gaps across different modalities; b) varying durations of query-relevant moments and highlights; c) smooth transitions among diverse events. To this end, we propose a Cross-modal Multiscale D...