This paper explores the detection of frame-wise instances of violence in both audio and visual modalities, where only clip-level labels are available. Previous works selected a fixed number of frames for objective optimization to model frame-level features, and applied a straightforward fusion strategy to aggregate audio and visual information. However, these two issues, namely Constant Frames Selection and Vulnerable Fusion, significantly impair the network's detection performance. To address these issues, we present a novel framework called Frame...