Visual Dialog aims to generate an appropriate response based on a multi-round dialog history and a given image. Existing methods either focus on semantic interaction or only implicitly capture coarse-grained structural interaction (e.g., pronoun co-references). Fine-grained, explicit structural interaction features for the dialog history are seldom explored, which leads to insufficient feature learning and difficulty in capturing precise context. To address these issues, we propose a structure-aware dual-level graph interactive network (SDGIN) that integrates verb-specific semantic roles and co-refer...