Automatic facial expression recognition (FER) plays a crucial role in enabling adaptable, individualized tutoring in affective computer-based learning environments. Although considerable research has sought to improve FER, accurately recognizing spontaneous facial expressions in real e-learning environments remains challenging because these expressions change little in intensity and last only briefly. In this paper, we propose a new dual-modality spatiotemporal feature representation learning for recogniz...