Multimodal fusion and sequence learning for cued speech recognition from videos
Date
2021
Language
en
Keyword
Abstract
Cued Speech (CS) constitutes a non-vocal mode of communication that relies on lip movements in conjunction with hand positional and gestural cues, in order to disambiguate phonetic information and make it accessible to the speech and hearing impaired. In this study, we address the automatic recognition of CS from videos, employing deep learning techniques and extending our earlier work on this topic as follows: First, for visual feature extraction, in addition to hand positioning embeddings and convolutional neural network-based appearance features of the mouth region and signing hand, we consider structural information of the hand and mouth articulators. Specifically, we utilize the OpenPose framework to extract 2D lip keypoints and hand skeletal coordinates of the signer, and we also infer 3D hand skeletal coordinates from the latter, exploiting our earlier work on 2D-to-3D hand-pose regression. Second, we modify the sequence learning model by considering a time-depth separable (TDS) convolution block structure that encodes the fused visual features, in conjunction with a decoder that is based on connectionist temporal classification for phonetic sequence prediction. We investigate the contribution of the above to CS recognition, evaluating our model on a French and a British English CS video dataset, and we report significant gains over the state-of-the-art on both sets. © Springer Nature Switzerland AG 2021.
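The encoder/decoder pairing described in the abstract (a TDS convolutional encoder over fused visual features, followed by a CTC-based phonetic decoder) can be sketched as below. This is a minimal, hedged illustration in PyTorch, not the authors' implementation: the class names, feature dimension (256), phoneme inventory size (40), and block count are illustrative assumptions, and the TDS block here is a simplified variant (temporal convolution plus a position-wise MLP, each with a residual connection and layer normalization).

```python
# Hedged sketch of a TDS-encoder + CTC setup; all dimensions are assumptions.
import torch
import torch.nn as nn

class TDSBlock(nn.Module):
    """Simplified time-depth separable block: a convolution acting only along
    the time axis, then a position-wise two-layer MLP, each followed by a
    residual connection and layer norm."""
    def __init__(self, channels, kernel_size=5, dropout=0.1):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * 2), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(channels * 2, channels))
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):            # x: (batch, time, channels)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + torch.relu(y))
        return self.norm2(x + self.mlp(x))

class CuedSpeechRecognizer(nn.Module):
    """Fused visual features -> stacked TDS blocks -> per-frame phoneme
    log-probabilities, suitable for CTC training."""
    def __init__(self, feat_dim=256, num_phonemes=40, num_blocks=3):
        super().__init__()
        self.encoder = nn.Sequential(
            *[TDSBlock(feat_dim) for _ in range(num_blocks)])
        self.head = nn.Linear(feat_dim, num_phonemes + 1)  # +1 for CTC blank

    def forward(self, feats):        # feats: (batch, time, feat_dim)
        return self.head(self.encoder(feats)).log_softmax(dim=-1)

# One CTC training step on random stand-in data (no real video features).
model = CuedSpeechRecognizer()
feats = torch.randn(2, 50, 256)              # 2 clips, 50 frames each
log_probs = model(feats).transpose(0, 1)     # CTCLoss expects (T, B, C)
targets = torch.randint(1, 41, (2, 12))      # phoneme label sequences
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 50), torch.full((2,), 12))
```

CTC lets the model be trained from frame-level features against phoneme sequences without frame-level alignments, which matches the phonetic sequence prediction task the abstract describes.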
Collections
Related items
Showing items related by title, author, creator and subject.
-
A fully convolutional sequence learning approach for cued speech recognition from videos
Papadimitriou K., Potamianos G. (2021) Cued Speech constitutes a sign-based communication variant for the speech and hearing impaired, which involves visual information from lip movements combined with hand positional and gestural cues. In this paper, we consider ...
-
Resource-efficient TDNN Architectures for Audio-visual Speech Recognition
Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E. (2021) In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work that introduced efficient convolutional neural ...
-
Resource-adaptive deep learning for visual speech recognition
Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E. (2020) We focus on the problem of efficient architectures for lipreading that allow trading-off computational resources for visual speech recognition accuracy. In particular, we make two contributions: First, we introduce ...