Resource-efficient TDNN Architectures for Audio-visual Speech Recognition
Date
2021Language
en
Keyword
Abstract
In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work that introduced efficient convolutional neural networks (CNNs) for visual-only speech recognition, by focusing here on the sequence modeling component of the architecture, proposing a novel resource-efficient time-delay neural network (TDNN) that we extend for AVSR. In more detail, we introduce the sTDNN-F module, which combines the factored TDNN (TDNN-F) with grouped fully-connected layers and the shuffle operation. We then develop an AVSR system based on the sTDNN-F, incorporating the efficient CNNs of our earlier work and other standard visual processing and speech recognition modules. We evaluate our approach on the popular TCD-TIMIT corpus, under two speaker-independent training/testing scenarios. Our best sTDNN-F based AVSR system turns out 74% more efficient than a traditional TDNN one and 35% more efficient than TDNN-F, while maintaining similar recognition accuracy and noise robustness, and also significantly outperforming its audio-only counterpart. © 2021 European Signal Processing Conference. All rights reserved.
Collections
Related items
Showing items related by title, author, creator and subject.
-
Deep View2View Mapping for View-Invariant Lipreading
Koumparoulis A., Potamianos G. (2019)Recently, visual-only and audio-visual speech recognition have made significant progress thanks to deep-learning based, trainable visual front-ends (VFEs), with most research focusing on frontal or near-frontal face videos. ... -
Multimodal fusion and sequence learning for cued speech recognition from videos
Papadimitriou K., Parelli M., Sapountzaki G., Pavlakos G., Maragos P., Potamianos G. (2021)Cued Speech (CS) constitutes a non-vocal mode of communication that relies on lip movements in conjunction with hand positional and gestural cues, in order to disambiguate phonetic information and make it accessible to the ... -
Resource-adaptive deep learning for visual speech recognition
Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E. (2020)We focus on the problem of efficient architectures for lipreading that allow trading-off computational resources for visual speech recognition accuracy. In particular, we make two contributions: First, we introduce ...