Resource-efficient TDNN Architectures for Audio-visual Speech Recognition

In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work, which introduced efficient convolutional neural networks (CNNs) for visual-only speech recognition, by focusing here on the sequence modeling component of the architecture and proposing a novel resource-efficient time-delay neural network (TDNN) that we extend to AVSR. In more detail, we introduce the sTDNN-F module, which combines the factored TDNN (TDNN-F) with grouped fully-connected layers and the shuffle operation. We then develop an AVSR system based on the sTDNN-F, incorporating the efficient CNNs of our earlier work along with other standard visual processing and speech recognition modules. We evaluate our approach on the popular TCD-TIMIT corpus under two speaker-independent training/testing scenarios. Our best sTDNN-F-based AVSR system is 74% more efficient than a traditional TDNN-based one and 35% more efficient than a TDNN-F-based one, while maintaining similar recognition accuracy and noise robustness and significantly outperforming its audio-only counterpart. © 2021 European Signal Processing Conference. All rights reserved.
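As a rough illustration of the two ingredients the abstract names alongside TDNN-F, the following NumPy sketch shows a grouped fully-connected layer and a shuffle operation applied to a sequence of feature frames. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the layer dimensions, and the group count are hypothetical, and the abstract does not specify how sTDNN-F arranges these pieces around the TDNN-F bottleneck or its temporal context.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave features across groups so that the next grouped layer
    sees inputs from every group. x has shape (frames, features)."""
    frames, features = x.shape
    assert features % groups == 0, "features must be divisible by groups"
    x = x.reshape(frames, groups, features // groups)
    x = x.transpose(0, 2, 1)  # swap the group and within-group axes
    return x.reshape(frames, features)

def grouped_linear(x, weights):
    """Grouped fully-connected layer: each group of input features is
    projected by its own weight matrix, independently of other groups.
    weights has shape (groups, in_per_group, out_per_group)."""
    groups, in_per_group, out_per_group = weights.shape
    frames = x.shape[0]
    xg = x.reshape(frames, groups, in_per_group)
    yg = np.einsum("tgi,gio->tgo", xg, weights)
    return yg.reshape(frames, groups * out_per_group)

def grouped_param_count(in_dim, out_dim, groups):
    """Number of weights in a grouped layer (biases ignored)."""
    return groups * (in_dim // groups) * (out_dim // groups)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
frames, in_dim, out_dim, groups = 5, 8, 6, 2
x = rng.standard_normal((frames, in_dim))
w = rng.standard_normal((groups, in_dim // groups, out_dim // groups))

y = channel_shuffle(grouped_linear(x, w), groups)
```

With `groups` groups, a grouped layer holds `1/groups` of the weights of a dense layer with the same input and output sizes, which is the source of the savings; the shuffle then mixes features between groups so that stacked grouped layers do not remain isolated from one another.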

URI

http://hdl.handle.net/11615/75305

Collections

Publications in journals, conferences, book chapters, etc. [19735]

Related items

Deep View2View Mapping for View-Invariant Lipreading

Multimodal fusion and sequence learning for cued speech recognition from videos

Resource-adaptive deep learning for visual speech recognition
