Resource-efficient TDNN Architectures for Audio-visual Speech Recognition

Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E.

dc.creator	Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E.	en
dc.date.accessioned	2023-01-31T08:45:25Z
dc.date.available	2023-01-31T08:45:25Z
dc.date.issued	2021
dc.identifier	10.23919/EUSIPCO54536.2021.9616215
dc.identifier.isbn	9789082797060
dc.identifier.issn	22195491
dc.identifier.uri	http://hdl.handle.net/11615/75305
dc.description.abstract	In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work that introduced efficient convolutional neural networks (CNNs) for visual-only speech recognition, by focusing here on the sequence modeling component of the architecture, proposing a novel resource-efficient time-delay neural network (TDNN) that we extend for AVSR. In more detail, we introduce the sTDNN-F module, which combines the factored TDNN (TDNN-F) with grouped fully-connected layers and the shuffle operation. We then develop an AVSR system based on the sTDNN-F, incorporating the efficient CNNs of our earlier work and other standard visual processing and speech recognition modules. We evaluate our approach on the popular TCD-TIMIT corpus, under two speaker-independent training/testing scenarios. Our best sTDNN-F based AVSR system turns out to be 74% more efficient than a traditional TDNN one and 35% more efficient than TDNN-F, while maintaining similar recognition accuracy and noise robustness, and also significantly outperforming its audio-only counterpart. © 2021 European Signal Processing Conference. All rights reserved.	en
dc.language.iso	en	en
dc.source	European Signal Processing Conference	en
dc.source.uri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85123160449&doi=10.23919%2fEUSIPCO54536.2021.9616215&partnerID=40&md5=450341a7da64e25d2560fcb5babbe19a
dc.subject	Audio acoustics	en
dc.subject	Convolutional neural networks	en
dc.subject	Network architecture	en
dc.subject	Speech recognition	en
dc.subject	Audio-visual	en
dc.subject	Audio-visual automatic speech recognition	en
dc.subject	Audiovisual speech recognition	en
dc.subject	Automatic speech recognition	en
dc.subject	Automatic speech recognition system	en
dc.subject	Convolutional neural network	en
dc.subject	MobiLipNet	en
dc.subject	Neural network architecture	en
dc.subject	Resource-efficient	en
dc.subject	Time delay neural networks	en
dc.subject	Computational efficiency	en
dc.subject	European Signal Processing Conference, EUSIPCO	en
dc.title	Resource-efficient TDNN Architectures for Audio-visual Speech Recognition	en
dc.type	conferenceItem	en
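
The abstract describes the sTDNN-F module as combining the factored TDNN with grouped fully-connected layers and a shuffle operation. The following is a minimal sketch of those two generic building blocks only, not the authors' implementation: the function names, the layer sizes, and the ShuffleNet-style interleaving used for the shuffle are all assumptions made for illustration, and the record itself gives no such details.

```python
# Illustrative sketch only: names, sizes, and the shuffle layout are assumptions,
# not taken from the paper described in this record.

def grouped_fc_params(d_in, d_out, groups):
    """Weight count of a fully-connected layer split into independent groups."""
    if d_in % groups or d_out % groups:
        raise ValueError("d_in and d_out must be divisible by groups")
    # Each group maps d_in/groups inputs to d_out/groups outputs.
    return groups * (d_in // groups) * (d_out // groups)


def grouped_fc(x, weights):
    """Apply one weight matrix per group to that group's slice of x.

    weights[g] is a list of rows; each row has len(x) // len(weights) entries.
    """
    groups = len(weights)
    if len(x) % groups:
        raise ValueError("len(x) must be divisible by the number of groups")
    n = len(x) // groups
    out = []
    for g, w in enumerate(weights):
        chunk = x[g * n:(g + 1) * n]
        out.extend(sum(wi * xi for wi, xi in zip(row, chunk)) for row in w)
    return out


def channel_shuffle(x, groups):
    """Interleave channels across groups (view as groups x n, transpose, flatten)."""
    if len(x) % groups:
        raise ValueError("len(x) must be divisible by groups")
    n = len(x) // groups
    return [x[g * n + i] for i in range(n) for g in range(groups)]


if __name__ == "__main__":
    # Hypothetical 512 -> 512 layer: dense versus 4 groups.
    print(grouped_fc_params(512, 512, 1), grouped_fc_params(512, 512, 4))
    # Shuffling lets the next grouped layer see channels from every group.
    print(channel_shuffle([0, 1, 2, 3, 4, 5], 2))
```

Splitting a layer into g groups divides its weight count by g, but each group then sees only its own slice of the input; shuffling the channels between grouped layers is the usual way to let information mix across groups.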

This item appears in the following collections

Publications in journals, conferences, book chapters, etc. [19705]

Related items

Deep View2View Mapping for View-Invariant Lipreading

Multimodal fusion and sequence learning for cued speech recognition from videos

Resource-adaptive deep learning for visual speech recognition
