Resource-adaptive deep learning for visual speech recognition

Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E.

dc.creator	Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E.	en
dc.date.accessioned	2023-01-31T08:45:26Z
dc.date.available	2023-01-31T08:45:26Z
dc.date.issued	2020
dc.identifier	10.21437/Interspeech.2020-3003
dc.identifier.issn	2308457X
dc.identifier.uri	http://hdl.handle.net/11615/75307
dc.description.abstract	We focus on the problem of efficient architectures for lipreading that allow trading off computational resources for visual speech recognition accuracy. In particular, we make two contributions. First, we introduce MobiLipNetV3, an efficient and accurate lipreading model, based on our earlier work on MobiLipNetV2 and incorporating recent advances in convolutional neural network architectures. Second, we propose a novel recognition paradigm, called MultiRate Ensemble (MRE), that combines a “lean” and a “full” MobiLipNetV3 in the lipreading pipeline, with the latter applied at a lower frame rate. This architecture yields a family of systems offering multiple accuracy vs. efficiency operating points depending on the frame-rate decimation of the “full” model, thus allowing adaptation to the available device resources. We evaluate our approach on the TCD-TIMIT corpus, popular in speaker-independent lipreading of continuous speech. The proposed MRE family of systems can be up to 73 times more efficient than residual neural network based lipreading, and up to twice as efficient as MobiLipNetV2, while in both cases reaching up to 8% absolute WER reduction, depending on the chosen MRE operating point. For example, a temporal decimation of three yields a 7% absolute WER reduction and a 26% relative decrease in computations over MobiLipNetV2. © 2020 ISCA	en
dc.language.iso	en	en
dc.source	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH	en
dc.source.uri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85098111271&doi=10.21437%2fInterspeech.2020-3003&partnerID=40&md5=3756544ba0d426bed39827899c5d31fa
dc.subject	Convolutional neural networks	en
dc.subject	Deep learning	en
dc.subject	Network architecture	en
dc.subject	Speech communication	en
dc.subject	Computational resources	en
dc.subject	Continuous speech	en
dc.subject	Device resources	en
dc.subject	Efficient architecture	en
dc.subject	Frame rate	en
dc.subject	Operating points	en
dc.subject	Speaker independents	en
dc.subject	Visual speech recognition	en
dc.subject	Speech recognition	en
dc.subject	International Speech Communication Association	en
dc.title	Resource-adaptive deep learning for visual speech recognition	en
dc.type	conferenceItem	en
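
The abstract describes a "full" model applied at a lower frame rate than the "lean" one. A minimal sketch of how the frame-rate decimation could trade computation for accuracy follows. It assumes that the lean model runs on every frame, that the full model runs on every d-th frame, and that per-frame costs simply add (any cost of combining the two outputs is ignored). The function names and the cost values in the usage note are hypothetical and are not taken from the paper.

```python
def full_model_frames(num_frames: int, decimation: int) -> list[int]:
    """Indices of the frames on which the "full" model is evaluated.

    Assumption: the full model is applied to every `decimation`-th frame,
    starting from frame 0.
    """
    if decimation < 1:
        raise ValueError("decimation must be >= 1")
    return list(range(0, num_frames, decimation))


def average_cost_per_frame(lean_cost: float, full_cost: float, decimation: int) -> float:
    """Average compute per input frame for a lean + decimated-full pair.

    Assumption: the lean model runs on every frame, the full model on one
    frame out of `decimation`, and the two costs are additive.
    """
    if decimation < 1:
        raise ValueError("decimation must be >= 1")
    return lean_cost + full_cost / decimation
```

With hypothetical costs of 1.0 (lean) and 4.0 (full) units per frame, average_cost_per_frame(1.0, 4.0, 2) gives 3.0, against 5.0 with no decimation. Under these assumptions, a larger decimation factor moves the system toward the lean model's cost, which is the kind of operating-point family the abstract refers to.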

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following collections

Publications in journals, conferences, book chapters, etc. [19735]

Related items

ATHENA: A Greek multi-sensory database for home automation control

Audio-visual speech recognition using depth information from the Kinect in noisy video conditions

Multi-room speech activity detection using a distributed microphone network in domestic environments
