dc.creator: Koumparoulis A., Potamianos G., Mroueh Y., Rennie S.J. (en)
dc.date.accessioned: 2023-01-31T08:45:25Z
dc.date.available: 2023-01-31T08:45:25Z
dc.date.issued: 2017
dc.identifier: 10.21437/AVSP.2017-13
dc.identifier.uri: http://hdl.handle.net/11615/75304
dc.description.abstract: Automatic speechreading systems have increasingly exploited deep learning advances, resulting in dramatic gains over traditional methods. State-of-the-art systems typically employ convolutional neural networks (CNNs), operating on a video region-of-interest (ROI) that contains the speaker’s mouth. However, little or no attention has been paid to the effects of ROI physical coverage and resolution on the resulting recognition performance within the deep learning framework. In this paper, we investigate such choices for a visual-only speech recognition system based on CNNs and long short-term memory models that we present in detail. Further, we employ a separate CNN to perform face detection and facial landmark localization, driving the ROI extraction process. We conduct experiments on a multi-speaker corpus of connected digits utterances, recorded in ideal visual conditions. Our results show that ROI design choices affect automatic speechreading performance significantly: the best visual-only word error rate (5.07%) corresponds to a ROI that contains a large part of the lower face, in addition to just the mouth, and at a relatively high resolution. Noticeably, the result represents a 27% relative error reduction compared to employing the entire lower face as the ROI. © 2017 14th International Conference on Auditory-Visual Speech Processing, AVSP 2017. All rights reserved. (en)
dc.language.iso: en (en)
dc.source: 14th International Conference on Auditory-Visual Speech Processing, AVSP 2017 (en)
dc.source.uri: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85133474755&doi=10.21437%2fAVSP.2017-13&partnerID=40&md5=44329219486105347b83fed928725790
dc.subject: Face recognition (en)
dc.subject: Image segmentation (en)
dc.subject: Speech processing (en)
dc.subject: Speech recognition (en)
dc.subject: Convolutional neural network (en)
dc.subject: Deep learning (en)
dc.subject: Lipreading (en)
dc.subject: LSTM (en)
dc.subject: Performance (en)
dc.subject: Region-of-interest (en)
dc.subject: Regions of interest (en)
dc.subject: Speechreading (en)
dc.subject: State-of-the-art system (en)
dc.subject: Visual speech recognition (en)
dc.subject: Long short-term memory (en)
dc.subject: The International Society for Computers and Their Applications (ISCA) (en)
dc.title: Exploring ROI size in deep learning based lipreading (en)
dc.type: conferenceItem (en)
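
Note on the figures in the abstract: the 27% is a relative error reduction, not a difference in percentage points. The sketch below is only an illustration, assuming the usual definition (baseline minus improved, divided by baseline). The function names are hypothetical, and the baseline word error rate it derives is implied by the two quoted numbers rather than stated in the record.

```python
def relative_error_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the error rate drops, relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


def implied_baseline(improved_wer: float, reduction: float) -> float:
    """Baseline error rate implied by an improved rate and a relative reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return improved_wer / (1.0 - reduction)


best_wer = 5.07   # percent, best visual-only WER quoted in the abstract
reduction = 0.27  # relative reduction quoted in the abstract (a rounded figure)

# Assumption: the 27% is measured against the "entire lower face" ROI result.
baseline = implied_baseline(best_wer, reduction)
print(f"implied baseline WER: about {baseline:.1f}%")
```

Because the quoted reduction is rounded, the derived baseline (just under 7%) is approximate and should not be read as a reported result.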


Files in this item

There are no files associated with this item.
