  •   University of Thessaly Institutional Repository
  • Scientific Publications of University of Thessaly Members (EDPTH)
  • Publications in journals, conferences, book chapters, etc.
  • View Item
Exploring ROI size in deep learning based lipreading

Author: Koumparoulis A., Potamianos G., Mroueh Y., Rennie S.J.
Date: 2017
Language: English
DOI: 10.21437/AVSP.2017-13
Keywords: Face recognition; Image segmentation; Speech processing; Speech recognition; Convolutional neural network; Deep learning; Lipreading; LSTM; Long short-term memory; Performance; Region-of-interest; Regions of interest; Speechreading; State-of-the-art system; Visual speech recognition
Publisher: International Speech Communication Association (ISCA)
Abstract
Automatic speechreading systems have increasingly exploited deep learning advances, resulting in dramatic gains over traditional methods. State-of-the-art systems typically employ convolutional neural networks (CNNs), operating on a video region-of-interest (ROI) that contains the speaker’s mouth. However, little or no attention has been paid to the effects of ROI physical coverage and resolution on the resulting recognition performance within the deep learning framework. In this paper, we investigate such choices for a visual-only speech recognition system based on CNNs and long short-term memory models that we present in detail. Further, we employ a separate CNN to perform face detection and facial landmark localization, driving the ROI extraction process. We conduct experiments on a multi-speaker corpus of connected digits utterances, recorded in ideal visual conditions. Our results show that ROI design choices affect automatic speechreading performance significantly: the best visual-only word error rate (5.07%) corresponds to a ROI that contains a large part of the lower face, in addition to just the mouth, and at a relatively high resolution. Noticeably, the result represents a 27% relative error reduction compared to employing the entire lower face as the ROI. © 2017 14th International Conference on Auditory-Visual Speech Processing, AVSP 2017. All rights reserved.
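The ROI design choices the abstract studies (physical coverage and resolution) can be illustrated with a minimal sketch. This is not the authors' code: the function, its name, and the `scale` parameter are hypothetical stand-ins for how a mouth-centered ROI might be enlarged to include more of the lower face, given landmark coordinates from a facial landmark detector.

```python
# Illustrative sketch only (not from the paper): derive a square ROI
# around mouth landmarks, with `scale` controlling physical coverage.
def mouth_roi(landmarks, scale=1.0):
    """Return (x0, y0, x1, y1) for a square ROI centered on the
    bounding box of the given (x, y) mouth landmarks, enlarged by
    `scale` so the crop can cover more of the lower face."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    cx = (min(xs) + max(xs)) / 2  # center of landmark bounding box
    cy = (min(ys) + max(ys)) / 2
    # half-width of a square crop, scaled to widen the coverage
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2 * scale
    return (cx - half, cy - half, cx + half, cy + half)
```

The resulting crop would then be resized to the CNN input resolution; the paper's finding is that both the coverage (here, `scale`) and that target resolution significantly affect the word error rate.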
URI
http://hdl.handle.net/11615/75304
Collections
  • Publications in journals, conferences, book chapters, etc. [19735]