Scattering vs. Discrete Cosine Transform Features in Visual Speech Processing
Date
2015
Language
en
Keyword
Abstract
Appearance-based feature extraction constitutes the dominant approach to visual speech representation in a variety of problems, such as automatic speechreading, visual speech detection, and others. To obtain the necessary visual features, a rectangular region-of-interest (ROI) containing the speaker’s mouth is typically extracted first, followed, most commonly, by a discrete cosine transform (DCT) of the ROI pixel values and a feature selection step. Although algorithmically simple and computationally efficient, the approach suffers from the DCT’s lack of invariance to typical ROI deformations, stemming primarily from the speaker’s head pose variability and small tracking inaccuracies. To address this problem, in this paper, the recently introduced scattering transform is investigated as an alternative to the DCT within the appearance-based framework for ROI representation, suitable for visual speech applications. A number of such tasks are considered, namely, visual-only speech activity detection, visual-only and audio-visual sub-phonetic classification, as well as audio-visual speech synchrony detection, all employing deep neural network classifiers with either DCT- or scattering-based visual features. Comparative experiments with the resulting systems are conducted on a large audio-visual corpus of frontal face videos, demonstrating, in all cases, the scattering transform’s superiority over the DCT. © 2015 Auditory-Visual Speech Processing 2015, AVSP 2015, held in conjunction with Facial Analysis and Animation, FAA 2015 - 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015. All rights reserved.
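The DCT-based baseline described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact pipeline: it applies a 2D DCT to a grayscale mouth ROI and keeps the top-left block of low-frequency coefficients as the feature vector, a common selection heuristic; the ROI size and the number of retained coefficients (`k`) are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn

def dct_roi_features(roi, k=8):
    """Compute a 2D DCT of a mouth ROI and keep the top-left k x k
    low-frequency coefficients as an appearance feature vector.
    (Illustrative selection heuristic; the paper's feature selection
    step may differ.)"""
    coeffs = dctn(roi.astype(np.float64), norm="ortho")
    return coeffs[:k, :k].flatten()

# Hypothetical 64x64 grayscale ROI (random stand-in for mouth pixels)
roi = np.random.rand(64, 64)
feat = dct_roi_features(roi, k=8)
print(feat.shape)  # (64,)
```

The resulting 64-dimensional vector would then be fed to a classifier; the scattering transform advocated in the paper replaces `dctn` with a deformation-stable cascade of wavelet modulus operators.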
Collections
Related items
Showing items related by title, author, creator and subject.
- Audio-visual speech recognition using depth information from the Kinect in noisy video conditions
  Galatas, G.; Potamianos, G.; Makedon, F. (2012): In this paper we build on our recent work, where we successfully incorporated facial depth data of a speaker captured by the Microsoft Kinect device, as a third data stream in an audio-visual automatic speech recognizer. ...
- Resource-efficient TDNN Architectures for Audio-visual Speech Recognition
  Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E. (2021): In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work that introduced efficient convolutional neural ...
- Detecting audio-visual synchrony using deep neural networks
  Marcheret E., Potamianos G., Vopicka J., Goel V. (2015): In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not. This is of interest in a wide range of applications, for example ...