Scattering vs. Discrete Cosine Transform Features in Visual Speech Processing
Date
2015
Language
en
Keyword
Abstract
Appearance-based feature extraction constitutes the dominant approach to visual speech representation in a variety of problems, such as automatic speechreading, visual speech detection, and others. To obtain the necessary visual features, a rectangular region-of-interest (ROI) containing the speaker’s mouth is typically extracted first, followed, most commonly, by a discrete cosine transform (DCT) of the ROI pixel values and a feature selection step. Although algorithmically simple and computationally efficient, the approach suffers from the DCT’s lack of invariance to typical ROI deformations, stemming primarily from the speaker’s head pose variability and small tracking inaccuracies. To address this problem, in this paper, the recently introduced scattering transform is investigated as an alternative to the DCT within the appearance-based framework for ROI representation, suitable for visual speech applications. A number of such tasks are considered, namely, visual-only speech activity detection, visual-only and audio-visual sub-phonetic classification, as well as audio-visual speech synchrony detection, all employing deep neural network classifiers with either DCT- or scattering-based visual features. Comparative experiments with the resulting systems are conducted on a large audio-visual corpus of frontal face videos, demonstrating, in all cases, the scattering transform’s superiority over the DCT. © 2015 Auditory-Visual Speech Processing 2015, AVSP 2015, held in conjunction with Facial Analysis and Animation, FAA 2015 - 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015. All rights reserved.
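The DCT-based baseline described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact pipeline: it applies a 2D DCT to a grayscale mouth ROI and keeps the top-left block of low-frequency coefficients as the feature vector, a common selection heuristic; the ROI size and the number of retained coefficients (`k`) are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn

def dct_roi_features(roi, k=8):
    """Compute a 2D DCT of a mouth ROI and keep the top-left k x k
    low-frequency coefficients as an appearance feature vector.
    (Illustrative selection heuristic; the paper's feature selection
    step may differ.)"""
    coeffs = dctn(roi.astype(np.float64), norm="ortho")
    return coeffs[:k, :k].flatten()

# Hypothetical 64x64 grayscale ROI (random stand-in for mouth pixels)
roi = np.random.rand(64, 64)
feat = dct_roi_features(roi, k=8)
print(feat.shape)  # (64,)
```

The resulting 64-dimensional vector would then be fed to a classifier; the scattering transform advocated in the paper replaces `dctn` with a deformation-stable cascade of wavelet modulus operators.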
Collections
Related items
Showing items related by title, author, creator and subject.
- Audio-visual speech recognition using depth information from the Kinect in noisy video conditions
  Galatas, G.; Potamianos, G.; Makedon, F. (2012): In this paper we build on our recent work, where we successfully incorporated facial depth data of a speaker captured by the Microsoft Kinect device, as a third data stream in an audio-visual automatic speech recognizer. ...
- Resource-efficient TDNN Architectures for Audio-visual Speech Recognition
  Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E. (2021): In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work that introduced efficient convolutional neural ...
- Detecting audio-visual synchrony using deep neural networks
  Marcheret E., Potamianos G., Vopicka J., Goel V. (2015): In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not. This is of interest in a wide range of applications, for example ...