Scattering vs. Discrete Cosine Transform Features in Visual Speech Processing

Marcheret E., Potamianos G., Vopicka J., Goel V.

dc.creator	Marcheret E., Potamianos G., Vopicka J., Goel V.	en
dc.date.accessioned	2023-01-31T08:57:18Z
dc.date.available	2023-01-31T08:57:18Z
dc.date.issued	2015
dc.identifier.uri	http://hdl.handle.net/11615/76340
dc.description.abstract	Appearance-based feature extraction constitutes the dominant approach for visual speech representation in a variety of problems, such as automatic speechreading, visual speech detection, and others. To obtain the necessary visual features, typically a rectangular region-of-interest (ROI) containing the speaker’s mouth is first extracted, followed, most commonly, by a discrete cosine transform (DCT) of the ROI pixel values and a feature selection step. The approach, although algorithmically simple and computationally efficient, suffers from lack of DCT invariance to typical ROI deformations, stemming, primarily, from speaker’s head pose variability and small tracking inaccuracies. To address the problem, in this paper, the recently introduced scattering transform is investigated as an alternative to DCT within the appearance-based framework for ROI representation, suitable for visual speech applications. A number of such tasks are considered, namely, visual-only speech activity detection, visual-only and audio-visual sub-phonetic classification, as well as audio-visual speech synchrony detection, all employing deep neural network classifiers with either DCT or scattering-based visual features. Comparative experiments of the resulting systems are conducted on a large audio-visual corpus of frontal face videos, demonstrating, in all cases, the scattering transform superiority over the DCT. © 2015 Auditory-Visual Speech Processing 2015, AVSP 2015, held in conjunction with Facial Analysis and Animation, FAA 2015 - 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015. All rights reserved.	en
dc.language.iso	en	en
dc.source	Auditory-Visual Speech Processing 2015, AVSP 2015, held in conjunction with Facial Analysis and Animation, FAA 2015 - 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015	en
dc.source.uri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85016058046&partnerID=40&md5=1b72cfc040a8fc66cdbcc0776f93f4a6
dc.subject	Audio systems	en
dc.subject	Deep neural networks	en
dc.subject	Feature extraction	en
dc.subject	Image segmentation	en
dc.subject	Speech processing	en
dc.subject	Speech recognition	en
dc.subject	Audio-visual	en
dc.subject	Audio-visual synchrony	en
dc.subject	Automatic speechreading	en
dc.subject	Region-of-interest	en
dc.subject	Regions of interest	en
dc.subject	Scattering transforms	en
dc.subject	Speech activity detections	en
dc.subject	Speechreading	en
dc.subject	Visual speech	en
dc.subject	Visual speech activity detection	en
dc.subject	Discrete cosine transforms	en
dc.subject	The International Society for Computers and Their Applications (ISCA)	en
dc.title	Scattering vs. Discrete Cosine Transform Features in Visual Speech Processing	en
dc.type	conferenceItem	en
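
Illustrative sketch of the DCT baseline described in the abstract (ROI pixel values, then a 2-D DCT, then a feature selection step). The record does not specify the ROI size, the DCT normalization, or the selection rule, so everything below is an assumption made for illustration: the function names (dct_1d, dct_2d, select_low_frequency, dct_features) are hypothetical, the transform is an orthonormal DCT-II, and "feature selection" is approximated by keeping the lowest-frequency coefficients. It is not the authors' implementation.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence (assumed normalization)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(roi):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in roi]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # Transpose back so result[u][v] is (vertical, horizontal) frequency.
    return [list(r) for r in zip(*cols)]


def select_low_frequency(coeffs, num_features):
    """Keep the num_features coefficients with the smallest u + v.

    This low-frequency ordering is an assumed stand-in for the
    unspecified feature selection step.
    """
    h, w = len(coeffs), len(coeffs[0])
    order = sorted(
        ((u, v) for u in range(h) for v in range(w)),
        key=lambda p: (p[0] + p[1], p[0]),
    )
    return [coeffs[u][v] for u, v in order[:num_features]]


def dct_features(roi, num_features=10):
    """ROI (list of rows of pixel values) -> short DCT feature vector."""
    return select_low_frequency(dct_2d(roi), num_features)


# Hypothetical 4x4 mouth ROI with a single bright pixel, and the same
# ROI shifted one pixel to the right (a small tracking error).
roi = [[0.0] * 4 for _ in range(4)]
roi[1][1] = 1.0
shifted = [[0.0] * 4 for _ in range(4)]
shifted[1][2] = 1.0

features = dct_features(roi, num_features=6)
features_shifted = dct_features(shifted, num_features=6)
```

Comparing features with features_shifted shows that a one-pixel shift of the ROI changes the retained coefficients, which is the kind of sensitivity to small ROI deformations that motivates the comparison with the scattering transform.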

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following collections

Publications in journals, conferences, book chapters, etc. [19705]

Related items

Audio-visual speech recognition using depth information from the Kinect in noisy video conditions

Resource-efficient TDNN Architectures for Audio-visual Speech Recognition

Detecting audio-visual synchrony using deep neural networks