Audio-visual speech recognition incorporating facial depth information captured by the Kinect

Galatas, G.; Potamianos, G.; Makedon, F.

dc.creator	Galatas, G.	en
dc.creator	Potamianos, G.	en
dc.creator	Makedon, F.	en
dc.date.accessioned	2015-11-23T10:26:54Z
dc.date.available	2015-11-23T10:26:54Z
dc.date.issued	2012
dc.identifier.isbn	9781467310680
dc.identifier.issn	22195491
dc.identifier.uri	http://hdl.handle.net/11615/27630
dc.description.abstract	We investigate the use of facial depth data of a speaking subject, captured by the Kinect device, as an additional speech-informative modality to incorporate into a traditional audio-visual automatic speech recognizer. We present our feature extraction algorithm for both the visual and the accompanying depth modalities, based on a discrete cosine transform of the mouth region-of-interest data, further transformed by a two-stage linear discriminant analysis projection to incorporate speech dynamics and improve classification. For automatic speech recognition utilizing the three available data streams (audio, visual, and depth), we consider both the feature and decision fusion paradigms, the latter via a state-synchronous tri-stream hidden Markov model. We report multi-speaker recognition results on a small-vocabulary task employing our recently collected bilingual audio-visual corpus with depth information, demonstrating improved recognition performance through the addition of the proposed depth stream, across a wide range of audio conditions. © 2012 EURASIP.	en
dc.source.uri	http://www.scopus.com/inward/record.url?eid=2-s2.0-84869749157&partnerID=40&md5=6eedbff65ffa3b88447d581ff8268238
dc.subject	Audio-visual automatic speech recognition	en
dc.subject	depth information	en
dc.subject	linear discriminant analysis	en
dc.subject	Microsoft Kinect	en
dc.subject	multi-sensory fusion	en
dc.subject	Automatic speech recognition	en
dc.subject	Microsoft	en
dc.subject	Discrete cosine transforms	en
dc.subject	Hidden Markov models	en
dc.subject	Signal processing	en
dc.subject	Speech processing	en
dc.subject	Speech recognition	en
dc.title	Audio-visual speech recognition incorporating facial depth information captured by the Kinect	en
dc.type	conferenceItem	en
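
The abstract names two building blocks: a discrete cosine transform of the mouth region-of-interest, and decision fusion across the audio, visual, and depth streams. The sketch below illustrates both in a minimal form and is not the authors' implementation. The function names, the 4 x 4 block of retained coefficients, and the stream weights are assumptions made for illustration. The two-stage linear discriminant analysis projection is omitted. The fusion shown is a weighted sum of per-stream state log-likelihoods, a common formulation for multi-stream hidden Markov models that the record itself does not spell out.

```python
import numpy as np


def dct_matrix(n):
    """Return the orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale factor
    return m


def roi_dct_features(roi, keep=4):
    """2-D DCT of a mouth ROI, keeping the keep x keep low-frequency block.

    `keep` is a hypothetical parameter; the record does not state how
    many coefficients are retained.
    """
    roi = np.asarray(roi, dtype=float)
    rows, cols = roi.shape
    coeffs = dct_matrix(rows) @ roi @ dct_matrix(cols).T
    return coeffs[:keep, :keep].ravel()


def fuse_log_likelihoods(log_liks, weights):
    """Weighted sum of per-stream state log-likelihoods.

    log_liks: array of shape (streams, states)
    weights:  array of shape (streams,), assumed stream weights
    """
    log_liks = np.asarray(log_liks, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ log_liks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_roi = rng.random((16, 16))  # stand-in for a grayscale mouth ROI
    depth_roi = rng.random((16, 16))  # stand-in for the matching depth ROI
    print(roi_dct_features(video_roi).shape, roi_dct_features(depth_roi).shape)

    # Rows: audio, visual, depth; columns: two candidate HMM states.
    scores = [[-1.0, -2.0], [-3.0, -4.0], [-5.0, -6.0]]
    print(fuse_log_likelihoods(scores, [0.5, 0.3, 0.2]))
```

The same `roi_dct_features` call is applied to the visual and depth regions here, mirroring the abstract's statement that one feature extraction algorithm serves both modalities.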

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following collection(s)

Publications in journals, conferences, book chapters, etc. [19705]

Related items

ATHENA: A Greek multi-sensory database for home automation control

Audio-visual speech recognition using depth information from the Kinect in noisy video conditions

Multi-room speech activity detection using a distributed microphone network in domestic environments
