Audio-visual speech recognition incorporating facial depth information captured by the Kinect
dc.creator | Galatas, G. | en |
dc.creator | Potamianos, G. | en |
dc.creator | Makedon, F. | en |
dc.date.accessioned | 2015-11-23T10:26:54Z | |
dc.date.available | 2015-11-23T10:26:54Z | |
dc.date.issued | 2012 | |
dc.identifier.isbn | 9781467310680 | |
dc.identifier.issn | 22195491 | |
dc.identifier.uri | http://hdl.handle.net/11615/27630 | |
dc.description.abstract | We investigate the use of facial depth data of a speaking subject, captured by the Kinect device, as an additional speech-informative modality to incorporate into a traditional audio-visual automatic speech recognizer. We present our feature extraction algorithm for both visual and accompanying depth modalities, based on a discrete cosine transform of the mouth region-of-interest data, further transformed by a two-stage linear discriminant analysis projection to incorporate speech dynamics and improve classification. For automatic speech recognition utilizing the three available data streams (audio, visual, and depth), we consider both the feature and decision fusion paradigms, the latter via a state-synchronous tri-stream hidden Markov model. We report multi-speaker recognition results on a small-vocabulary task employing our recently collected bilingual audio-visual corpus with depth information, demonstrating improved recognition performance by the addition of the proposed depth stream, across a wide range of audio conditions. © 2012 EURASIP. | en |
dc.source.uri | http://www.scopus.com/inward/record.url?eid=2-s2.0-84869749157&partnerID=40&md5=6eedbff65ffa3b88447d581ff8268238 | |
dc.subject | Audio-visual automatic speech recognition | en |
dc.subject | depth information | en |
dc.subject | linear discriminant analysis | en |
dc.subject | Microsoft Kinect | en |
dc.subject | multi-sensory fusion | en |
dc.subject | Automatic speech recognition | en |
dc.subject | Microsoft | en |
dc.subject | Discrete cosine transforms | en |
dc.subject | Hidden Markov models | en |
dc.subject | Signal processing | en |
dc.subject | Speech processing | en |
dc.subject | Speech recognition | en |
dc.title | Audio-visual speech recognition incorporating facial depth information captured by the Kinect | en |
dc.type | conferenceItem | en |
File(s) in this document
Files | Size | Format | View |
---|---|---|---|
There are no files associated with this document. |
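
The abstract describes two steps that a short example may make more concrete: DCT-based features computed from the mouth region of interest, and decision fusion of the audio, visual, and depth streams. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the choice to keep a low-frequency square of coefficients, and the stream weights are all hypothetical; the record does not say how coefficients were selected or how the streams were weighted, and the two-stage LDA projection is omitted. The fusion function uses the usual multi-stream HMM form, in which a state's emission log-likelihood is a weighted sum of the per-stream log-likelihoods.

```python
import math


def dct_1d(vec):
    """Orthonormal 1-D DCT-II of a list of numbers."""
    n = len(vec)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(vec))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """Orthonormal 2-D DCT-II: transform every row, then every column."""
    rows, cols = len(block), len(block[0])
    row_pass = [dct_1d(r) for r in block]
    # col_pass[c][k] holds the coefficient at vertical frequency k, column c.
    col_pass = [dct_1d([row_pass[r][c] for r in range(rows)])
                for c in range(cols)]
    return [[col_pass[c][k] for c in range(cols)] for k in range(rows)]


def low_frequency_features(roi, keep):
    """Flatten the top-left keep x keep DCT coefficients of a mouth ROI.

    Assumption: a low-frequency square is kept; the record does not
    specify the selection rule actually used.
    """
    coeffs = dct_2d(roi)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


def fused_log_likelihood(stream_log_likes, weights):
    """Weighted sum of per-stream log-likelihoods for one HMM state.

    Assumption: the weights are hypothetical tuning parameters, one per
    stream (audio, visual, depth).
    """
    return sum(w * ll for w, ll in zip(weights, stream_log_likes))


if __name__ == "__main__":
    # A synthetic 4x4 "ROI" stands in for real grayscale or depth pixels.
    roi = [[float(r + c) for c in range(4)] for r in range(4)]
    print(low_frequency_features(roi, keep=2))
    # Hypothetical log-likelihoods and weights for audio, visual, depth.
    print(fused_log_likelihood([-10.0, -20.0, -30.0], [0.5, 0.3, 0.2]))
```

The same feature function would apply to both the visual and the depth ROI, since the abstract states that one extraction algorithm serves both modalities.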