
dc.creator: Papadimitriou K., Parelli M., Sapountzaki G., Pavlakos G., Maragos P., Potamianos G.
dc.date.accessioned: 2023-01-31T09:42:20Z
dc.date.available: 2023-01-31T09:42:20Z
dc.date.issued: 2021
dc.identifier: 10.1007/978-3-030-78095-1_21
dc.identifier.isbn: 9783030780944
dc.identifier.issn: 03029743
dc.identifier.uri: http://hdl.handle.net/11615/77584
dc.description.abstract: Cued Speech (CS) constitutes a non-vocal mode of communication that relies on lip movements in conjunction with hand positional and gestural cues, in order to disambiguate phonetic information and make it accessible to the speech and hearing impaired. In this study, we address the automatic recognition of CS from videos, employing deep learning techniques and extending our earlier work on this topic as follows: First, for visual feature extraction, in addition to hand positioning embeddings and convolutional neural network-based appearance features of the mouth region and signing hand, we consider structural information of the hand and mouth articulators. Specifically, we utilize the OpenPose framework to extract 2D lip keypoints and hand skeletal coordinates of the signer, and we also infer 3D hand skeletal coordinates from the latter, exploiting our earlier work on 2D-to-3D hand-pose regression. Second, we modify the sequence learning model by considering a time-depth separable (TDS) convolution block structure that encodes the fused visual features, in conjunction with a decoder based on connectionist temporal classification for phonetic sequence prediction. We investigate the contribution of the above to CS recognition, evaluating our model on a French and a British English CS video dataset, and we report significant gains over the state of the art on both sets. © Springer Nature Switzerland AG 2021.
dc.language.iso: en
dc.source: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
dc.source.uri: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85117904777&doi=10.1007%2f978-3-030-78095-1_21&partnerID=40&md5=0d14d79f6b64359ae8884bc8d84fd076
dc.subject: Audition
dc.subject: Classification (of information)
dc.subject: Convolution
dc.subject: Deep learning
dc.subject: Linguistics
dc.subject: Speech
dc.subject: Speech communication
dc.subject: Speech recognition
dc.subject: 2D-to-3D
dc.subject: 2D-to-3D hand-pose regression
dc.subject: Connectionist temporal classification
dc.subject: Convolutional encoders
dc.subject: Convolutional neural network
dc.subject: Cued speech
dc.subject: Cued speech recognition
dc.subject: Hand pose
dc.subject: OpenPose
dc.subject: Skeleton
dc.subject: Temporal classification
dc.subject: Time depth
dc.subject: Time-depth separable convolutional encoder
dc.subject: Convolutional neural networks
dc.subject: Springer Science and Business Media Deutschland GmbH
dc.title: Multimodal fusion and sequence learning for cued speech recognition from videos
dc.type: conferenceItem
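
As a rough illustration of the fusion step named in the abstract, the sketch below (PyTorch, not the authors' code) concatenates the per-frame streams the paper describes: CNN appearance embeddings of the mouth and signing-hand regions, a hand positional embedding, OpenPose 2D lip and hand keypoints, and 3D hand joints regressed from the 2D ones. All dimensions and keypoint counts are assumptions, and so is the choice of plain concatenation as the fusion operator; the abstract only says the visual features are fused.

import torch

def fuse_frame_features(mouth_app, hand_app, hand_pos, lips_2d, hand_2d, hand_3d):
    # Each stream is a per-frame tensor of shape (T, ...); keypoint arrays are
    # flattened per frame and all streams are concatenated on the feature axis.
    return torch.cat([mouth_app, hand_app, hand_pos,
                      lips_2d.flatten(1), hand_2d.flatten(1), hand_3d.flatten(1)],
                     dim=1)

T = 100  # number of video frames
fused = fuse_frame_features(
    torch.randn(T, 256),    # CNN appearance embedding, mouth ROI (dim assumed)
    torch.randn(T, 256),    # CNN appearance embedding, hand ROI (dim assumed)
    torch.randn(T, 2),      # hand positional embedding (form assumed)
    torch.randn(T, 20, 2),  # OpenPose 2D lip keypoints (count assumed)
    torch.randn(T, 21, 2),  # OpenPose 2D hand skeleton (21 joints)
    torch.randn(T, 21, 3),  # 3D hand joints regressed from the 2D ones
)
print(fused.shape)          # torch.Size([100, 659])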
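Similarly, a minimal sketch of the sequence-learning side described in the abstract: a time-depth separable (TDS) convolution block in the style of Hannun et al. (2019), from which the TDS design originates, feeding a linear layer trained with CTC loss so phonetic sequences can be predicted without frame-level alignment. Block count, feature width, and phoneme inventory size below are placeholders, not the paper's settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TDSBlock(nn.Module):
    # Time-depth separable block: a 2D conv over time within each of
    # `channels` groups of width `width`, then a position-wise 2-layer MLP,
    # each sub-block with a residual connection and layer norm.
    def __init__(self, channels, width, kernel_size=5, dropout=0.1):
        super().__init__()
        self.channels, self.width = channels, width
        self.conv = nn.Conv2d(channels, channels, (kernel_size, 1),
                              padding=(kernel_size // 2, 0))
        dim = channels * width
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, time, channels * width)
        b, t, _ = x.shape
        y = x.view(b, t, self.channels, self.width).permute(0, 2, 1, 3)
        y = F.relu(self.conv(y))       # convolve over the time axis only
        y = y.permute(0, 2, 1, 3).reshape(b, t, -1)
        x = self.ln1(x + y)            # residual + layer norm (conv sub-block)
        x = self.ln2(x + self.fc(x))   # residual + layer norm (FC sub-block)
        return x

class CuedSpeechRecognizer(nn.Module):
    # Fused per-frame features -> stack of TDS blocks -> per-frame phoneme
    # log-posteriors; CTC training removes the need for frame-level alignment.
    def __init__(self, feat_dim=512, channels=16, num_blocks=3, num_phonemes=40):
        super().__init__()
        assert feat_dim % channels == 0
        width = feat_dim // channels
        self.encoder = nn.Sequential(*[TDSBlock(channels, width)
                                       for _ in range(num_blocks)])
        self.head = nn.Linear(feat_dim, num_phonemes + 1)  # +1: CTC blank

    def forward(self, feats):          # feats: (batch, time, feat_dim)
        return self.head(self.encoder(feats)).log_softmax(dim=-1)

# Training-step sketch with PyTorch's built-in CTC loss:
model = CuedSpeechRecognizer()
feats = torch.randn(2, 100, 512)           # batch of fused per-frame features
targets = torch.randint(0, 40, (2, 20))    # phoneme label sequences (no blanks)
log_probs = model(feats).transpose(0, 1)   # F.ctc_loss expects (T, B, C)
loss = F.ctc_loss(log_probs, targets,
                  input_lengths=torch.full((2,), 100),
                  target_lengths=torch.full((2,), 20),
                  blank=40)
loss.backward()

The CTC blank symbol lets the decoder emit "no phoneme" at any frame, so the target phoneme sequence can be much shorter than the video; at test time, repeated labels and blanks are collapsed to produce the predicted phonetic sequence.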


Files in this item

There are no files associated with this item.
