
dc.creator: Papadimitriou K., Parelli M., Sapountzaki G., Pavlakos G., Maragos P., Potamianos G.
dc.date.accessioned: 2023-01-31T09:42:20Z
dc.date.available: 2023-01-31T09:42:20Z
dc.date.issued: 2021
dc.identifier: 10.1007/978-3-030-78095-1_21
dc.identifier.isbn: 9783030780944
dc.identifier.issn: 03029743
dc.identifier.uri: http://hdl.handle.net/11615/77584
dc.description.abstract: Cued Speech (CS) constitutes a non-vocal mode of communication that relies on lip movements in conjunction with hand positional and gestural cues, in order to disambiguate phonetic information and make it accessible to the speech and hearing impaired. In this study, we address the automatic recognition of CS from videos, employing deep learning techniques and extending our earlier work on this topic as follows: First, for visual feature extraction, in addition to hand positioning embeddings and convolutional neural network-based appearance features of the mouth region and signing hand, we consider structural information of the hand and mouth articulators. Specifically, we utilize the OpenPose framework to extract 2D lip keypoints and hand skeletal coordinates of the signer, and we also infer 3D hand skeletal coordinates from the latter, exploiting our earlier work on 2D-to-3D hand-pose regression. Second, we modify the sequence learning model by considering a time-depth separable (TDS) convolution block structure that encodes the fused visual features, in conjunction with a decoder based on connectionist temporal classification for phonetic sequence prediction. We investigate the contribution of the above to CS recognition, evaluating our model on a French and a British English CS video dataset, and we report significant gains over the state of the art on both sets. © Springer Nature Switzerland AG 2021.
dc.language.iso: en
dc.source: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
dc.source.uri: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85117904777&doi=10.1007%2f978-3-030-78095-1_21&partnerID=40&md5=0d14d79f6b64359ae8884bc8d84fd076
dc.subject: Audition
dc.subject: Classification (of information)
dc.subject: Convolution
dc.subject: Deep learning
dc.subject: Linguistics
dc.subject: Speech
dc.subject: Speech communication
dc.subject: Speech recognition
dc.subject: 2D-to-3D
dc.subject: 2D-to-3D hand-pose regression
dc.subject: Connectionist temporal classification
dc.subject: Convolutional encoders
dc.subject: Convolutional neural network
dc.subject: Cued speech
dc.subject: Cued speech recognition
dc.subject: Hand pose
dc.subject: OpenPose
dc.subject: Skeleton
dc.subject: Temporal classification
dc.subject: Time depth
dc.subject: Time-depth separable convolutional encoder
dc.subject: Convolutional neural networks
dc.subject: Springer Science and Business Media Deutschland GmbH
dc.title: Multimodal fusion and sequence learning for cued speech recognition from videos
dc.type: conferenceItem
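
As a rough illustration of the fusion step named in the abstract, the sketch below (PyTorch, not the authors' code) concatenates the per-frame streams the paper describes: CNN appearance embeddings of the mouth and signing-hand regions, a hand positional embedding, OpenPose 2D lip and hand keypoints, and 3D hand joints regressed from the 2D ones. All dimensions and keypoint counts are assumptions, and so is the choice of plain concatenation as the fusion operator; the abstract only says the visual features are fused.

import torch

def fuse_frame_features(mouth_app, hand_app, hand_pos, lips_2d, hand_2d, hand_3d):
    # Each stream is a per-frame tensor of shape (T, ...); keypoint arrays are
    # flattened per frame and all streams are concatenated on the feature axis.
    return torch.cat([mouth_app, hand_app, hand_pos,
                      lips_2d.flatten(1), hand_2d.flatten(1), hand_3d.flatten(1)],
                     dim=1)

T = 100  # number of video frames
fused = fuse_frame_features(
    torch.randn(T, 256),    # CNN appearance embedding, mouth ROI (dim assumed)
    torch.randn(T, 256),    # CNN appearance embedding, hand ROI (dim assumed)
    torch.randn(T, 2),      # hand positional embedding (form assumed)
    torch.randn(T, 20, 2),  # OpenPose 2D lip keypoints (count assumed)
    torch.randn(T, 21, 2),  # OpenPose 2D hand skeleton (21 joints)
    torch.randn(T, 21, 3),  # 3D hand joints regressed from the 2D ones
)
print(fused.shape)          # torch.Size([100, 659])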
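Similarly, a minimal sketch of the sequence-learning side described in the abstract: a time-depth separable (TDS) convolution block in the style of Hannun et al. (2019), from which the TDS design originates, feeding a linear layer trained with CTC loss so phonetic sequences can be predicted without frame-level alignment. Block count, feature width, and phoneme inventory size below are placeholders, not the paper's settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TDSBlock(nn.Module):
    # Time-depth separable block: a 2D conv over time within each of
    # `channels` groups of width `width`, then a position-wise 2-layer MLP,
    # each sub-block with a residual connection and layer norm.
    def __init__(self, channels, width, kernel_size=5, dropout=0.1):
        super().__init__()
        self.channels, self.width = channels, width
        self.conv = nn.Conv2d(channels, channels, (kernel_size, 1),
                              padding=(kernel_size // 2, 0))
        dim = channels * width
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, time, channels * width)
        b, t, _ = x.shape
        y = x.view(b, t, self.channels, self.width).permute(0, 2, 1, 3)
        y = F.relu(self.conv(y))       # convolve over the time axis only
        y = y.permute(0, 2, 1, 3).reshape(b, t, -1)
        x = self.ln1(x + y)            # residual + layer norm (conv sub-block)
        x = self.ln2(x + self.fc(x))   # residual + layer norm (FC sub-block)
        return x

class CuedSpeechRecognizer(nn.Module):
    # Fused per-frame features -> stack of TDS blocks -> per-frame phoneme
    # log-posteriors; CTC training removes the need for frame-level alignment.
    def __init__(self, feat_dim=512, channels=16, num_blocks=3, num_phonemes=40):
        super().__init__()
        assert feat_dim % channels == 0
        width = feat_dim // channels
        self.encoder = nn.Sequential(*[TDSBlock(channels, width)
                                       for _ in range(num_blocks)])
        self.head = nn.Linear(feat_dim, num_phonemes + 1)  # +1: CTC blank

    def forward(self, feats):          # feats: (batch, time, feat_dim)
        return self.head(self.encoder(feats)).log_softmax(dim=-1)

# Training-step sketch with PyTorch's built-in CTC loss:
model = CuedSpeechRecognizer()
feats = torch.randn(2, 100, 512)           # batch of fused per-frame features
targets = torch.randint(0, 40, (2, 20))    # phoneme label sequences (no blanks)
log_probs = model(feats).transpose(0, 1)   # F.ctc_loss expects (T, B, C)
loss = F.ctc_loss(log_probs, targets,
                  input_lengths=torch.full((2,), 100),
                  target_lengths=torch.full((2,), 20),
                  blank=40)
loss.backward()

The CTC blank symbol lets the decoder emit "no phoneme" at any frame, so the target phoneme sequence can be much shorter than the video; at test time, repeated labels and blanks are collapsed to produce the predicted phonetic sequence.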


Files in this item

There are no files associated with this item.
