Logo
    • English
    • Ελληνικά
    • Deutsch
    • français
    • italiano
    • español
  • Ελληνικά 
    • English
    • Ελληνικά
    • Deutsch
    • français
    • italiano
    • español
  • Σύνδεση
Προβολή τεκμηρίου 
  •   Ιδρυματικό Αποθετήριο Πανεπιστημίου Θεσσαλίας
  • Επιστημονικές Δημοσιεύσεις Μελών ΠΘ (ΕΔΠΘ)
  • Δημοσιεύσεις σε περιοδικά, συνέδρια, κεφάλαια βιβλίων κλπ.
  • Προβολή τεκμηρίου
  •   Ιδρυματικό Αποθετήριο Πανεπιστημίου Θεσσαλίας
  • Επιστημονικές Δημοσιεύσεις Μελών ΠΘ (ΕΔΠΘ)
  • Δημοσιεύσεις σε περιοδικά, συνέδρια, κεφάλαια βιβλίων κλπ.
  • Προβολή τεκμηρίου
JavaScript is disabled for your browser. Some features of this site may not work without it.
Ιδρυματικό Αποθετήριο Πανεπιστημίου Θεσσαλίας
Όλο το DSpace
  • Κοινότητες & Συλλογές
  • Ανά ημερομηνία δημοσίευσης
  • Συγγραφείς
  • Τίτλοι
  • Λέξεις κλειδιά

Multimodal fusion and sequence learning for cued speech recognition from videos

Thumbnail
Συγγραφέας
Papadimitriou K., Parelli M., Sapountzaki G., Pavlakos G., Maragos P., Potamianos G.
Ημερομηνία
2021
Γλώσσα
en
DOI
10.1007/978-3-030-78095-1_21
Λέξη-κλειδί
Audition
Classification (of information)
Convolution
Deep learning
Linguistics
Speech
Speech communication
Speech recognition
2D-To-3D
2d-to-3d hand-pose regression
Connectionist temporal classification
Convolutional encoders
Convolutional neural network
Cued speech
Cued speech recognition
Hand pose
Openpose
Skeleton
Temporal classification
Time depth
Time-depth separable convolutional encoder
Convolutional neural networks
Springer Science and Business Media Deutschland GmbH
Εμφάνιση Μεταδεδομένων
Επιτομή
Cued Speech (CS) constitutes a non-vocal mode of communication that relies on lip movements in conjunction with hand positional and gestural cues, in order to disambiguate phonetic information and make it accessible to the speech and hearing impaired. In this study, we address the automatic recognition of CS from videos, employing deep learning techniques and extending our earlier work on this topic as follows: First, for visual feature extraction, in addition to hand positioning embeddings and convolutional neural network-based appearance features of the mouth region and signing hand, we consider structural information of the hand and mouth articulators. Specifically, we utilize the OpenPose framework to extract 2D lip keypoints and hand skeletal coordinates of the signer, and we also infer 3D hand skeletal coordinates from the latter exploiting own earlier work on 2D-to-3D hand-pose regression. Second, we modify the sequence learning model, by considering a time-depth separable (TDS) convolution block structure that encodes the fused visual features, in conjunction with a decoder that is based on connectionist temporal classification for phonetic sequence prediction. We investigate the contribution of the above to CS recognition, evaluating our model on a French and a British English CS video dataset, and we report significant gains over the state-of-the-art on both sets. © Springer Nature Switzerland AG 2021.
URI
http://hdl.handle.net/11615/77584
Collections
  • Δημοσιεύσεις σε περιοδικά, συνέδρια, κεφάλαια βιβλίων κλπ. [19735]

Related items

Showing items related by title, author, creator and subject.

  • Thumbnail

    A fully convolutional sequence learning approach for cued speech recognition from videos 

    Papadimitriou K., Potamianos G. (2021)
    Cued Speech constitutes a sign-based communication variant for the speech and hearing impaired, which involves visual information from lip movements combined with hand positional and gestural cues. In this paper, we consider ...
  • Thumbnail

    Resource-efficient TDNN Architectures for Audio-visual Speech Recognition 

    Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E. (2021)
    In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work that introduced efficient convolutional neural ...
  • Thumbnail

    Resource-adaptive deep learning for visual speech recognition 

    Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E. (2020)
    We focus on the problem of efficient architectures for lipreading that allow trading-off computational resources for visual speech recognition accuracy. In particular, we make two contributions: First, we introduce ...
htmlmap 

 

Πλοήγηση

Όλο το DSpaceΚοινότητες & ΣυλλογέςΑνά ημερομηνία δημοσίευσηςΣυγγραφείςΤίτλοιΛέξεις κλειδιάΑυτή η συλλογήΑνά ημερομηνία δημοσίευσηςΣυγγραφείςΤίτλοιΛέξεις κλειδιά

Ο λογαριασμός μου

ΣύνδεσηΕγγραφή (MyDSpace)
Πληροφορίες-Επικοινωνία
ΑπόθεσηΣχετικά μεΒοήθειαΕπικοινωνήστε μαζί μας
Επιλογή ΓλώσσαςΌλο το DSpace
EnglishΕλληνικά
htmlmap