• English
    • Ελληνικά
    • Deutsch
    • français
    • italiano
    • español
  • Deutsch 
    • English
    • Ελληνικά
    • Deutsch
    • français
    • italiano
    • español
  • Einloggen
Dokumentanzeige 
  •   DSpace Startseite
  • Επιστημονικές Δημοσιεύσεις Μελών ΠΘ (ΕΔΠΘ)
  • Δημοσιεύσεις σε περιοδικά, συνέδρια, κεφάλαια βιβλίων κλπ.
  • Dokumentanzeige
  •   DSpace Startseite
  • Επιστημονικές Δημοσιεύσεις Μελών ΠΘ (ΕΔΠΘ)
  • Δημοσιεύσεις σε περιοδικά, συνέδρια, κεφάλαια βιβλίων κλπ.
  • Dokumentanzeige
JavaScript is disabled for your browser. Some features of this site may not work without it.
Gesamter Bestand
  • Bereiche & Sammlungen
  • Erscheinungsdatum
  • Autoren
  • Titeln
  • Schlagworten

Detecting audio-visual synchrony using deep neural networks

Thumbnail
Autor
Marcheret E., Potamianos G., Vopicka J., Goel V.
Datum
2015
Language
en
Schlagwort
Mobile phones
Quality assurance
Speech communication
Audio-visual
Audio-visual database
Automatic speech recognition
Deep neural networks
High quality data
Smart cell phones
Speaker detection
Speaker independents
Speech recognition
International Speech and Communication Association
Zur Langanzeige
Zusammenfassung
In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not. This is of interest in a wide range of applications, for example spoof detection in biometrics, lip-syncing, speaker detection and diarization in multi-subject videos, and video data quality assurance. In our adopted approach, we investigate the use of deep neural networks (DNNs) for this purpose. The proposed synchrony DNNs operate directly on audio and visual features over relatively wide contexts, or, alternatively, on appropriate hidden (bottleneck) or output layers of DNNs trained for single-modal or audio-visual automatic speech recognition. In all cases, the synchrony DNN classes consist of the "in-sync" and a number of "out-of-sync" targets, the latter considered at multiples of ± 30 msec steps of overall asynchrony between the two modalities. We apply the proposed approach on two multi-subject audio-visual databases, one of high-quality data recorded in studio-like conditions, and one of data recorded by smart cell-phone devices. On both sets, and under a speaker-independent experimental framework, we are able to achieve very low equal-error-rates in distinguishing "in-sync" from "out-of-sync" data. Copyright © 2015 ISCA.
URI
http://hdl.handle.net/11615/76339
Collections
  • Δημοσιεύσεις σε περιοδικά, συνέδρια, κεφάλαια βιβλίων κλπ. [19735]
htmlmap 

 

Stöbern

Gesamter BestandBereiche & SammlungenErscheinungsdatumAutorenTitelnSchlagwortenDiese SammlungErscheinungsdatumAutorenTitelnSchlagworten

Mein Benutzerkonto

EinloggenRegistrieren
Help Contact
DepositionAboutHelpKontakt
Choose LanguageGesamter Bestand
EnglishΕλληνικά
htmlmap