| dc.creator | Marcheret E., Potamianos G., Vopicka J., Goel V. | en |
| dc.date.accessioned | 2023-01-31T08:57:17Z | |
| dc.date.available | 2023-01-31T08:57:17Z | |
| dc.date.issued | 2015 | |
| dc.identifier.issn | 2308-457X | |
| dc.identifier.uri | http://hdl.handle.net/11615/76339 | |
| dc.description.abstract | In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not. This is of interest in a wide range of applications, for example spoof detection in biometrics, lip-syncing, speaker detection and diarization in multi-subject videos, and video data quality assurance. In our adopted approach, we investigate the use of deep neural networks (DNNs) for this purpose. The proposed synchrony DNNs operate directly on audio and visual features over relatively wide contexts, or, alternatively, on appropriate hidden (bottleneck) or output layers of DNNs trained for single-modal or audio-visual automatic speech recognition. In all cases, the synchrony DNN classes consist of the "in-sync" and a number of "out-of-sync" targets, the latter considered at multiples of ± 30 msec steps of overall asynchrony between the two modalities. We apply the proposed approach on two multi-subject audio-visual databases, one of high-quality data recorded in studio-like conditions, and one of data recorded by smart cell-phone devices. On both sets, and under a speaker-independent experimental framework, we are able to achieve very low equal-error-rates in distinguishing "in-sync" from "out-of-sync" data. Copyright © 2015 ISCA. | en |
| dc.language.iso | en | en |
| dc.source | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | en |
| dc.source.uri | https://www.scopus.com/inward/record.uri?eid=2-s2.0-84959124862&partnerID=40&md5=529ba3e5d1d00730557b7400378af159 | |
| dc.subject | Mobile phones | en |
| dc.subject | Quality assurance | en |
| dc.subject | Speech communication | en |
| dc.subject | Audio-visual | en |
| dc.subject | Audio-visual database | en |
| dc.subject | Automatic speech recognition | en |
| dc.subject | Deep neural networks | en |
| dc.subject | High quality data | en |
| dc.subject | Smart cell phones | en |
| dc.subject | Speaker detection | en |
| dc.subject | Speaker independence | en |
| dc.subject | Speech recognition | en |
| dc.subject | International Speech and Communication Association | en |
| dc.title | Detecting audio-visual synchrony using deep neural networks | en |
| dc.type | conferenceItem | en |