Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view

Thermos S., Potamianos G.

dc.creator	Thermos S., Potamianos G.	en
dc.date.accessioned	2023-01-31T10:08:16Z
dc.date.available	2023-01-31T10:08:16Z
dc.date.issued	2017
dc.identifier	10.1109/SLT.2016.7846321
dc.identifier.isbn	9781509049035
dc.identifier.uri	http://hdl.handle.net/11615/79699
dc.description.abstract	Motivated by the increasing popularity of depth visual sensors, such as the Kinect device, we investigate the utility of depth information in audio-visual speech activity detection. A two-subject scenario is assumed, which also allows speech overlap to be considered. Two sensory setups are employed, where depth video captures either a frontal or profile view of the subjects, and is subsequently combined with the corresponding planar video and audio streams. Further, multi-view fusion is considered, using audio and planar video from a sensor at the complementary view setup. Support vector machines provide temporal speech activity classification for each visually detected subject, fusing the available modality streams. Classification results are further combined to yield speaker diarization. Experiments are reported on a suitable audio-visual corpus recorded by two Kinects. Results demonstrate the benefits of depth information, particularly in the frontal depth view setup, reducing speech activity detection and speaker diarization errors over systems that ignore it. © 2016 IEEE.	en
dc.language.iso	en	en
dc.source	2016 IEEE Workshop on Spoken Language Technology, SLT 2016 - Proceedings	en
dc.source.uri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85016000416&doi=10.1109%2fSLT.2016.7846321&partnerID=40&md5=1d74fca461c773aea90f19566e4a0ef6
dc.subject	Speech	en
dc.subject	Speech analysis	en
dc.subject	Audio-visual fusion	en
dc.subject	Kinect	en
dc.subject	Speaker diarization	en
dc.subject	Speech activity detections	en
dc.subject	Visual depth	en
dc.subject	Speech recognition	en
dc.subject	Institute of Electrical and Electronics Engineers Inc.	en
dc.title	Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view	en
dc.type	conferenceItem	en

Files in this item

No files in this item.

This item appears in the following collections

Publications in journals, conferences, book chapters, etc. [19705]

Related items

ATHENA: A Greek multi-sensory database for home automation control

Audio-visual speech recognition using depth information from the Kinect in noisy video conditions

Multi-room speech activity detection using a distributed microphone network in domestic environments