Multi-room speech activity detection using a distributed microphone network in domestic environments
Autore
Giannoulis P., Brutti A., Matassoni M., Abad A., Katsamanis A., Matos M., Potamianos G., Maragos P.Data
2015Language
en
Soggetto
Abstract
Domestic environments are particularly challenging for distant speech recognition: reverberation, background noise and interfering sources, as well as the propagation of acoustic events across adjacent rooms, critically degrade the performance of standard speech processing algorithms. In this application scenario, a crucial task is the detection and localization of speech events generated by users within the various rooms. A specific challenge of multi-room environments is the inter-room interference that negatively affects speech activity detectors. In this paper, we present and compare different solutions for the multi-room speech activity detection task. The combination of a model-based room-independent speech activity detection module with a room-dependent inside/outside classification stage, based on specific features, provides satisfactory performance. The proposed methods are evaluated on a multi-room, multi-channel corpus, where spoken commands and other typical acoustic events occur in different rooms. © 2015 EURASIP.
Collections
Related items
Showing items related by title, author, creator and subject.
-
ATHENA: A Greek multi-sensory database for home automation control
Tsiami, A.; Rodomagoulakis, I.; Giannoulis, P.; Katsamanis, A.; Potamianos, G.; Maragos, P. (2014)In this paper we present a Greek speech database with real multi-modal data in a smart home two-room environment. In total, 20 speakers were recorded in 240 one-minute long sessions. The recordings include utterances of ... -
Audio-visual speech recognition using depth information from the Kinect in noisy video conditions
Galatas, G.; Potamianos, G.; Makedon, F. (2012)In this paper we build on our recent work, where we successfully incorporated facial depth data of a speaker captured by the Microsoft Kinect device, as a third data stream in an audio-visual automatic speech recognizer. ... -
Multimodal fusion and sequence learning for cued speech recognition from videos
Papadimitriou K., Parelli M., Sapountzaki G., Pavlakos G., Maragos P., Potamianos G. (2021)Cued Speech (CS) constitutes a non-vocal mode of communication that relies on lip movements in conjunction with hand positional and gestural cues, in order to disambiguate phonetic information and make it accessible to the ...