Resource-efficient TDNN Architectures for Audio-visual Speech Recognition
dc.creator | Koumparoulis A., Potamianos G., Thomas S., da Silva Morais E. | en |
dc.date.accessioned | 2023-01-31T08:45:25Z | |
dc.date.available | 2023-01-31T08:45:25Z | |
dc.date.issued | 2021 | |
dc.identifier | 10.23919/EUSIPCO54536.2021.9616215 | |
dc.identifier.isbn | 9789082797060 | |
dc.identifier.issn | 22195491 | |
dc.identifier.uri | http://hdl.handle.net/11615/75305 | |
dc.description.abstract | In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work that introduced efficient convolutional neural networks (CNNs) for visual-only speech recognition, by focusing here on the sequence modeling component of the architecture, proposing a novel resource-efficient time-delay neural network (TDNN) that we extend for AVSR. In more detail, we introduce the sTDNN-F module, which combines the factored TDNN (TDNN-F) with grouped fully-connected layers and the shuffle operation. We then develop an AVSR system based on the sTDNN-F, incorporating the efficient CNNs of our earlier work and other standard visual processing and speech recognition modules. We evaluate our approach on the popular TCD-TIMIT corpus, under two speaker-independent training/testing scenarios. Our best sTDNN-F based AVSR system turns out to be 74% more efficient than a traditional TDNN one and 35% more efficient than TDNN-F, while maintaining similar recognition accuracy and noise robustness, and also significantly outperforming its audio-only counterpart. © 2021 European Signal Processing Conference. All rights reserved. | en |
dc.language.iso | en | en |
dc.source | European Signal Processing Conference | en |
dc.source.uri | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85123160449&doi=10.23919%2fEUSIPCO54536.2021.9616215&partnerID=40&md5=450341a7da64e25d2560fcb5babbe19a | |
dc.subject | Audio acoustics | en |
dc.subject | Convolutional neural networks | en |
dc.subject | Network architecture | en |
dc.subject | Speech recognition | en |
dc.subject | Audio-visual | en |
dc.subject | Audio-visual automatic speech recognition | en |
dc.subject | Audiovisual speech recognition | en |
dc.subject | Automatic speech recognition | en |
dc.subject | Automatic speech recognition system | en |
dc.subject | Convolutional neural network | en |
dc.subject | MobiLipNet | en |
dc.subject | Neural network architecture | en |
dc.subject | Resource-efficient | en |
dc.subject | Time delay neural networks | en |
dc.subject | Computational efficiency | en |
dc.subject | European Signal Processing Conference, EUSIPCO | en |
dc.title | Resource-efficient TDNN Architectures for Audio-visual Speech Recognition | en |
dc.type | conferenceItem | en |
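
The abstract describes the sTDNN-F module as combining the factored TDNN with grouped fully-connected layers and the shuffle operation. As a rough illustration of those two ingredients only, the sketch below shows a grouped fully-connected layer, a ShuffleNet-style channel shuffle, and the weight count each layer type implies. It is not the authors' implementation: the function names, the layer sizes, and the group count are all assumptions chosen for illustration, since this record does not specify them.

```python
# Illustrative sketch only; names, sizes, and group counts are hypothetical
# and are not taken from the paper described in this record.

def channel_shuffle(x, groups):
    """Interleave channels across groups (ShuffleNet-style shuffle).

    x is a flat list of channel values whose length is divisible by groups.
    """
    n = len(x)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # View x as a (groups, per_group) matrix, transpose it, and flatten again.
    return [x[g * per_group + i] for i in range(per_group) for g in range(groups)]


def grouped_linear(x, weights):
    """Apply one independent weight matrix per channel group (no bias).

    weights[g] is a list of rows; each row has len(x) // len(weights) entries.
    """
    groups = len(weights)
    per_group = len(x) // groups
    out = []
    for g, w in enumerate(weights):
        chunk = x[g * per_group:(g + 1) * per_group]
        out.extend(sum(wi * xi for wi, xi in zip(row, chunk)) for row in w)
    return out


def linear_params(d_in, d_out, groups=1):
    """Weight count of a bias-free fully-connected layer split into groups."""
    return (d_in // groups) * (d_out // groups) * groups


if __name__ == "__main__":
    print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))
    print(linear_params(512, 512), linear_params(512, 512, groups=4))
```

Splitting a layer into g groups divides its weight count by g, but each output then sees only its own group's inputs; shuffling the channels between consecutive grouped layers lets information mix across groups. The efficiency figures quoted in the abstract come from the paper's own measurements and are not reproduced by this arithmetic.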
Files in this item
Files | Size | Format | View |
---|---|---|---|
There are no files associated with this item. |