Authors: Mir Tahmid Hossain, Mahsa Sanaei Nourani, Zahra Rahimian, Md Nawab Yousuf Ali
Keywords: articulation, spoken language, multimodal learning, speech synthesis, language understanding.
Abstract
Our framework, Phonesis, is a machine-learned model of spoken language as an embodied mechanism, grounded in real human behaviour through multimodal corpora (MOCHA-TIMIT, GRID, VoxCeleb2). Unlike simulation-based approaches, it couples a Speaker model, which interprets visual input and intent and converts them into realistic speech, with a Listener model, which decodes the resulting audio. Phonesis is trained end-to-end and achieves high accuracy in articulatory prediction, speech quality, and intent decoding. Ablation studies indicate that joint optimization of the two models enhances performance. Zero-shot evaluation on VoxCeleb2 demonstrates strong generalization, suggesting applications in voice rehabilitation, brain-computer interfaces, and language understanding.
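The abstract's key claim is that jointly optimizing the Speaker and Listener models end-to-end improves performance over training them separately. A minimal sketch of that idea, using toy linear stand-ins for both models (the loss forms, dimensions, and weighting coefficient `lam` are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy targets: an "acoustic frame" and a one-hot intent label.
target_speech = rng.normal(size=8)
intent_label = np.array([0.0, 1.0])

# Linear stand-ins: Speaker maps visual input to speech,
# Listener maps speech to a decoded intent.
visual_input = rng.normal(size=4)
W_speaker = rng.normal(scale=0.1, size=(8, 4))
W_listener = rng.normal(scale=0.1, size=(2, 8))

lr, lam = 0.05, 0.5  # learning rate and listener-loss weight (assumed)

for _ in range(200):
    speech = W_speaker @ visual_input   # Speaker forward pass
    intent = W_listener @ speech        # Listener forward pass
    err_s = speech - target_speech      # speech reconstruction error
    err_l = intent - intent_label       # intent decoding error
    # Joint loss L = ||err_s||^2 + lam * ||err_l||^2.
    # The Speaker's gradient carries BOTH terms, so it is shaped by the
    # Listener's decoding error as well -- the point of joint optimization.
    grad_speaker = (np.outer(err_s, visual_input)
                    + lam * np.outer(W_listener.T @ err_l, visual_input))
    grad_listener = lam * np.outer(err_l, speech)
    W_speaker -= lr * grad_speaker
    W_listener -= lr * grad_listener

joint_loss = (np.sum((W_speaker @ visual_input - target_speech) ** 2)
              + lam * np.sum((W_listener @ W_speaker @ visual_input
                              - intent_label) ** 2))
print(joint_loss)
```

After a few hundred descent steps the joint loss falls close to zero, since the Speaker simultaneously fits the speech target and produces output the Listener can decode. Real systems replace the linear maps with deep networks and backpropagate the combined loss the same way.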
Mir Tahmid Hossain
ORCID: https://orcid.org/0009-0000-0852-2039
Mid Sweden University
Sundsvall, Sweden
E-mail:
Mahsa Sanaei Nourani
ORCID: https://orcid.org/0009-0007-8611-6862
Tabarestan University
Mazandaran, Iran
E-mail:
Zahra Rahimian
ORCID: https://orcid.org/0009-0001-7581-1735
Dr. Shariaty Technical and Vocational College
Tehran, Iran
E-mail:
Dr Md Nawab Yousuf Ali
ORCID: https://orcid.org/0009-0009-8069-4527
East West University Department of Computer Science and Engineering
Dhaka, Bangladesh
E-mail:
DOI: https://doi.org/10.56415/csjm.v34.05