Computer Science Journal of Moldova, vol. 34, no. 1 (100), 2026

Phonesis: Towards embodied spoken language models grounded in human physiology

Authors: Mir Tahmid Hossain, Mahsa Sanaei Nourani, Zahra Rahimian, Md Nawab Yousuf Ali
Keywords: articulation, spoken language, multimodal learning, speech synthesis, language understanding.

Abstract

Phonesis is a framework that models spoken language as an embodied mechanism, learned from real human behaviour captured in multimodal corpora (MOCHA-TIMIT, GRID, VoxCeleb2). Unlike simulation-based approaches, it combines a Speaker model, which interprets visual input and intent and converts them into realistic speech, with a Listener model, which interprets the resulting audio. Phonesis is trained end-to-end and achieves high accuracy in articulatory prediction, speech quality, and intent decoding. Ablation studies indicate that joint optimization of the two models enhances performance. Zero-shot evaluation on VoxCeleb2 demonstrates strong generalization, suggesting applications in voice rehabilitation, brain-computer interfaces, and language understanding.
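The joint end-to-end optimization described above can be illustrated with a minimal sketch. This is not the authors' implementation: the linear Speaker and Listener parameterizations, the synthetic data, the loss weighting, and all dimensions are assumptions made purely for illustration. The sketch only shows the general idea of one objective combining a Speaker loss (speech-feature reconstruction) and a Listener loss (intent decoding), optimized together.

```python
import numpy as np

# Illustrative sketch only (assumed, not from the paper): a linear Speaker
# maps a visual/intent embedding x to speech features, and a linear Listener
# decodes intent from those features; both are trained on one joint loss.
rng = np.random.default_rng(0)

x = rng.normal(size=(32, 8))               # toy visual/intent embeddings
s = x @ rng.normal(size=(8, 4))            # synthetic target speech features
y = (x[:, 0] > 0).astype(float)            # synthetic binary intent labels

W_spk = np.zeros((8, 4))                   # Speaker: x -> speech features
w_lis = np.zeros(4)                        # Listener: features -> intent logit

def joint_loss(W, w):
    s_hat = x @ W                                      # Speaker output
    p = 1 / (1 + np.exp(-(s_hat @ w)))                 # Listener probability
    rec = np.mean((s_hat - s) ** 2)                    # reconstruction loss
    ce = -np.mean(y * np.log(p + 1e-9)
                  + (1 - y) * np.log(1 - p + 1e-9))    # intent decoding loss
    return rec + ce                                    # joint objective

def grads(W, w):
    # Analytic gradients of the joint loss w.r.t. both models' parameters:
    # the Listener's error signal flows back into the Speaker (end-to-end).
    s_hat = x @ W
    p = 1 / (1 + np.exp(-(s_hat @ w)))
    d_rec = 2 * x.T @ (s_hat - s) / s.size
    d_ce_shat = np.outer(p - y, w) / len(y)
    dW = d_rec + x.T @ d_ce_shat
    dw = s_hat.T @ (p - y) / len(y)
    return dW, dw

loss0 = joint_loss(W_spk, w_lis)
for _ in range(50):                        # plain gradient descent
    dW, dw = grads(W_spk, w_lis)
    W_spk -= 0.1 * dW
    w_lis -= 0.1 * dw

print(joint_loss(W_spk, w_lis) < loss0)    # joint loss has decreased
```

Because both losses share the Speaker's output, gradient updates on the joint objective shape the speech features to serve reconstruction and intent decoding at once, which is the intuition behind the abstract's finding that joint optimization outperforms separate training.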

Mir Tahmid Hossain
ORCID: https://orcid.org/0009-0000-0852-2039
Mid Sweden University
Sundsvall, Sweden
E-mail:

Mahsa Sanaei Nourani
ORCID: https://orcid.org/0009-0007-8611-6862
Tabarestan University
Mazandaran, Iran
E-mail:

Zahra Rahimian
ORCID: https://orcid.org/0009-0001-7581-1735
Dr. Shariaty Technical and Vocational College
Tehran, Iran
E-mail:

Md Nawab Yousuf Ali
ORCID: https://orcid.org/0009-0009-8069-4527
Department of Computer Science and Engineering, East West University
Dhaka, Bangladesh
E-mail:

DOI

https://doi.org/10.56415/csjm.v34.05

Fulltext

Adobe PDF document, 0.61 MB