Manuscript received March 11, 2026; revised April 17, 2026; accepted May 2, 2026
Abstract—Emotion-aware conversational systems are critical to the development of human-centric Artificial Intelligence (AI). Single-modality systems often fail to capture both the acoustic patterns and the textual meaning associated with emotional expression. Moreover, previous emotion-aware models have not accurately modeled cross-modal dependencies and temporal information under real-time constraints, resulting in poor recognition performance. This study proposes an emotion-aware conversational framework that integrates offline-trained speech emotion recognition with a multimodal chatbot pipeline. In the offline phase, three pretrained speech models, Wav2Vec2 (waveform to vectors version 2), Hidden-Unit Bidirectional Encoder Representations from Transformers (HuBERT), and Whisper, were fine-tuned on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) for multi-class emotion classification. Among them, the fine-tuned Whisper model achieved the best performance and was selected as the primary acoustic emotion encoder. The trained Whisper model was then deployed within a real-time chatbot architecture, where speech input is transcribed using automatic speech recognition and combined with acoustic emotion predictions. A confidence-weighted multimodal fusion strategy integrates audio-based and text-based emotion cues, followed by an Adaptive Confidence-Weighted Temporal Memory (ACWTM) module that models short-term emotional continuity. The proposed framework attained 87.3% accuracy and outperformed the unimodal baselines. The novelty of this study lies in two new modules, the Adaptive Confidence-Weighted Decision Fusion Module (ACWDFM) and the ACWTM, which play key roles in improving emotion recognition. The framework achieves enhanced robustness and stable conversational coherence, offering a scalable basis for real-time empathetic human-AI interaction.
Index Terms—artificial intelligence, conversational system, emotion-aware interaction, memory-augmented voice agent, multimodal voice agent
Cite: Vijaya Bharathi A. and Prashant Nitnaware, "A Confidence-Weighted Memory-Augmented Multimodal Voice Agent for Real-Time Emotion-Aware Interaction," International Journal of Electrical and Electronic Engineering & Telecommunications, vol. 15, no. 3, pp. 205-220, 2026. doi: 10.18178/ijeetc.15.3.205-220
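
The abstract names a confidence-weighted fusion step and a confidence-weighted temporal memory but does not give their formulas. The following is only a minimal Python sketch of what such steps could look like, not the authors' ACWDFM or ACWTM. It assumes that each modality outputs a probability distribution over the same emotion labels, that a modality's confidence is its maximum class probability, and that the memory blends the previous state with the new estimate using a carry-over factor that shrinks as confidence grows. The names `fuse`, `TemporalMemory`, and `max_carry` are hypothetical, as is the four-label emotion set in the usage note.

```python
from typing import Dict, Optional

Probs = Dict[str, float]


def normalize(p: Probs) -> Probs:
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(p.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {label: value / total for label, value in p.items()}


def confidence(p: Probs) -> float:
    """Assumed confidence measure: the largest class probability."""
    return max(p.values())


def fuse(audio: Probs, text: Probs) -> Probs:
    """Late fusion: weight each modality by its relative confidence."""
    if set(audio) != set(text):
        raise ValueError("both modalities must use the same labels")
    audio, text = normalize(audio), normalize(text)
    c_audio, c_text = confidence(audio), confidence(text)
    w_audio = c_audio / (c_audio + c_text)
    w_text = 1.0 - w_audio
    return {e: w_audio * audio[e] + w_text * text[e] for e in audio}


class TemporalMemory:
    """Short-term smoothing of fused emotion estimates across turns."""

    def __init__(self, max_carry: float = 0.5) -> None:
        # max_carry is a hypothetical upper bound on how much of the
        # previous state is kept when the new estimate is uncertain.
        self.max_carry = max_carry
        self.state: Optional[Probs] = None

    def update(self, fused: Probs) -> Probs:
        fused = normalize(fused)
        if self.state is None:
            self.state = dict(fused)
        else:
            # A confident new estimate keeps less of the past state.
            carry = self.max_carry * (1.0 - confidence(fused))
            self.state = normalize({
                e: carry * self.state[e] + (1.0 - carry) * fused[e]
                for e in fused
            })
        return self.state
```

As a usage illustration, calling `fuse({"neutral": 0.1, "happy": 0.7, "sad": 0.1, "angry": 0.1}, {"neutral": 0.4, "happy": 0.3, "sad": 0.2, "angry": 0.1})` gives the more confident audio prediction the larger weight, so "happy" stays the top label. Passing successive fused outputs to one `TemporalMemory` instance damps the effect of a single low-confidence turn, which is one simple way to read "short-term emotional continuity".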