IJEETC 2026 Vol.15(3): 205-220
doi: 10.18178/ijeetc.15.3.205-220

A Confidence-Weighted Memory-Augmented Multimodal Voice Agent for Real-Time Emotion-Aware Interaction

Vijaya Bharathi A.* and Prashant Nitnaware
Department of Computer Engineering, Pillai College of Engineering, Panvel, India
Email: jagan21it@student.mes.ac.in (V.B.A.), pnitnaware@mes.ac.in (P.N.)
*Corresponding author

Manuscript received March 11, 2026; revised April 17, 2026; accepted May 2, 2026

Abstract—Emotion-aware conversational systems are critical to the development of human-centric Artificial Intelligence (AI). Single-modality systems often fail to capture both the acoustic patterns and the textual meaning associated with emotional expression. Moreover, previous emotion-aware models have struggled to model cross-modal dependencies and temporal information accurately under real-time constraints, resulting in poor recognition performance. This study proposes an emotion-aware conversational framework that integrates offline-trained speech emotion recognition with a multimodal chatbot pipeline. In the offline phase, three self-supervised speech models, Wav2Vec2 (waveform to vectors version 2), Hidden-Unit Bidirectional Encoder Representations from Transformers (HuBERT), and Whisper, were fine-tuned on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) for multi-class emotion classification. Among them, the fine-tuned Whisper model achieved the best performance and was selected as the primary acoustic emotion encoder. The trained Whisper model was then deployed within a real-time chatbot architecture, where speech input is transcribed using automatic speech recognition and combined with acoustic emotion predictions. A confidence-weighted multimodal fusion strategy integrates audio-based and text-based emotion cues, followed by an Adaptive Confidence-Weighted Temporal Memory (ACWTM) module that models short-term emotional continuity. The proposed model attained 87.3% accuracy, outperforming the unimodal baselines. The novelty of this study lies in two new modules, the Adaptive Confidence-Weighted Decision Fusion Module (ACWDFM) and the ACWTM, which play key roles in improving emotion recognition. The framework achieves enhanced robustness and stable conversational coherence, offering a scalable basis for real-time empathetic human-AI interaction.
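
The abstract does not give the fusion or memory equations, so the following Python sketch is only one plausible reading of "confidence-weighted fusion followed by a confidence-weighted temporal memory", not the authors' ACWDFM or ACWTM. Everything in it is an assumption made for illustration: the four-label set `EMOTIONS`, the use of the top class probability as the confidence score, the smoothing factor `alpha`, and the names `fuse`, `TemporalMemory`, and `top_label`.

```python
from typing import Dict, Optional

# Illustrative label set; the paper's actual emotion classes are not listed here.
EMOTIONS = ("neutral", "happy", "sad", "angry")

Probs = Dict[str, float]


def confidence(probs: Probs) -> float:
    """Assumed confidence score: the probability of the top class."""
    return max(probs.values())


def fuse(audio: Probs, text: Probs) -> Probs:
    """Mix the two distributions, weighting each by its relative confidence."""
    c_a, c_t = confidence(audio), confidence(text)
    w_a = c_a / (c_a + c_t)
    w_t = 1.0 - w_a
    return {e: w_a * audio[e] + w_t * text[e] for e in EMOTIONS}


class TemporalMemory:
    """Assumed memory: exponential smoothing gated by the new turn's confidence."""

    def __init__(self, alpha: float = 0.6) -> None:
        self.alpha = alpha
        self.state: Optional[Probs] = None

    def update(self, fused: Probs) -> Probs:
        if self.state is None:
            # First turn: nothing to smooth against yet.
            self.state = dict(fused)
        else:
            # Low-confidence turns move the remembered state only slightly.
            g = self.alpha * confidence(fused)
            self.state = {
                e: (1.0 - g) * self.state[e] + g * fused[e] for e in EMOTIONS
            }
        return self.state


def top_label(probs: Probs) -> str:
    """Return the emotion with the highest probability."""
    return max(probs, key=lambda e: probs[e])


if __name__ == "__main__":
    audio = {"neutral": 0.1, "happy": 0.1, "sad": 0.1, "angry": 0.7}
    text = {"neutral": 0.4, "happy": 0.3, "sad": 0.2, "angry": 0.1}
    memory = TemporalMemory()
    print(top_label(memory.update(fuse(audio, text))))
```

In this sketch a confident acoustic prediction outweighs an uncertain text prediction, and an ambiguous later turn shifts the remembered emotional state only slightly, which is the kind of short-term continuity the abstract describes.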


Index Terms—artificial intelligence, conversational system, emotion-aware interaction, memory-augmented voice agent, multimodal voice agent



Cite: Vijaya Bharathi A. and Prashant Nitnaware, "A Confidence-Weighted Memory-Augmented Multimodal Voice Agent for Real-Time Emotion-Aware Interaction," International Journal of Electrical and Electronic Engineering & Telecommunications, vol. 15, no. 3, pp. 205-220, 2026. doi: 10.18178/ijeetc.15.3.205-220