IJEETC 2026 Vol.15(3): 192-204 (Volume 15, No. 3, May 2026)
doi: 10.18178/ijeetc.15.3.192-204

Hybrid CNN-LSTM Architecture with MFCC-Based Class-Balanced Augmentation for Arabic Speech Emotion Recognition

Sarmad H. Alfarag
Electrical Engineering Department, Wasit University, Wasit, Iraq
Email: sarmad.hamad@uowasit.edu.iq (S.H.A.)
*Corresponding author

Manuscript received January 1, 2026; revised April 1, 2026; accepted April 21, 2026

Abstract—Arabic Speech Emotion Recognition (SER) faces serious challenges, including dialectal variety, severe class imbalance, and a lack of sufficient data. To address these concerns, this paper presents a comprehensive methodology that integrates a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) architecture with strategic data augmentation. The method is evaluated on two Arabic emotional speech datasets: the Basic Arabic Vocal Emotions Dataset (BAVED), which serves as the training data, and Eastern Youngsters Arabic Speech Emotions (EYASE), which serves as an independent validation set. Class-balanced augmentation techniques such as pitch shifting, time stretching, and noise injection are used to expand the severely imbalanced BAVED dataset to approximately 1,600 samples per emotion class. The hybrid architecture combines a convolutional neural network, which extracts spatial features from Mel-Frequency Cepstral Coefficient (MFCC) representations, with bidirectional Long Short-Term Memory (LSTM) networks and an attention mechanism, which model the temporal sequence. The experimental results show a substantial performance improvement: overall accuracy on BAVED rises to 97.23%, compared with 89% for the baseline. Cross-dataset testing on EYASE yields an accuracy of 87%, indicating strong generalization across different speakers and recording conditions. The findings demonstrate that data quality achieved through balanced augmentation has a greater impact on performance than architectural complexity alone, providing useful guidance for the development of Arabic emotion recognition systems in resource-limited environments.
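
The class-balanced augmentation step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, parameter ranges, toy signals, and class labels below are all hypothetical, and the time-stretch and pitch-shift functions are crude resampling-based stand-ins (a production pipeline would more likely use a phase-vocoder-based audio library). The sketch only shows the balancing idea: minority classes are topped up with randomly augmented copies until each class reaches a target count.

```python
import random
from collections import Counter


def add_noise(signal, rng, noise_std=0.005):
    """Noise injection: add zero-mean Gaussian noise to every sample."""
    return [s + rng.gauss(0.0, noise_std) for s in signal]


def resample(signal, rate):
    """Linear-interpolation resampling; rate > 1 shortens the signal."""
    n_out = max(2, int(round(len(signal) / rate)))
    out = []
    for i in range(n_out):
        pos = i * (len(signal) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def time_stretch(signal, rng):
    """Crude stand-in: resampling changes duration (and, unlike a true stretch, pitch)."""
    return resample(signal, rng.uniform(0.9, 1.1))


def pitch_shift(signal, rng):
    """Crude stand-in: resample by a semitone ratio, then crop or zero-pad to the original length."""
    semitones = rng.uniform(-2.0, 2.0)  # assumed range, not taken from the paper
    shifted = resample(signal, 2.0 ** (semitones / 12.0))
    shifted = shifted[: len(signal)]
    return shifted + [0.0] * (len(signal) - len(shifted))


AUGMENTATIONS = (add_noise, time_stretch, pitch_shift)


def balance_classes(samples, target_per_class, seed=0):
    """Top up every class below target_per_class with augmented copies of its own samples."""
    rng = random.Random(seed)
    by_label = {}
    for signal, label in samples:
        by_label.setdefault(label, []).append(signal)

    balanced = list(samples)
    for label, signals in by_label.items():
        for _ in range(max(0, target_per_class - len(signals))):
            source = rng.choice(signals)        # pick an original sample of this class
            augment = rng.choice(AUGMENTATIONS)  # pick one augmentation at random
            balanced.append((augment(source, rng), label))
    return balanced


# Toy data with placeholder labels: 5 "neutral" signals and 2 "angry" signals.
toy = [([0.0, 0.1, 0.2, 0.1], "neutral")] * 5 + [([0.3, -0.3, 0.3, -0.3], "angry")] * 2
balanced = balance_classes(toy, target_per_class=5)
counts = Counter(label for _, label in balanced)
```

With a real corpus, `target_per_class` would be set near the roughly 1,600 samples per class reported above, and MFCC features would be extracted from the balanced waveforms afterwards.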


Index Terms—Arabic language, attention mechanism, class imbalance, Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM), data augmentation, low-resource languages, Mel-Frequency Cepstral Coefficient (MFCC) features, speech emotion recognition



Cite: Sarmad H. Alfarag, "Hybrid CNN-LSTM Architecture with MFCC-Based Class-Balanced Augmentation for Arabic Speech Emotion Recognition," International Journal of Electrical and Electronic Engineering & Telecommunications, vol. 15, no. 3, pp. 192-204, 2026. doi: 10.18178/ijeetc.15.3.192-204


Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).