Hybrid CNN-LSTM Architecture with MFCC-Based Class-Balanced Augmentation for Arabic Speech Emotion Recognition



IJEETC 2026 Vol.15(3): 192-204
doi: 10.18178/ijeetc.15.3.192-204

Sarmad H. Alfarag

Electrical Engineering Department, Wasit University, Wasit, Iraq
Email: sarmad.hamad@uowasit.edu.iq (S.H.A.)
*Corresponding author

Manuscript received January 1, 2026; revised April 1, 2026; accepted April 21, 2026

Abstract—Arabic Speech Emotion Recognition (SER) faces serious challenges, including dialectal variety, severe class imbalance, and a lack of sufficient data. To address these concerns, this paper presents a comprehensive methodology that integrates a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) architecture with strategic data augmentation. The method is evaluated on two Arabic emotional speech datasets: the Basic Arabic Vocal Emotions Dataset (BAVED), which serves as the training data, and Eastern Youngsters Arabic Speech Emotions (EYASE), which serves as an independent validation set. Class-balanced augmentation techniques such as pitch shifting, time stretching, and noise injection are used to expand the severely imbalanced BAVED dataset to approximately 1,600 samples per emotion class. The hybrid architecture combines a convolutional neural network, which extracts spatial features from Mel-Frequency Cepstral Coefficient (MFCC) representations, with bidirectional Long Short-Term Memory (LSTM) networks and an attention mechanism, which model the temporal sequence. The experimental findings show a substantial performance improvement: overall accuracy on BAVED rises from the 89% baseline to 97.23%. Cross-dataset testing on EYASE yields 87% accuracy, indicating strong generalization across different speakers and recording conditions. The findings demonstrate that improving data quality through balanced augmentation has a greater impact on performance than architectural complexity alone, providing useful guidance for the development of Arabic emotion recognition systems in environments with limited resources.
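The class-balancing step named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the paper's actual pipeline, so this is not the author's implementation: the function names `plan_balanced_augmentation` and `add_noise`, the round-robin choice of augmentation, and the SNR-based noise level are all assumptions made for illustration. The sketch only plans which source clips to augment so that every class reaches a common target count, and shows one of the three augmentations (noise injection) on a raw sample list. Pitch shifting and time stretching would normally be delegated to an audio library.

```python
import math
import random

# Augmentation names taken from the abstract; how they are scheduled is an assumption.
AUGMENTATIONS = ("pitch_shift", "time_stretch", "noise_injection")


def plan_balanced_augmentation(labels, target_per_class, seed=0):
    """Plan augmented copies so every class reaches target_per_class samples.

    Returns a list of (source_index, augmentation_name) pairs. Classes that
    already meet the target receive no augmented copies.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    plan = []
    for label in sorted(by_class):
        indices = by_class[label]
        deficit = max(0, target_per_class - len(indices))
        for k in range(deficit):
            source = rng.choice(indices)                  # random clip from this class
            augmentation = AUGMENTATIONS[k % len(AUGMENTATIONS)]  # round-robin (assumed)
            plan.append((source, augmentation))
    return plan


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise to a non-empty signal at the given SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]
```

Calling `plan_balanced_augmentation(labels, 1600)` on a label list would return one entry per missing sample, so the minority classes receive the most augmented copies while the largest class may receive none. The plan is separate from the signal processing so that the class balance can be checked before any audio is generated.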

Index Terms—Arabic language, attention mechanism, class imbalance, Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM), data augmentation, low-resource languages, Mel-Frequency Cepstral Coefficient (MFCC) features, speech emotion recognition

Cite: Sarmad H. Alfarag, "Hybrid CNN-LSTM Architecture with MFCC-Based Class-Balanced Augmentation for Arabic Speech Emotion Recognition," International Journal of Electrical and Electronic Engineering & Telecommunications, vol. 15, no. 3, pp. 192-204, 2026. doi: 10.18178/ijeetc.15.3.192-204

Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
