# Importing necessary libraries
import os # Operating system functionalities
import librosa # Audio processing library
import wave # Module for reading and writing WAV files
import numpy as np # Numerical operations library
import pandas as pd # Data manipulation library
import matplotlib.pyplot as plt # Plotting library
# Importing components for Dividing into the Training Set and the Testing Set
from sklearn.model_selection import train_test_split # Splitting the dataset for training and testing
# Importing components for Long Short-Term Memory (LSTM) Classifier
import tensorflow.keras as keras # High-level neural networks API
from tensorflow.keras.utils import to_categorical # Utility for one-hot encoding
from tensorflow.keras.models import Sequential # Sequential model for stacking layers
from tensorflow.keras.layers import LSTM, Dense, Dropout, Activation # Layers used to build the network
from tensorflow.keras.optimizers import RMSprop # Optimizer option (the model below is compiled with Adam)
To start my Real-Time Speech Emotion Recognition project, I import the necessary libraries and components for building and evaluating machine learning models.
For the Deep Learning part of my project, I bring in components related to the Long Short-Term Memory (LSTM) Classifier using Keras.
These imported libraries and components form the foundation for my Real-Time Speech Emotion Recognition project, enabling me to handle audio data, split datasets, build LSTM models, and assess their performance.
As I embark on developing my Real-Time Speech Emotion Recognition project, I've opted to utilize the "ravdess-emotional-speech-audio" dataset due to its richness and suitability for training emotion recognition models.
# Ryerson Audio-Visual Database of Emotional Speech and Song (ravdess)
The "ravdess-emotional-speech-audio" dataset is a resource I've carefully chosen for its comprehensive coverage of emotional speech. It is a creation of Ryerson University and boasts a total of 1440 audio files, each lasting approximately 3-5 seconds.
# Diverse Emotional States
One of the strengths of this dataset is its diverse set of emotional states, including neutral, calm, happy, sad, angry, fearful, disgust, and surprised. Professional actors were involved in the creation, and they followed scripted scenarios to express these emotions, ensuring a controlled and standardized representation.
# Actor Diversity
The dataset encompasses 24 actors, split evenly between male and female, contributing to the richness of vocal characteristics, accents, and expressive styles. This diversity is instrumental in building a robust Speech Emotion Recognition (SER) model capable of handling variations encountered in real-world scenarios.
# Audio Characteristics
The audio recordings are sampled at a rate of 48 kHz and saved in the WAV file format, maintaining a high-quality standard suitable for training deep neural networks. Additionally, the dataset provides a corresponding CSV file containing metadata such as emotion labels, actor information, file paths, and file names. This metadata is invaluable for data preprocessing and model evaluation.
# Focus on Speech Segments
Given my project's emphasis on real-time speech emotion recognition, I've chosen to concentrate specifically on the speech segments within the dataset. This focused approach aligns more closely with the application domain, making it particularly relevant for applications like virtual assistants, customer service, and mental health support.
# Conclusion
In summary, the "ravdess-emotional-speech-audio" dataset stands out as a comprehensive and well-annotated resource for training and evaluating Real-Time Speech Emotion Recognition models. Its diverse emotions, multiple actors, and high-quality audio recordings make it an ideal choice for developing a robust and effective emotion recognition system tailored to my project's objectives.
def extract_mfcc(wav_file_name):
    '''This function retrieves the mean of the MFCC features from an input WAV file located
    at the specified path. The input is the path to the WAV file, and the output is
    the resulting MFCC feature vector.'''
    # Loading the WAV file using librosa and obtaining the audio signal (y) and sampling rate (sr)
    y, sr = librosa.load(wav_file_name)
    # Extracting 40 MFCC coefficients and computing their mean across time frames
    mfccs = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    # Returning the resulting 40-dimensional MFCC feature vector
    return mfccs
# Lists to store labels and extracted MFCC features for the Ravdess emotional speech dataset
ravdess_speech_labels = []
ravdess_speech_data = []
# Iterating through the files in the specified directory
for dirname, _, filenames in os.walk('./ravdess-emotional-speech-audio/'):
    for filename in filenames:
        # Skipping anything that is not a WAV file
        if not filename.endswith('.wav'):
            continue
        # Extracting the emotion code (third field of the filename, values 01-08) and converting it to a zero-based integer label
        ravdess_speech_labels.append(int(filename[7:8]) - 1)
        # Obtaining the full path of the WAV file
        wav_file_name = os.path.join(dirname, filename)
        # Extracting MFCC features from the WAV file using the previously defined function
        ravdess_speech_data.append(extract_mfcc(wav_file_name))
In my Real-Time Speech Emotion Recognition project, the extract_mfcc function plays a central role. It takes the path to a WAV file as input and returns the mean of the Mel-Frequency Cepstral Coefficient (MFCC) features. It uses the librosa library to load the WAV file, obtaining the audio signal (y) and the sampling rate (sr).
Next, I use librosa again to extract the MFCC features, setting the number of coefficients (n_mfcc) to 40. The resulting feature matrix is transposed and its mean is computed across time frames, producing a single 40-dimensional vector per file.
Finally, the function returns the computed MFCC features.
Moving on to the main portion of the code, I populate two lists (ravdess_speech_labels and ravdess_speech_data) to store the emotion labels and the corresponding MFCC features for each WAV file in the Ravdess emotional speech dataset.
The loop walks through the files in the specified directory, extracting the emotion label from each filename and converting it to an integer, then obtaining the full path of each WAV file and extracting the corresponding MFCC features with the previously defined extract_mfcc function, as checked on a single file below. The resulting lists are crucial components for training and evaluating my Real-Time Speech Emotion Recognition model.
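To make the output of extract_mfcc concrete, here is a small sanity check on a single recording; the file path below is hypothetical and should be replaced with an actual file from the dataset.
# Minimal sanity check on a single recording (hypothetical path).
sample_path = './ravdess-emotional-speech-audio/Actor_01/03-01-01-01-01-01-01.wav'
sample_mfcc = extract_mfcc(sample_path)
# Each file is summarized by a single 40-dimensional MFCC vector.
print(sample_mfcc.shape)   # expected: (40,)
print(sample_mfcc[:5])     # first few averaged coefficients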
# Converting the list of MFCC features into a NumPy array
ravdess_speech_data_array = np.asarray(ravdess_speech_data)
# Converting the list of emotion labels into a NumPy array
ravdess_speech_label_array = np.array(ravdess_speech_labels)
# Converting the integer labels into categorical format using one-hot encoding
labels_categorical = to_categorical(ravdess_speech_label_array)
# Displaying the shapes of the MFCC data array and the categorical label array
ravdess_speech_data_array.shape, labels_categorical.shape
((2880, 40), (2880, 8))
In my Real-Time Speech Emotion Recognition project, I've reached a crucial stage where I prepare the data for training my machine learning model.
Here, I convert the list of extracted Mel-Frequency Cepstral Coefficient (MFCC) features (ravdess_speech_data) into a NumPy array (ravdess_speech_data_array). This transformation is essential for efficient data handling and compatibility with machine learning algorithms.
Similarly, I convert the list of emotion labels (ravdess_speech_labels) into a NumPy array (ravdess_speech_label_array), which serves as the ground truth for training my model. The integer labels are then turned into one-hot vectors (labels_categorical) with to_categorical, as illustrated by the small example below.
Finally, I display the shapes of the MFCC data array (ravdess_speech_data_array) and the categorical label array (labels_categorical). This is a quick check that the data has been processed correctly and is ready for training the Real-Time Speech Emotion Recognition model: each sample is a 40-dimensional MFCC vector and each label is an 8-dimensional one-hot vector.
# Splitting the dataset into training and testing sets using train_test_split
x_train, x_test, y_train, y_test = train_test_split(np.array(ravdess_speech_data_array),
labels_categorical, test_size=0.2,
random_state=9)
# Calculating the total number of samples in the dataset
number_of_samples = ravdess_speech_data_array.shape[0]
# Determining the number of samples for training, validation, and testing sets
training_samples = int(number_of_samples * 0.8)
validation_samples = int(number_of_samples * 0.1)
test_samples = int(number_of_samples * 0.1)
In my Real-Time Speech Emotion Recognition project, I've now reached the crucial step of splitting my dataset into training and testing sets. I do this with the train_test_split function, assigning the features to x_train and x_test and the labels to y_train and y_test. I've opted for an 80-20 split, designating 20% of the data for testing, and I've set a random seed (random_state=9) for reproducibility.
After this split, I calculate the total number of samples in my dataset (number_of_samples). This figure determines how many samples I allocate to the training, validation, and testing sets based on a predefined distribution ratio.
Next, I determine the number of samples for each set: 80% of the total for training, and 10% each for validation and testing, as sketched below.
This allocation gives me a well-balanced dataset for training and evaluating my Real-Time Speech Emotion Recognition model.
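To make explicit how these counts translate into the slices used later for fitting and evaluation, here is a small sketch of the resulting index ranges, assuming the 2880 samples reported above; the slice variables are my own naming.
# Sketch of how the 80/10/10 counts carve the arrays into contiguous slices.
number_of_samples = ravdess_speech_data_array.shape[0]                        # 2880 in this run
training_samples = int(number_of_samples * 0.8)                               # 2304
validation_samples = int(number_of_samples * 0.1)                             # 288
train_slice = slice(0, training_samples)                                      # samples [0, 2304)
val_slice = slice(training_samples, training_samples + validation_samples)    # samples [2304, 2592)
test_slice = slice(training_samples + validation_samples, number_of_samples)  # samples [2592, 2880)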
# Function to create an LSTM model for Speech Emotion Recognition
def create_model_LSTM():
    # Initializing a sequential model
    model = Sequential()
    # Adding an LSTM layer with 128 units, not returning sequences, and input shape of (40, 1)
    model.add(LSTM(128, return_sequences=False, input_shape=(40, 1)))
    # Adding a Dense layer with 64 units
    model.add(Dense(64))
    # Adding a Dropout layer with a dropout rate of 40%
    model.add(Dropout(0.4))
    # Adding an Activation layer with ReLU activation function
    model.add(Activation('relu'))
    # Adding another Dense layer with 32 units
    model.add(Dense(32))
    # Adding a Dropout layer with a dropout rate of 40%
    model.add(Dropout(0.4))
    # Adding an Activation layer with ReLU activation function
    model.add(Activation('relu'))
    # Adding a Dense layer with 8 units, one per emotion class
    model.add(Dense(8))
    # Adding an Activation layer with softmax activation function for multiclass classification
    model.add(Activation('softmax'))
    # Compiling the model with categorical crossentropy loss, Adam optimizer, and accuracy metric
    model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
    # Printing a summary of the architecture and returning the compiled model
    model.summary()
    return model
In my Real-Time Speech Emotion Recognition project, I've created a dedicated function, create_model_LSTM, to define the architecture of my LSTM (Long Short-Term Memory) neural network for the task at hand.
# Initializing a sequential model
model = Sequential()
I begin by initializing a sequential model, which allows me to build the neural network layer by layer in a sequential manner.
# Adding an LSTM layer with 128 units, not returning sequences, and input shape of (40, 1)
model.add(LSTM(128, return_sequences=False, input_shape=(40, 1)))
The first layer is an LSTM layer with 128 units. It doesn't return sequences, and it expects input data with a shape of (40, 1), treating the 40 Mel-Frequency Cepstral Coefficients (MFCC) as a sequence of 40 timesteps with one feature each.
# Adding a Dense layer with 64 units
model.add(Dense(64))
Following the LSTM layer, I add a Dense layer with 64 units, which projects the 128-dimensional LSTM output down to 64 features.
# Adding a Dropout layer with a dropout rate of 40%
model.add(Dropout(0.4))
To prevent overfitting, I include a Dropout layer with a dropout rate of 40%, which randomly drops a proportion of connections during training.
# Adding an Activation layer with ReLU activation function
model.add(Activation('relu'))
An Activation layer with the Rectified Linear Unit (ReLU) activation function is added to introduce non-linearity to the model.
# Adding another Dense layer with 32 units
model.add(Dense(32))
I continue by adding another Dense layer with 32 units, further compressing the representation to 32 features.
# Adding a Dropout layer with a dropout rate of 40%
model.add(Dropout(0.4))
Again, to mitigate overfitting, I include another Dropout layer with a 40% dropout rate.
# Adding an Activation layer with ReLU activation function
model.add(Activation('relu'))
Another Activation layer with the ReLU activation function follows to enhance the non-linear characteristics of the model.
# Adding another Dense layer with 8 units
model.add(Dense(8))
I then add a final Dense layer with 8 units, one per emotion class, producing a score for each of the eight emotions.
# Adding an Activation layer with softmax activation function for multiclass classification
model.add(Activation('softmax'))
The final layer is an Activation layer with the softmax activation function, which converts the eight scores into a probability distribution over the emotion classes, as required for multiclass classification tasks like emotion recognition.
# Compiling the model with categorical crossentropy loss, Adam optimizer, and accuracy metric
model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
I compile the model with categorical crossentropy loss, the Adam optimizer, and use accuracy as the metric to optimize during training.
# Returning the compiled model
return model
The function concludes by printing a summary and returning the compiled LSTM model, ready for training and evaluation on the Real-Time Speech Emotion Recognition dataset.
LSTM_model = create_model_LSTM()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= lstm (LSTM) (None, 128) 66560 dense (Dense) (None, 64) 8256 dropout (Dropout) (None, 64) 0 activation (Activation) (None, 64) 0 dense_1 (Dense) (None, 32) 2080 dropout_1 (Dropout) (None, 32) 0 activation_1 (Activation) (None, 32) 0 dense_2 (Dense) (None, 8) 264 activation_2 (Activation) (None, 8) 0 ================================================================= Total params: 77,160 Trainable params: 77,160 Non-trainable params: 0 _________________________________________________________________
LSTM_model_history = LSTM_model.fit(np.expand_dims(ravdess_speech_data_array[:training_samples],-1),
labels_categorical[:training_samples],
validation_data=(np.expand_dims(ravdess_speech_data_array[training_samples:training_samples+validation_samples], -1),
labels_categorical[training_samples:training_samples+validation_samples]), epochs=121, shuffle=True)
Epoch 1/121 72/72 [==============================] - 8s 53ms/step - loss: 2.0696 - accuracy: 0.1363 - val_loss: 2.0259 - val_accuracy: 0.2153 Epoch 2/121 72/72 [==============================] - 3s 42ms/step - loss: 2.0091 - accuracy: 0.1984 - val_loss: 1.9771 - val_accuracy: 0.2326 Epoch 3/121 72/72 [==============================] - 3s 41ms/step - loss: 1.9677 - accuracy: 0.2196 - val_loss: 1.8924 - val_accuracy: 0.2778 Epoch 4/121 72/72 [==============================] - 3s 41ms/step - loss: 1.9146 - accuracy: 0.2565 - val_loss: 1.8456 - val_accuracy: 0.3056 Epoch 5/121 72/72 [==============================] - 3s 41ms/step - loss: 1.8861 - accuracy: 0.2569 - val_loss: 1.8441 - val_accuracy: 0.2778 Epoch 6/121 72/72 [==============================] - 3s 40ms/step - loss: 1.8695 - accuracy: 0.2769 - val_loss: 1.8072 - val_accuracy: 0.3264 Epoch 7/121 72/72 [==============================] - 3s 41ms/step - loss: 1.8603 - accuracy: 0.2765 - val_loss: 1.7774 - val_accuracy: 0.3368 Epoch 8/121 72/72 [==============================] - 3s 41ms/step - loss: 1.8313 - accuracy: 0.2930 - val_loss: 1.8049 - val_accuracy: 0.3090 Epoch 9/121 72/72 [==============================] - 3s 40ms/step - loss: 1.8100 - accuracy: 0.3069 - val_loss: 1.7727 - val_accuracy: 0.2708 Epoch 10/121 72/72 [==============================] - 3s 39ms/step - loss: 1.8080 - accuracy: 0.2917 - val_loss: 1.7702 - val_accuracy: 0.2951 Epoch 11/121 72/72 [==============================] - 3s 41ms/step - loss: 1.7895 - accuracy: 0.3016 - val_loss: 1.7623 - val_accuracy: 0.3299 Epoch 12/121 72/72 [==============================] - 3s 41ms/step - loss: 1.7833 - accuracy: 0.3025 - val_loss: 1.7561 - val_accuracy: 0.3299 Epoch 13/121 72/72 [==============================] - 3s 41ms/step - loss: 1.7641 - accuracy: 0.3164 - val_loss: 1.7261 - val_accuracy: 0.3333 Epoch 14/121 72/72 [==============================] - 3s 41ms/step - loss: 1.7287 - accuracy: 0.3299 - val_loss: 1.7674 - val_accuracy: 0.3299 Epoch 15/121 72/72 [==============================] - 3s 41ms/step - loss: 1.7244 - accuracy: 0.3312 - val_loss: 1.6747 - val_accuracy: 0.3854 Epoch 16/121 72/72 [==============================] - 3s 41ms/step - loss: 1.6793 - accuracy: 0.3503 - val_loss: 1.6991 - val_accuracy: 0.3438 Epoch 17/121 72/72 [==============================] - 3s 42ms/step - loss: 1.6673 - accuracy: 0.3594 - val_loss: 1.6661 - val_accuracy: 0.3681 Epoch 18/121 72/72 [==============================] - 3s 41ms/step - loss: 1.6450 - accuracy: 0.3581 - val_loss: 1.6624 - val_accuracy: 0.3750 Epoch 19/121 72/72 [==============================] - 3s 41ms/step - loss: 1.6269 - accuracy: 0.3785 - val_loss: 1.6072 - val_accuracy: 0.3889 Epoch 20/121 72/72 [==============================] - 3s 39ms/step - loss: 1.6089 - accuracy: 0.3711 - val_loss: 1.5811 - val_accuracy: 0.3958 Epoch 21/121 72/72 [==============================] - 3s 41ms/step - loss: 1.5814 - accuracy: 0.3997 - val_loss: 1.5690 - val_accuracy: 0.3993 Epoch 22/121 72/72 [==============================] - 3s 41ms/step - loss: 1.5835 - accuracy: 0.3928 - val_loss: 1.5944 - val_accuracy: 0.3854 Epoch 23/121 72/72 [==============================] - 3s 42ms/step - loss: 1.5378 - accuracy: 0.4054 - val_loss: 1.5394 - val_accuracy: 0.4132 Epoch 24/121 72/72 [==============================] - 3s 42ms/step - loss: 1.5347 - accuracy: 0.4097 - val_loss: 1.5296 - val_accuracy: 0.4340 Epoch 25/121 72/72 [==============================] - 3s 41ms/step - loss: 1.5100 - accuracy: 0.4210 - val_loss: 1.5107 - 
val_accuracy: 0.4201 Epoch 26/121 72/72 [==============================] - 3s 41ms/step - loss: 1.4782 - accuracy: 0.4227 - val_loss: 1.4898 - val_accuracy: 0.4271 Epoch 27/121 72/72 [==============================] - 3s 41ms/step - loss: 1.4612 - accuracy: 0.4423 - val_loss: 1.4778 - val_accuracy: 0.4306 Epoch 28/121 72/72 [==============================] - 3s 41ms/step - loss: 1.4229 - accuracy: 0.4640 - val_loss: 1.4724 - val_accuracy: 0.4201 Epoch 29/121 72/72 [==============================] - 3s 42ms/step - loss: 1.4619 - accuracy: 0.4371 - val_loss: 1.5130 - val_accuracy: 0.4167 Epoch 30/121 72/72 [==============================] - 3s 41ms/step - loss: 1.4014 - accuracy: 0.4540 - val_loss: 1.4285 - val_accuracy: 0.4653 Epoch 31/121 72/72 [==============================] - 3s 42ms/step - loss: 1.3514 - accuracy: 0.4896 - val_loss: 1.3688 - val_accuracy: 0.4792 Epoch 32/121 72/72 [==============================] - 3s 42ms/step - loss: 1.3425 - accuracy: 0.4865 - val_loss: 1.3744 - val_accuracy: 0.4792 Epoch 33/121 72/72 [==============================] - 3s 42ms/step - loss: 1.2947 - accuracy: 0.5095 - val_loss: 1.3537 - val_accuracy: 0.4896 Epoch 34/121 72/72 [==============================] - 3s 42ms/step - loss: 1.2964 - accuracy: 0.5152 - val_loss: 1.3470 - val_accuracy: 0.5000 Epoch 35/121 72/72 [==============================] - 3s 42ms/step - loss: 1.2199 - accuracy: 0.5334 - val_loss: 1.3563 - val_accuracy: 0.4618 Epoch 36/121 72/72 [==============================] - 3s 41ms/step - loss: 1.2530 - accuracy: 0.5165 - val_loss: 1.3161 - val_accuracy: 0.5312 Epoch 37/121 72/72 [==============================] - 3s 41ms/step - loss: 1.1611 - accuracy: 0.5516 - val_loss: 1.2486 - val_accuracy: 0.5347 Epoch 38/121 72/72 [==============================] - 3s 41ms/step - loss: 1.1653 - accuracy: 0.5664 - val_loss: 1.2480 - val_accuracy: 0.5417 Epoch 39/121 72/72 [==============================] - 3s 41ms/step - loss: 1.1000 - accuracy: 0.5864 - val_loss: 1.2332 - val_accuracy: 0.5347 Epoch 40/121 72/72 [==============================] - 3s 41ms/step - loss: 1.0955 - accuracy: 0.5911 - val_loss: 1.2187 - val_accuracy: 0.5660 Epoch 41/121 72/72 [==============================] - 3s 41ms/step - loss: 1.1107 - accuracy: 0.5885 - val_loss: 1.1795 - val_accuracy: 0.5521 Epoch 42/121 72/72 [==============================] - 3s 41ms/step - loss: 1.0660 - accuracy: 0.6063 - val_loss: 1.1367 - val_accuracy: 0.5382 Epoch 43/121 72/72 [==============================] - 3s 42ms/step - loss: 1.0462 - accuracy: 0.6050 - val_loss: 1.2106 - val_accuracy: 0.5764 Epoch 44/121 72/72 [==============================] - 3s 40ms/step - loss: 1.0979 - accuracy: 0.5833 - val_loss: 1.1043 - val_accuracy: 0.5660 Epoch 45/121 72/72 [==============================] - 3s 41ms/step - loss: 1.0243 - accuracy: 0.6272 - val_loss: 1.1297 - val_accuracy: 0.5660 Epoch 46/121 72/72 [==============================] - 3s 42ms/step - loss: 0.9652 - accuracy: 0.6315 - val_loss: 0.9804 - val_accuracy: 0.6319 Epoch 47/121 72/72 [==============================] - 3s 41ms/step - loss: 0.9488 - accuracy: 0.6454 - val_loss: 1.0477 - val_accuracy: 0.5833 Epoch 48/121 72/72 [==============================] - 3s 43ms/step - loss: 0.9552 - accuracy: 0.6376 - val_loss: 1.1041 - val_accuracy: 0.5972 Epoch 49/121 72/72 [==============================] - 3s 42ms/step - loss: 0.9170 - accuracy: 0.6619 - val_loss: 0.9763 - val_accuracy: 0.6250 Epoch 50/121 72/72 [==============================] - 3s 42ms/step - loss: 0.9014 - accuracy: 0.6762 
- val_loss: 1.0695 - val_accuracy: 0.5833 Epoch 51/121 72/72 [==============================] - 3s 43ms/step - loss: 0.8939 - accuracy: 0.6832 - val_loss: 0.9698 - val_accuracy: 0.6111 Epoch 52/121 72/72 [==============================] - 3s 44ms/step - loss: 0.8112 - accuracy: 0.7023 - val_loss: 0.9391 - val_accuracy: 0.6597 Epoch 53/121 72/72 [==============================] - 3s 42ms/step - loss: 0.9158 - accuracy: 0.6745 - val_loss: 1.0199 - val_accuracy: 0.6250 Epoch 54/121 72/72 [==============================] - 3s 42ms/step - loss: 0.7781 - accuracy: 0.7023 - val_loss: 0.9108 - val_accuracy: 0.6667 Epoch 55/121 72/72 [==============================] - 3s 43ms/step - loss: 0.7065 - accuracy: 0.7418 - val_loss: 0.8521 - val_accuracy: 0.6910 Epoch 56/121 72/72 [==============================] - 3s 41ms/step - loss: 0.7191 - accuracy: 0.7374 - val_loss: 0.7885 - val_accuracy: 0.6875 Epoch 57/121 72/72 [==============================] - 3s 41ms/step - loss: 0.7547 - accuracy: 0.7326 - val_loss: 0.8362 - val_accuracy: 0.6840 Epoch 58/121 72/72 [==============================] - 3s 42ms/step - loss: 0.7555 - accuracy: 0.7339 - val_loss: 0.9854 - val_accuracy: 0.6597 Epoch 59/121 72/72 [==============================] - 3s 42ms/step - loss: 0.8013 - accuracy: 0.7140 - val_loss: 0.8060 - val_accuracy: 0.7292 Epoch 60/121 72/72 [==============================] - 3s 41ms/step - loss: 0.6573 - accuracy: 0.7565 - val_loss: 0.7713 - val_accuracy: 0.6979 Epoch 61/121 72/72 [==============================] - 3s 42ms/step - loss: 0.8102 - accuracy: 0.7201 - val_loss: 0.8932 - val_accuracy: 0.6667 Epoch 62/121 72/72 [==============================] - 3s 42ms/step - loss: 0.7555 - accuracy: 0.7387 - val_loss: 0.7613 - val_accuracy: 0.7535 Epoch 63/121 72/72 [==============================] - 3s 41ms/step - loss: 0.6703 - accuracy: 0.7556 - val_loss: 0.8348 - val_accuracy: 0.7014 Epoch 64/121 72/72 [==============================] - 3s 41ms/step - loss: 0.5964 - accuracy: 0.7960 - val_loss: 0.7320 - val_accuracy: 0.7535 Epoch 65/121 72/72 [==============================] - 3s 41ms/step - loss: 0.5646 - accuracy: 0.7982 - val_loss: 0.6981 - val_accuracy: 0.7361 Epoch 66/121 72/72 [==============================] - 3s 42ms/step - loss: 0.5011 - accuracy: 0.8234 - val_loss: 0.6671 - val_accuracy: 0.7535 Epoch 67/121 72/72 [==============================] - 3s 43ms/step - loss: 0.5370 - accuracy: 0.8030 - val_loss: 0.6478 - val_accuracy: 0.7639 Epoch 68/121 72/72 [==============================] - 3s 42ms/step - loss: 0.5111 - accuracy: 0.8194 - val_loss: 0.7605 - val_accuracy: 0.7604 Epoch 69/121 72/72 [==============================] - 3s 42ms/step - loss: 0.6034 - accuracy: 0.7852 - val_loss: 0.6908 - val_accuracy: 0.7292 Epoch 70/121 72/72 [==============================] - 3s 40ms/step - loss: 0.5545 - accuracy: 0.8060 - val_loss: 0.6812 - val_accuracy: 0.7500 Epoch 71/121 72/72 [==============================] - 3s 41ms/step - loss: 0.4615 - accuracy: 0.8338 - val_loss: 0.5144 - val_accuracy: 0.8160 Epoch 72/121 72/72 [==============================] - 3s 42ms/step - loss: 0.4493 - accuracy: 0.8520 - val_loss: 0.5177 - val_accuracy: 0.8125 Epoch 73/121 72/72 [==============================] - 3s 41ms/step - loss: 0.5128 - accuracy: 0.8381 - val_loss: 0.5789 - val_accuracy: 0.7917 Epoch 74/121 72/72 [==============================] - 3s 41ms/step - loss: 0.4825 - accuracy: 0.8394 - val_loss: 0.7788 - val_accuracy: 0.7535 Epoch 75/121 72/72 [==============================] - 3s 41ms/step - loss: 
0.4428 - accuracy: 0.8529 - val_loss: 0.4903 - val_accuracy: 0.8333 Epoch 76/121 72/72 [==============================] - 3s 42ms/step - loss: 0.3746 - accuracy: 0.8750 - val_loss: 0.4384 - val_accuracy: 0.8264 Epoch 77/121 72/72 [==============================] - 3s 41ms/step - loss: 0.5482 - accuracy: 0.8151 - val_loss: 0.5564 - val_accuracy: 0.7847 Epoch 78/121 72/72 [==============================] - 3s 41ms/step - loss: 0.4711 - accuracy: 0.8433 - val_loss: 0.4552 - val_accuracy: 0.8368 Epoch 79/121 72/72 [==============================] - 3s 41ms/step - loss: 0.3906 - accuracy: 0.8681 - val_loss: 0.3662 - val_accuracy: 0.8681 Epoch 80/121 72/72 [==============================] - 3s 42ms/step - loss: 0.3117 - accuracy: 0.8924 - val_loss: 0.3203 - val_accuracy: 0.8715 Epoch 81/121 72/72 [==============================] - 3s 40ms/step - loss: 0.2991 - accuracy: 0.9019 - val_loss: 0.6930 - val_accuracy: 0.7986 Epoch 82/121 72/72 [==============================] - 3s 41ms/step - loss: 0.7354 - accuracy: 0.7821 - val_loss: 0.7330 - val_accuracy: 0.7326 Epoch 83/121 72/72 [==============================] - 3s 41ms/step - loss: 0.5355 - accuracy: 0.8142 - val_loss: 0.4592 - val_accuracy: 0.8229 Epoch 84/121 72/72 [==============================] - 3s 42ms/step - loss: 0.3769 - accuracy: 0.8728 - val_loss: 0.3892 - val_accuracy: 0.8368 Epoch 85/121 72/72 [==============================] - 3s 41ms/step - loss: 0.2943 - accuracy: 0.9006 - val_loss: 0.4188 - val_accuracy: 0.8542 Epoch 86/121 72/72 [==============================] - 3s 40ms/step - loss: 0.2862 - accuracy: 0.9023 - val_loss: 0.4013 - val_accuracy: 0.8368 Epoch 87/121 72/72 [==============================] - 3s 40ms/step - loss: 0.2716 - accuracy: 0.9119 - val_loss: 0.3176 - val_accuracy: 0.8681 Epoch 88/121 72/72 [==============================] - 3s 40ms/step - loss: 0.2658 - accuracy: 0.9084 - val_loss: 0.2618 - val_accuracy: 0.9132 Epoch 89/121 72/72 [==============================] - 3s 41ms/step - loss: 0.2583 - accuracy: 0.9184 - val_loss: 0.2542 - val_accuracy: 0.9062 Epoch 90/121 72/72 [==============================] - 3s 41ms/step - loss: 0.2715 - accuracy: 0.9058 - val_loss: 0.2833 - val_accuracy: 0.8993 Epoch 91/121 72/72 [==============================] - 3s 42ms/step - loss: 0.3106 - accuracy: 0.9045 - val_loss: 0.4432 - val_accuracy: 0.8403 Epoch 92/121 72/72 [==============================] - 3s 41ms/step - loss: 0.3749 - accuracy: 0.8806 - val_loss: 0.5243 - val_accuracy: 0.8299 Epoch 93/121 72/72 [==============================] - 3s 42ms/step - loss: 0.3316 - accuracy: 0.8980 - val_loss: 0.3444 - val_accuracy: 0.8715 Epoch 94/121 72/72 [==============================] - 3s 42ms/step - loss: 0.2420 - accuracy: 0.9214 - val_loss: 0.4278 - val_accuracy: 0.8715 Epoch 95/121 72/72 [==============================] - 3s 41ms/step - loss: 0.3253 - accuracy: 0.9019 - val_loss: 0.4955 - val_accuracy: 0.8333 Epoch 96/121 72/72 [==============================] - 3s 41ms/step - loss: 0.3456 - accuracy: 0.8902 - val_loss: 0.3410 - val_accuracy: 0.8681 Epoch 97/121 72/72 [==============================] - 3s 41ms/step - loss: 0.2441 - accuracy: 0.9162 - val_loss: 0.2478 - val_accuracy: 0.9167 Epoch 98/121 72/72 [==============================] - 3s 40ms/step - loss: 0.1982 - accuracy: 0.9423 - val_loss: 0.2683 - val_accuracy: 0.9028 Epoch 99/121 72/72 [==============================] - 3s 40ms/step - loss: 0.2195 - accuracy: 0.9266 - val_loss: 0.3958 - val_accuracy: 0.9028 Epoch 100/121 72/72 [==============================] - 
3s 42ms/step - loss: 0.2549 - accuracy: 0.9262 - val_loss: 0.3111 - val_accuracy: 0.8958 Epoch 101/121 72/72 [==============================] - 3s 40ms/step - loss: 0.2779 - accuracy: 0.9167 - val_loss: 0.2111 - val_accuracy: 0.9306 Epoch 102/121 72/72 [==============================] - 3s 41ms/step - loss: 0.1911 - accuracy: 0.9362 - val_loss: 0.1560 - val_accuracy: 0.9444 Epoch 103/121 72/72 [==============================] - 3s 41ms/step - loss: 0.3058 - accuracy: 0.9210 - val_loss: 0.4017 - val_accuracy: 0.8854 Epoch 104/121 72/72 [==============================] - 3s 40ms/step - loss: 0.3404 - accuracy: 0.9028 - val_loss: 0.2192 - val_accuracy: 0.9201 Epoch 105/121 72/72 [==============================] - 3s 41ms/step - loss: 0.1834 - accuracy: 0.9457 - val_loss: 0.1900 - val_accuracy: 0.9444 Epoch 106/121 72/72 [==============================] - 3s 42ms/step - loss: 0.1556 - accuracy: 0.9501 - val_loss: 0.1364 - val_accuracy: 0.9444 Epoch 107/121 72/72 [==============================] - 3s 40ms/step - loss: 0.1935 - accuracy: 0.9475 - val_loss: 0.1739 - val_accuracy: 0.9375 Epoch 108/121 72/72 [==============================] - 3s 40ms/step - loss: 0.4024 - accuracy: 0.8915 - val_loss: 0.3481 - val_accuracy: 0.8854 Epoch 109/121 72/72 [==============================] - 3s 40ms/step - loss: 0.3336 - accuracy: 0.9054 - val_loss: 0.2738 - val_accuracy: 0.9062 Epoch 110/121 72/72 [==============================] - 3s 41ms/step - loss: 0.2766 - accuracy: 0.9253 - val_loss: 0.1745 - val_accuracy: 0.9340 Epoch 111/121 72/72 [==============================] - 3s 40ms/step - loss: 0.1486 - accuracy: 0.9527 - val_loss: 0.1006 - val_accuracy: 0.9688 Epoch 112/121 72/72 [==============================] - 3s 39ms/step - loss: 0.1097 - accuracy: 0.9688 - val_loss: 0.0689 - val_accuracy: 0.9896 Epoch 113/121 72/72 [==============================] - 3s 41ms/step - loss: 0.1067 - accuracy: 0.9679 - val_loss: 0.0684 - val_accuracy: 0.9861 Epoch 114/121 72/72 [==============================] - 3s 41ms/step - loss: 0.1164 - accuracy: 0.9653 - val_loss: 0.2498 - val_accuracy: 0.9306 Epoch 115/121 72/72 [==============================] - 3s 41ms/step - loss: 0.3515 - accuracy: 0.9049 - val_loss: 0.2616 - val_accuracy: 0.9167 Epoch 116/121 72/72 [==============================] - 3s 40ms/step - loss: 0.3345 - accuracy: 0.9049 - val_loss: 0.3736 - val_accuracy: 0.9132 Epoch 117/121 72/72 [==============================] - 3s 41ms/step - loss: 0.2500 - accuracy: 0.9314 - val_loss: 0.1787 - val_accuracy: 0.9444 Epoch 118/121 72/72 [==============================] - 3s 41ms/step - loss: 0.1579 - accuracy: 0.9544 - val_loss: 0.1088 - val_accuracy: 0.9688 Epoch 119/121 72/72 [==============================] - 3s 41ms/step - loss: 0.1505 - accuracy: 0.9566 - val_loss: 0.0695 - val_accuracy: 0.9757 Epoch 120/121 72/72 [==============================] - 3s 41ms/step - loss: 0.0911 - accuracy: 0.9727 - val_loss: 0.0549 - val_accuracy: 0.9861 Epoch 121/121 72/72 [==============================] - 3s 40ms/step - loss: 0.0843 - accuracy: 0.9774 - val_loss: 0.0231 - val_accuracy: 0.9965
In my Real-Time Speech Emotion Recognition project, I train my LSTM model using the fit method. Here's a breakdown of what each part of the call does:
np.expand_dims(ravdess_speech_data_array[:training_samples], -1): I expand the dimensions of the training data to make it compatible with the LSTM layer, which expects a 3D input of shape (batch, timesteps, features). The -1 argument adds an extra dimension at the end; a small sketch of this reshaping follows below.
labels_categorical[:training_samples]: the corresponding categorical emotion labels for the training data.
validation_data=(np.expand_dims(ravdess_speech_data_array[training_samples:training_samples+validation_samples], -1), labels_categorical[training_samples:training_samples+validation_samples]): the validation data and its labels, prepared in the same way as the training data.
epochs=121: I train the model for 121 epochs; this value can be adjusted to the specific training requirements.
shuffle=True: the training data is shuffled at each epoch to introduce randomness and prevent the model from memorizing the order of the samples.
The training history (LSTM_model_history) is stored, containing the training and validation loss and metric values for each epoch. This information is useful for evaluating the model's performance and deciding on further training or adjustments.
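As mentioned, the reshaping with np.expand_dims can be seen in isolation with a dummy array standing in for the MFCC features; this is only an illustrative sketch.
import numpy as np
# Dummy stand-in for the MFCC features: 3 samples, 40 coefficients each.
dummy = np.zeros((3, 40))
print(dummy.shape)                       # (3, 40)
print(np.expand_dims(dummy, -1).shape)   # (3, 40, 1) -> (batch, timesteps, features) for the LSTM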
def plot_metric(model_training_history, metric_name_1, metric_name_2, plot_name):
    '''
    This function creates a graph displaying the provided metrics.
    Parameters:
        model_training_history: A history object containing recorded training and validation
                                loss values and metric values across consecutive epochs.
        metric_name_1: The name of the first metric to be visualized in the graph.
        metric_name_2: The name of the second metric to be visualized in the graph.
        plot_name: The title of the graph.
    '''
    # Extract metric values from the training history.
    metric_value_1 = model_training_history.history[metric_name_1]
    metric_value_2 = model_training_history.history[metric_name_2]
    # Generate a range of epochs for the x-axis.
    epochs = range(len(metric_value_1))
    # Start a new figure so successive calls do not draw on the same axes.
    plt.figure()
    # Plot the first metric in blue.
    plt.plot(epochs, metric_value_1, 'blue', label=metric_name_1)
    # Plot the second metric in red.
    plt.plot(epochs, metric_value_2, 'red', label=metric_name_2)
    # Set the title of the graph.
    plt.title(str(plot_name))
    # Add a legend to the graph.
    plt.legend()
    # Display the graph.
    plt.show()
# Plot the training and validation loss metrics for visualization.
plot_metric(LSTM_model_history, 'loss', 'val_loss', 'Total Loss vs Total Validation Loss')
# Plot the training and validation accuracy metrics for visualization.
plot_metric(LSTM_model_history, 'accuracy', 'val_accuracy', 'Total Accuracy vs Total Validation Accuracy')
model_evaluation_history = LSTM_model.evaluate(np.expand_dims(ravdess_speech_data_array[training_samples + validation_samples:], -1),
labels_categorical[training_samples + validation_samples:])
9/9 [==============================] - 1s 21ms/step - loss: 0.0434 - accuracy: 0.9861
After training my LSTM model, I evaluate its performance on a separate set of data using the evaluate method. Let me explain this part of the code:
np.expand_dims(ravdess_speech_data_array[training_samples + validation_samples:], -1): I expand the dimensions of the evaluation data, just as I did during training, so that it matches the 3D input the LSTM model expects.
labels_categorical[training_samples + validation_samples:]: the corresponding categorical emotion labels for the evaluation data.
The output (9/9 [==============================] - 1s 21ms/step - loss: 0.0434 - accuracy: 0.9861) is the result of the evaluation:
9/9: the evaluation was performed on 9 batches.
[==============================]: the progress bar showing the evaluation of those batches.
1s 21ms/step: evaluation took roughly 1 second in total, at about 21 ms per step, where each step corresponds to one batch.
loss: 0.0434: the loss computed on the evaluation data; lower values indicate a better fit to the held-out samples.
accuracy: 0.9861: the model classifies 98.61% of the evaluation samples correctly, a key metric for assessing its performance.
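Beyond the aggregate loss and accuracy, I can also inspect individual predictions. The sketch below is my own addition: it assumes the standard RAVDESS emotion ordering for the zero-based labels and maps the model's softmax output back to emotion names.
import numpy as np
# Assumed RAVDESS emotion order for the zero-based labels used above.
emotion_names = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
# Predict on the held-out test slice (same reshaping as during training).
x_eval = np.expand_dims(ravdess_speech_data_array[training_samples + validation_samples:], -1)
y_eval = labels_categorical[training_samples + validation_samples:]
probabilities = LSTM_model.predict(x_eval)
predicted = np.argmax(probabilities, axis=1)
actual = np.argmax(y_eval, axis=1)
# Show the first few predictions next to the ground truth.
for p, a in zip(predicted[:5], actual[:5]):
    print(f'predicted: {emotion_names[p]:<10} actual: {emotion_names[a]}')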
import datetime as dt
# Retrieve loss and accuracy from the model evaluation history.
model_evaluation_loss, model_evaluation_accuracy = model_evaluation_history
# Define the date and time format.
date_time_format = '%Y_%m_%d_%H_%M_%S'
# Obtain the current date and time.
current_date_time_dt = dt.datetime.now()
# Convert the date and time to a string with the specified format.
current_date_time_string = dt.datetime.strftime(current_date_time_dt, date_time_format)
# Construct a unique file name based on date, time, loss, and accuracy.
model_file_name = f'LSTM_model_Date_Time_{current_date_time_string}___Loss_{model_evaluation_loss}___Accuracy_{model_evaluation_accuracy}.h5'
# Save the LSTM model with the generated file name.
LSTM_model.save(model_file_name)
After evaluating my LSTM model, I am saving it with a unique file name that includes the current date, time, loss, and accuracy. Let me break down the code:
I extract the loss and accuracy values obtained from the model evaluation history, which were calculated during the evaluation step.
I define the format in which I want to represent the date and time. In this case, it's a format that includes the year, month, day, hour, minute, and second.
I get the current date and time using the datetime.now() function from the datetime module.
I convert the obtained date and time into a string using the specified format.
I create a unique file name for the saved model by incorporating the current date, time, evaluation loss, and evaluation accuracy into the string. This ensures that each saved model has a distinct identifier.
Finally, I save the trained LSTM model using the generated unique file name. This step is essential for keeping track of model versions and understanding the performance of each model based on its evaluation results.
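Later on, the saved model can be restored and reused for inference. A minimal sketch, reusing the model_file_name generated above:
from tensorflow.keras.models import load_model
# Reload the model that was just saved (model_file_name is defined above).
restored_model = load_model(model_file_name)
# The restored model can be used exactly like the original one.
restored_model.summary()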
I have also developed a Streamlit one-click version of the Speech Emotion Recognition model, making it incredibly user-friendly. With this version, users can recognize the emotion in a speech recording simply by uploading an audio file and clicking a single button.
To explore the Streamlit version, click the button below: