EXTRACTING EMOTIONS USING TEXT AND SPEECH
Authors: SUBASH S and NALAVARASU N

Abstract: This project takes a multimodal approach to emotion detection, combining text and audio inputs for more accurate recognition of human emotions. The text-based emotion detection model is built on the BERT architecture and trained on the GoEmotions dataset, which is designed for fine-grained emotion classification. In parallel, the audio model uses a pre-trained Wav2Vec 2.0 model to extract emotions from speech. By combining evidence from both text and audio, the system predicts emotions more reliably than single-modality models.

To integrate the two modalities, the project implements a probability-based fusion method: the system obtains emotion predictions from both the text and audio models, computes the likelihood of each emotion under each modality, and combines these probabilities into a final prediction. This mitigates errors that would arise if either the text or audio model alone produced an incorrect prediction.

The project also addresses real-time performance, using the Whisper API for efficient audio-to-text transcription; the transcript further refines the emotion predictions. The final goal is to optimize performance through fine-tuning, adjusting parameters such as the number of training epochs and regularization techniques to achieve the best accuracy.
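The probability-based fusion step lends itself to a short illustration. The sketch below is one minimal reading of the description above: it assumes both models emit a probability distribution over a shared emotion label set and combines the two distributions by a weighted average (late fusion). The label list, weight, and helper name are illustrative choices, not details taken from the project.

```python
import numpy as np

# Illustrative emotion label set; the project's actual label space follows
# the GoEmotions taxonomy used to fine-tune the text model.
EMOTIONS = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]

def fuse_probabilities(text_probs, audio_probs, text_weight=0.5):
    """Combine per-emotion probabilities from the text and audio models.

    Both inputs are arrays of shape (num_emotions,) over the same label
    set. A weighted average (late fusion) is used here as one possible
    realisation of the probability-based fusion described in the abstract.
    """
    text_probs = np.asarray(text_probs, dtype=float)
    audio_probs = np.asarray(audio_probs, dtype=float)
    fused = text_weight * text_probs + (1.0 - text_weight) * audio_probs
    return fused / fused.sum()  # re-normalize so the result sums to 1

# Example: the text model is confident about "joy" while the audio model
# leans towards "surprise"; fusion tempers a single-modality mistake.
text_p  = [0.05, 0.70, 0.05, 0.05, 0.10, 0.05]
audio_p = [0.05, 0.30, 0.05, 0.05, 0.50, 0.05]
fused = fuse_probabilities(text_p, audio_p)
print(EMOTIONS[int(np.argmax(fused))])  # -> joy
```

A product rule (multiplying the two distributions element-wise and re-normalizing) would also fit the description and more strongly penalizes emotions that either modality considers unlikely.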
Key Words: Multimodal emotion detection, Whisper API, GoEmotions dataset, BERT, Wav2Vec 2.0, Speech-to-text transcription
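The end-to-end flow in the abstract (Whisper transcription, BERT text classification, Wav2Vec 2.0 audio classification, then fusion) could be wired together roughly as below. This is a sketch under stated assumptions: the checkpoint names `my-goemotions-bert` and `my-wav2vec2-emotion` are hypothetical placeholders for the project's fine-tuned models, `fuse_probabilities` refers to the helper sketched above, and both classifiers are assumed to share one emotion label set.

```python
from openai import OpenAI
from transformers import pipeline

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical checkpoint names; the project fine-tunes its own
# BERT (GoEmotions) and Wav2Vec 2.0 emotion classifiers.
text_clf = pipeline("text-classification", model="my-goemotions-bert")
audio_clf = pipeline("audio-classification", model="my-wav2vec2-emotion")

def predict_emotion(wav_path):
    # 1) Speech-to-text with the Whisper API.
    with open(wav_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2) Per-emotion scores from each modality, as {label: score} maps.
    #    top_k=None / a large top_k requests a score for every label.
    text_scores = {d["label"]: d["score"] for d in text_clf(transcript, top_k=None)}
    audio_scores = {d["label"]: d["score"] for d in audio_clf(wav_path, top_k=30)}

    # 3) Fuse the two distributions over the shared label set.
    labels = sorted(text_scores)
    fused = fuse_probabilities(
        [text_scores.get(label, 0.0) for label in labels],
        [audio_scores.get(label, 0.0) for label in labels],
    )
    return labels[int(fused.argmax())], transcript
```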
Published On: 2024-12-04