Treesa Binu Sebastian

Nominated Award: Best Application of AI In a Student Project
LinkedIn of Person: https://www.linkedin.com/in/treesa-binu-sebastian-23aaab72/?originalSubdomain=ie
Student Profile:
Treesa Binu Sebastian is working as a Lead Automation Engineer with over 7 years of experience in test automation and 12 years in development. Treesa has a Bachelor's Degree in Computer Applications from IGNOU, Delhi, India. During her studies and working life Treesa has gained experience in the following technologies: Java, .NET (ASP, C#), VB, C, C++, TypeScript, JavaScript, VBScript, NodeJS, Selenium, Protractor, BDD (Cucumber), JSON, HTML, CSS, XML, jQuery, REST APIs, Appium, UFT, Perfecto, SQL Server, MySQL, Oracle, and ALM.
Treesa has been living in Ireland for several years and not only works full time at Tata Consultancy Services but is also a mum to a young family.
In March 2020 Treesa joined our Certificate in Foundations of Artificial Intelligence, which Krisolis deliver in partnership with Women in AI, TU Dublin and Technology Ireland ICT Skillnet. The programme is certified by TU Dublin and participants earn a 10 ECTS CPD Award at NFQ Level 8. The programme is delivered over 12 weeks with 5 hours of class each week, and students:
• Learn about AI and its real-world applications
• Learn to program in Python
• Develop machine learning solutions
• Create effective data visualisations
• Become versed in the ethical considerations of AI
• Apply all of the above in their own project, supported by one-to-one mentoring
Treesa came to the programme with no knowledge of AI and no experience with Python. Over the 12-week programme Treesa exceeded all of our expectations. She excelled at the core material such as Python, machine learning and data visualisation. Treesa had previous experience working with sound and sound waves and was interested in exploring this topic from an AI perspective. With this in mind, Treesa found a Kaggle competition where the task was to predict emotion from voice recordings. Although we did cover prediction models, this project required implementing some deep learning techniques that were not covered on the programme. Treesa was happy to work independently and went on to implement a deep learning solution. Given Treesa's starting point in terms of knowledge, the learning progression that she showed throughout the programme was exceptional, and it is best demonstrated in the project that she completed as part of the programme. It is for this project that we are recommending Treesa for the Best Application of AI in a Student Project award.
Reason for Nomination:
Here we will give details of the project:
Title: The Acoustics of Feeling: Emotion Detection Through Speech Analysis
Objective: Speech is one of the most natural ways for humans to express themselves, and feelings clearly play a significant role in people's decisions. My aim here is to analyze the acoustic characteristics of recorded audio data to identify the underlying emotions in recorded speech. A technology like this might be useful in a variety of contexts, including interactive voice-based assistants, customer service, education, forensics and medical analysis.
Solution: The four key phases of the speech emotion recognition (SER) system are as follows:
– The first step is to gather a library of voice samples.
– The data is then transformed into features.
– The next stage is to figure out which characteristics are most important in distinguishing each emotion.
– These characteristics are then fed into a machine learning classifier, which recognizes the emotions (a minimal sketch of the pipeline follows this list).
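For illustration, here is a minimal sketch of such a pipeline, assuming the librosa and scikit-learn libraries; the file paths and labels are hypothetical placeholders, MFCCs are one plausible feature choice, and SelectKBest stands in for whichever feature selection method is actually used.

import librosa
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def extract_features(path):
    # Phase 2: turn a voice sample into a fixed-length feature vector.
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # spectral-shape features
    return mfcc.mean(axis=1)                            # average over time

# Phase 1: a library of voice samples (hypothetical paths and labels).
paths = ["angry_01.wav", "happy_01.wav", "sad_01.wav", "angry_02.wav"]
labels = ["angry", "happy", "sad", "angry"]
X = np.array([extract_features(p) for p in paths])

# Phase 3: keep the characteristics that best separate the emotions.
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, labels)

# Phase 4: feed the selected characteristics into a classifier.
clf = SVC(kernel="poly").fit(X_selected, labels)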
Data Sample: The data for this research was gathered from four different, publicly available data sources, as stated below.
– The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): There are a total of 7356 files, recorded by 24 professional actors (12 female and 12 male) in a neutral North American dialect, portraying emotions such as calm, happy, sad, angry, fearful, surprise, and disgust.
– Toronto Emotional Speech Set (TESS): Two actresses, aged 26 and 64, recorded 2800 stimuli in total, indicating seven emotions: anger, disgust, fear, happiness, pleasant surprise, sad, and neutral.
– Surrey Audio-Visual Expressed Emotion (SAVEE): There are 480 total utterances, recorded by four British male actors aged 27 to 31, expressing six emotions: anger, disgust, fear, happiness, sadness and surprise.
– Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D): Contains 7,442 clips featuring 91 actors and actresses of various ages and ethnicities exhibiting six emotions: happy, sad, angry, fear, disgust, and neutral.
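As an illustration of how emotion labels can be read from these corpora, the sketch below parses RAVDESS filenames, whose third dash-separated field encodes the emotion according to the dataset's documentation; the other three corpora use their own naming schemes and would need similar but different parsing.

# RAVDESS emotion codes as documented for the dataset.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(filename):
    # e.g. "03-01-06-01-02-01-12.wav" -> "fearful" (third field is "06")
    code = filename.split("-")[2]
    return RAVDESS_EMOTIONS[code]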
Model: I experimented with different feature selection algorithms and observed that correlation-based feature selection gave the best results. The models were all implemented in Python and made use of specialized libraries that provide optimization techniques and evaluation metrics. The results of a grid search to find the best parameters are listed below (a sketch of the search setup follows the list).
– Best for SVM: C=0.001, gamma=0.001, kernel='poly'
– Best for RandomForestClassifier: {'max_depth': 7, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 40}
– Best for DecisionTreeClassifier: {'criterion': 'entropy', 'max_depth': 7, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
– Best for MLPClassifier: {'alpha': 0.005, 'batch_size': 256, 'hidden_layer_sizes': (300,), 'learning_rate': 'adaptive', 'max_iter': 500}
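A sketch of how such a search might be set up with scikit-learn's GridSearchCV is shown below; the feature matrix, labels, and candidate grids are stand-in assumptions, with the reported best SVM values included among the candidates.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in feature matrix and labels in place of the extracted features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
y_train = rng.choice(["angry", "happy", "sad"], size=200)

# Candidate parameters; the best reported values appear in the grid.
param_grid = {
    "C": [0.001, 0.01, 0.1, 1],
    "gamma": [0.001, 0.01, 0.1],
    "kernel": ["poly", "rbf"],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)  # e.g. {'C': 0.001, 'gamma': 0.001, 'kernel': 'poly'}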
I also explored the use of CNNs:
– 1D CNN: My CNN network was made up of three blocks, each consisting of a 1-dimensional convolution layer, a ReLU activation function, a 1-dimensional pooling layer, and a dropout layer. As we are dealing with a multi-class problem, the three blocks were followed by two fully connected dense layers and a softmax activation function.
– 2D CNN: TensorFlow-Keras was used to implement the CNN architecture. The neural network is initialized using the Sequential model. Conv2D is used to create the image-processing convolutional layers, and the pooling layers are added using MaxPooling2D. Flatten is the function that transforms the pooled feature map into a single column that is then passed to the fully connected layer, and Dense incorporates the fully connected layer into the neural network (illustrative sketches of both architectures follow).
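The sketches below show both architectures in TensorFlow-Keras; the input shapes, filter counts, and layer sizes are illustrative assumptions rather than the project's exact configuration.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Input, Conv1D, MaxPooling1D, Activation,
                                     Conv2D, MaxPooling2D, Dropout, Flatten,
                                     Dense)

n_features, n_emotions = 40, 8  # e.g. 40 MFCC coefficients, 8 emotion classes

# 1D CNN: three blocks of convolution, ReLU, pooling, and dropout,
# followed by two fully connected layers ending in softmax.
model_1d = Sequential([Input(shape=(n_features, 1))])
for filters in (64, 128, 256):
    model_1d.add(Conv1D(filters, kernel_size=5, padding="same"))
    model_1d.add(Activation("relu"))
    model_1d.add(MaxPooling1D(pool_size=2))
    model_1d.add(Dropout(0.2))
model_1d.add(Flatten())
model_1d.add(Dense(64, activation="relu"))
model_1d.add(Dense(n_emotions, activation="softmax"))
model_1d.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])

# 2D CNN: Conv2D and MaxPooling2D over spectrogram-style "images",
# Flatten to a single column, then Dense for the fully connected layer.
model_2d = Sequential([
    Input(shape=(128, 128, 1)),  # assumed spectrogram size
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(n_emotions, activation="softmax"),
])
model_2d.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])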
Conclusions: In terms of gender distribution, the number of female speakers was found to be somewhat higher than the number of male speakers, but the divergence was not significant enough to attract particular attention. After listening to a few sound files, I discovered that male and female emotions are expressed differently. Here are some of the findings:
– For male speakers, the Angry emotion is expressed simply by raising the loudness.
– For female speakers, the happy, angry, and sad emotions are all raised in volume.
– Female Disgust adds a retching ("throwing up") sound within the utterance.
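These loudness observations can be checked quantitatively; the sketch below, assuming librosa and a hypothetical file path, computes the frame-wise RMS energy of a clip as a simple loudness measure.

import librosa

y, sr = librosa.load("female_angry_01.wav", sr=None)  # hypothetical clip
rms = librosa.feature.rms(y=y)[0]                     # frame-wise RMS energy
print(f"mean loudness: {rms.mean():.4f}, peak: {rms.max():.4f}")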
After the initial analysis I aimed at a more complex model: classifying the different emotions based on gender. The distribution of the samples was more balanced than the one I had before; I applied the same processing to the data as before and ran the exact same model. As expected, this classification seems relatively balanced, and with fairly good results as well. The confusion matrix showed that the vast majority of samples were classified correctly.
Additional Information:
Conclusions Continued:
Following that, I trained separate models on the male and female data to investigate a benchmark for each. The male and female models behave differently, according to an analysis of the confusion matrices. Based on the observations from the EDA part, I believe female Angry and Happy are extremely prone to getting mixed up, because their expressive technique is simply raising the loudness of the voice.
– Male: In the male model, the dominant predicted classes are Angry and Happy, although they are unlikely to be confused with each other.
– Female: In the female model, Sad and Happy are the major predicted classes, while Angry and Happy are extremely likely to be confused.
Real-Time Prediction: A thorough examination of real-time prediction reveals the following:
– The model frequently gets mixed up between angry and disgusted.
– The model also became confused between the low-energy emotions of sad and neutral.
– When one or two words are pronounced louder than others, especially at the beginning or end of a phrase, it almost always indicates fear or surprise.
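A sketch of how such a confusion-matrix check can be produced with scikit-learn is shown below; the emotion list and the test labels and predictions are stand-ins for the project's actual held-out data and model output.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
y_test = ["angry", "happy", "sad", "angry"]        # stand-in true labels
y_pred = ["disgust", "happy", "neutral", "angry"]  # stand-in predictions

# Rows are true emotions, columns are predicted emotions.
cm = confusion_matrix(y_test, y_pred, labels=emotions)
ConfusionMatrixDisplay(cm, display_labels=emotions).plot(xticks_rotation=45)
plt.show()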
The following are some suggested measures that may be taken in future to make the models more stable and accurate.
– Utilizing an ensemble of lexical and acoustic models to approach SER with a lexical-features-based strategy. Because some emotional expression is contextual rather than verbal, this would increase the system's accuracy.
– Finding more annotated audio clips, or using augmentation techniques such as time-shifting or speeding up/slowing down the audio, to provide more data volume (a sketch of these techniques follows this list).
– Figuring out how to remove odd silences from the audio clips.
– Investigating an exact representation of speaking tempo to see whether it can address any of the model's flaws.
– Investigating additional acoustic characteristics of sound data to see whether they may be used in speech emotion identification.
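As a rough illustration of the augmentation and silence-removal suggestions, the sketch below uses librosa with a hypothetical clip; note that librosa's trim only removes leading and trailing silence, so internal silences would need further handling (e.g. librosa.effects.split).

import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=None)           # hypothetical clip

shifted = np.roll(y, int(0.1 * sr))                 # time-shift by 100 ms
faster = librosa.effects.time_stretch(y, rate=1.1)  # speed up by 10%
slower = librosa.effects.time_stretch(y, rate=0.9)  # slow down by 10%

trimmed, _ = librosa.effects.trim(y, top_db=30)     # drop edge silence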
_________________________________________________________
I hope the excerpts from Treesa's project submission give you a flavour of the exceptional level of skill that Treesa attained on our short 12-week programme. The programme was designed to introduce students to the fundamentals of AI with the hope that after 12 weeks they would be able to implement some basic machine learning models; never did we expect that a student with no Python skills and no previous machine learning knowledge would reach the skill level displayed by Treesa. I would also like to point out that our course is part-time, so Treesa completed this programme while also working full time and being a mum to young children, which I think makes her achievement even more impressive. I really feel that Treesa has delivered a wonderful project that is truly deserving of your consideration to be shortlisted for this award.