Early-Depression-Detection

I developed a depression detection system built on multi-modal data, integrating word context, audio, and video to predict whether a patient exhibits symptoms of depression (binary yes/no). The deep learning architecture involves feedforward highway layers for audio and video, dimensionality reduction using dense layers, concatenation, a (Bi)LSTM, and a final dense layer with sigmoid activation.

Early Mental Depression Detection System


Inspiration

In the contemporary landscape, the urgency for an accurate and accessible method to detect depression has never been more apparent. As society contends with increasing stressors, a growing segment of the population grapples with depressive tendencies. Recognizing the need for early intervention, our research is motivated by the imperative to develop a model that autonomously detects depression. We propose leveraging three essential modalities derived from clinical interviews to validate our model: physiological, speech, and visual cues. Extensive research underscores the intricate signs of depression, which are best captured through the simultaneous study of these three modalities. Changes in mental behaviour during depression are reflected in physiological and speech alterations, including stammering, uneven pauses, and altered pronunciation. The video modality adds a behavioural dimension, capturing abnormal eye contact, reduced mouth movement, and changed posture. Integrating lexical analysis provides valuable context, enriching our understanding of the subject's mental state.


Problem Statement

Depression whispers in many voices, often beyond the reach of words alone. It murmurs in speech hesitations, the fleeting shadows on a face, and the veiled emotions embedded in language. Traditional methods, relying solely on self-reported symptoms, are deaf to these whispers, leaving millions shrouded in silent suffering. **Our challenge**: to build a model that becomes a skilled translator, deciphering the multifaceted language of depression. We need to weave together a tapestry of audio, video, and text, listening to what's said, how it's said, and what the body reveals. This tapestry, richer and more nuanced, will paint a more accurate picture of a person’s mental state, revealing depression's hidden secrets before they take root. But this quest demands overcoming formidable obstacles.

1. Orchestrate the cacophony of data: aligning audio, video, and text in perfect harmony to capture the intricate correlations that whisper hidden truths.

2. Amplify the faintest signals: differentiating the subtle tremors in speech, the fleeting glances on video, and the veiled emotions within words from the background noise.

3. Personalize the diagnosis: crafting a model that adapts to the unique language of depression in each individual, transcending a one-size-fits-all approach.


Introduction

In today's fast-paced world, the shadow of depression looms large. Recognizing its presence before it casts deeper darkness is crucial, yet relying solely on self-reported symptoms might miss the intricate whispers it leaves behind. Our project embarks on a journey to unveil depression's secrets, not just through what individuals say but by listening to the symphony of signals their bodies whisper – in their voices, expressions, and words.

Conventionally, depression detection has been done through extensive clinical interviews, wherein the subject’s responses are studied by a psychologist to determine his or her mental state. Our model emulates this approach by fusing the three modalities, i.e. word context, audio, and video, and predicting an output regarding the patient's mental health. The output can be graded into levels to reflect the severity of the subject's depression. We’ve built a deep learning model that fuses these three modalities, assigning them appropriate weights, to produce that output.

This isn't just about identifying depression; it's about understanding its intricate language, unlocking a door to earlier intervention and more effective treatment. By listening to the whispers of the body and the echoes of the mind, we hope to paint a brighter future for those struggling in the shadows, where early detection becomes the first step towards reclaiming the light.


Approach

SYSTEM OVERVIEW

In our proposed system for depression detection, we initiate the process by extracting features from the audio, visual, and textual modalities. These features are integrated based on timestamps, emphasizing their time-dependent interactions. The pre-processing stage involves aligning features at the sentence level to capture contextual nuances between words.

GATING MECHANISM AND FEATURE CONTROL

Following feature extraction, a crucial step involves applying a gating mechanism to regulate the influence of different modalities on the final output. Weight vectors are introduced for each modality, enabling the system to learn and control the transformation of information carried forward. At each time step, feature vectors from each modality are concatenated and passed through a word-level LSTM (Long Short-Term Memory) equipped with the gating mechanism. Before concatenation, audio and visual vectors undergo additional gating mechanisms to extract essential information.
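
The sketch below illustrates this gating idea in Keras: each modality's feature vector is scaled by a learned sigmoid gate before concatenation and a word-level LSTM. The sequence length and per-modality feature sizes are placeholder assumptions, not the project's exact configuration.

```python
# Minimal sketch of the modality gating described above (assumed shapes).
import tensorflow as tf
from tensorflow.keras import layers, Model

T, AUDIO_DIM, VIDEO_DIM, TEXT_DIM = 50, 74, 388, 300   # placeholder dimensions

def modality_gate(x):
    # Learn a per-feature weight vector in [0, 1] and apply it elementwise,
    # controlling how much of this modality is carried forward.
    gate = layers.Dense(x.shape[-1], activation="sigmoid")(x)
    return layers.Multiply()([gate, x])

audio_in = layers.Input(shape=(T, AUDIO_DIM))
video_in = layers.Input(shape=(T, VIDEO_DIM))
text_in = layers.Input(shape=(T, TEXT_DIM))

# Audio and visual vectors pass through their gates before concatenation.
fused = layers.Concatenate()([modality_gate(audio_in),
                              modality_gate(video_in),
                              text_in])

hidden = layers.LSTM(128)(fused)                # word-level LSTM over the fused sequence
output = layers.Dense(1, activation="sigmoid")(hidden)

gated_model = Model([audio_in, video_in, text_in], output)
```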

HYBRID FUSION FOR ENHANCED PERFORMANCE

An alternative approach incorporates a hybrid fusion technique, strategically combining early and late fusion benefits. This can occur at one or two levels, with feature fusion initially creating a new modality. This newly formed modality is then treated as an additional individual modality, and its scores or decisions are fused with those of the original modalities. This hybrid fusion strategy aims to optimize information integration for improved model performance in depression detection.
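
As an illustration of the hybrid idea, the sketch below first forms a fused "modality" from the concatenated features (early fusion) and then combines its score with the per-modality scores (late fusion). The utterance-level feature dimensions and layer sizes are assumptions for illustration.

```python
# Illustrative hybrid fusion: early fusion forms a new "modality",
# whose score is then combined with the individual modality scores.
import tensorflow as tf
from tensorflow.keras import layers, Model

AUDIO_DIM, VIDEO_DIM, TEXT_DIM = 74, 388, 300   # placeholder, utterance-level

def branch_score(x):
    # Per-branch decision score in [0, 1].
    h = layers.Dense(64, activation="relu")(x)
    return layers.Dense(1, activation="sigmoid")(h)

audio_in = layers.Input(shape=(AUDIO_DIM,))
video_in = layers.Input(shape=(VIDEO_DIM,))
text_in = layers.Input(shape=(TEXT_DIM,))

# Early fusion: the concatenated features act as an additional modality.
early_fused = layers.Concatenate()([audio_in, video_in, text_in])

# Late fusion: merge the fused-modality score with the original scores.
scores = layers.Concatenate()([branch_score(audio_in),
                               branch_score(video_in),
                               branch_score(text_in),
                               branch_score(early_fused)])
final = layers.Dense(1, activation="sigmoid")(scores)

hybrid_model = Model([audio_in, video_in, text_in], final)
```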


Procedure

1️⃣ Pre-install all the required libraries

   1) nltk
   2) numpy
   3) pandas
   4) gensim
   5) gc
   6) keras
   7) smart_open
   8) sklearn
   9) matplotlib
  10) tensorflow 
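
If the environment is set up correctly, a quick import check such as the one below should run without errors (note that gc ships with the Python standard library and needs no separate install).

```python
# Sanity check that the required packages are available.
import gc                      # standard library, no install needed
import nltk
import numpy
import pandas
import gensim
import keras
import smart_open
import sklearn
import matplotlib
import tensorflow

print("TensorFlow", tensorflow.__version__)
```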

2️⃣ Understand the dataset

The Dataset was never downloaded locally due to its large size and other computational limitations. The code snippets in Dataset.ipynb were used to obtain the data from the DAIC server, unzip it, and arrange it in a manner we saw fit for easy implementation. The DAIC-WOZ dataset, curated by the University of Southern California, is a subset of the broader DAIC (Distress Analysis Interview Corpus) aimed at aiding the diagnosis of psychological distress conditions, including anxiety, depression, and PTSD. Comprising clinical interviews, the Dataset features audio and video recordings and extensive questionnaire responses. Also included are the Wizard-of-Oz interviews, conducted by an animated virtual assistant named Ellie, who is controlled by a human interviewer in a separate room. The data is meticulously transcribed and annotated, encompassing verbal and non-verbal features.
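
A hedged sketch of the kind of download-and-unzip loop used in Dataset.ipynb is shown below. The base URL is a placeholder, since access to the DAIC-WOZ server requires a signed data-use agreement, and the per-session archive naming is an assumption.

```python
# Illustrative fetch-and-extract loop for per-session archives.
import io
import zipfile
import urllib.request

BASE_URL = "https://<daic-woz-server>/"   # placeholder, not the real host

def fetch_session(session_id, dest="data"):
    """Download one session archive in memory and extract it under dest/."""
    url = f"{BASE_URL}{session_id}_P.zip"  # assumed naming convention
    with urllib.request.urlopen(url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extractall(f"{dest}/{session_id}_P")

for sid in (300, 301, 302):   # a few session IDs, for illustration
    fetch_session(sid)
```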

3️⃣ Data preprocessing

VIDEO MODALITY : The Dataset incorporates 388 features extracted from facial expressions, including 68 2D and 68 3D facial points, 24 AU features measuring facial activity, 16 features for gaze representation, and 10 features for pose representation.

AUDIO MODALITY : In the audio modality, features are sampled at 100Hz, offering insights every 10ms. Encompassing 12 Mel-frequency cepstral coefficients (MFCCs), pitch tracking features, and parameters like F0, VUV, NAQ, QOQ, H1H2, PSP, MDQ, peakSlope, Rd, Rdconf, MCEP0-24, HMPDM0-24, and HMPDD0-12, the audio features are comprehensive.

TEXT MODALITY : The textual modality consists of a conversation transcript in CSV format, timestamped at the sentence level and classified by the speaker. Expressions like laughter are annotated, and the Dataset comprises 189 sessions, varying from 7 to 33 minutes, featuring interviews with 59 depressed and 130 non-depressed subjects.
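
Below is a minimal sketch of the sentence-level alignment step for the audio stream, assuming the transcript carries per-utterance start/stop timestamps and that the 100Hz audio frames can be indexed by time. File names, the delimiter, and column names are assumptions.

```python
# Align 100 Hz audio frames to sentence-level transcript timestamps.
import numpy as np
import pandas as pd

transcript = pd.read_csv("300_TRANSCRIPT.csv", sep="\t")         # assumed columns: start_time, stop_time, speaker, value
audio = pd.read_csv("300_COVAREP.csv", header=None).to_numpy()   # assumed: one frame row every 10 ms, no header

def utterance_audio_features(row, hz=100):
    # Average the audio frames that fall inside this utterance.
    start = int(row["start_time"] * hz)
    stop = max(int(row["stop_time"] * hz), start + 1)   # guard against empty spans
    return audio[start:stop].mean(axis=0)

participant = transcript[transcript["speaker"] == "Participant"]
sentence_audio = np.stack([utterance_audio_features(r) for _, r in participant.iterrows()])
print(sentence_audio.shape)   # (num_sentences, num_audio_features)
```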

4️⃣ Build and train the model

The audio and video features are first passed through 3 feedforward highway layers. Dense layers are then used to reduce the dimensionality of the video and text features. After concatenation, a (Bi)LSTM with 128 hidden nodes is used. Finally, a dense layer with sigmoid activation produces the output. A learning rate of 0.0001 is used, and the number of epochs is controlled by the EarlyStopping callback from the Keras API.
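
A minimal Keras sketch of this architecture is given below. The highway-layer implementation, sequence length, and reduced dimensions are illustrative assumptions rather than the exact training code.

```python
# Sketch of the fusion model: highway layers for audio/video, dense
# dimensionality reduction, concatenation, BiLSTM(128), sigmoid output.
import tensorflow as tf
from tensorflow.keras import layers, callbacks, Model

T, AUDIO_DIM, VIDEO_DIM, TEXT_DIM = 50, 74, 388, 300   # placeholder shapes

def highway_stack(x, n_layers=3):
    # Feedforward highway layers: out = gate * transform(x) + (1 - gate) * x
    dim = x.shape[-1]
    for _ in range(n_layers):
        gate = layers.Dense(dim, activation="sigmoid")(x)
        transform = layers.Dense(dim, activation="relu")(x)
        x = gate * transform + (1.0 - gate) * x
    return x

audio_in = layers.Input(shape=(T, AUDIO_DIM))
video_in = layers.Input(shape=(T, VIDEO_DIM))
text_in = layers.Input(shape=(T, TEXT_DIM))

audio = highway_stack(audio_in)
video = highway_stack(video_in)

# Dense layers reduce the dimensionality of the video and text features.
video = layers.Dense(64, activation="relu")(video)
text = layers.Dense(64, activation="relu")(text_in)

fused = layers.Concatenate()([audio, video, text])
seq = layers.Bidirectional(layers.LSTM(128))(fused)
output = layers.Dense(1, activation="sigmoid")(seq)

model = Model([audio_in, video_in, text_in], output)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
# model.fit([X_audio, X_video, X_text], y, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])
```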

5️⃣ Train the model using Intel oneAPI to get better results

How does oneAPI provide better performance?

Today’s computer systems are heterogeneous and include CPUs, GPUs, FPGAs, and other accelerators. The different architectures exhibit varied characteristics that can be matched to specific workloads for the best performance. Having multiple types of compute architectures leads to different programming and optimization needs. oneAPI and SYCL provide a programming model, whether through direct programming or libraries, that can be utilized to develop software tailored to each of the architectures.

Advantages of using oneAPI:

  1. A single code base can target both CPUs and GPUs (heterogeneous computing).
  2. Machine-learning-based IoT projects can be implemented with less hardware, since the machine learning part runs in the cloud.
  3. Files are processed faster, i.e. the epochs take less time to run.
  4. oneAPI lets users work around hardware restrictions, providing better performance on low-powered computers.
  5. Accuracy can also improve when using oneAPI-optimized libraries.
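
One hedged route to oneAPI acceleration for this TensorFlow/scikit-learn stack is sketched below. Whether the oneDNN-optimized kernels or the Intel Extension for TensorFlow "XPU" device are available depends on how the environment was set up.

```python
# Enable oneDNN-optimized CPU kernels and check for Intel's pluggable device.
import os
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"   # must be set before importing TensorFlow

import tensorflow as tf
print(tf.config.list_physical_devices())    # an 'XPU' entry appears if
                                            # intel-extension-for-tensorflow is installed

# For scikit-learn preprocessing/metrics, Intel's extension patches the stock API:
# from sklearnex import patch_sklearn
# patch_sklearn()
```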

To migrate your project to oneAPI: click here! to get started

For reference: click here!

6️⃣ Save the model

   Save the trained model so that its accuracy and loss can be evaluated later.
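
A hedged sketch of saving and reloading the model to report accuracy and loss is shown below; the file name and the held-out arrays are placeholders (older TensorFlow versions may need the ".h5" format instead).

```python
# Save the trained model, reload it, and report accuracy and loss.
import tensorflow as tf

model.save("depression_model.keras")                       # 'model' from the training step

restored = tf.keras.models.load_model("depression_model.keras")
loss, accuracy = restored.evaluate(
    [X_audio_test, X_video_test, X_text_test], y_test)     # placeholder held-out data
print(f"loss={loss:.4f}  accuracy={accuracy:.4f}")
```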

Accuracy and Loss


Conclusion

In conclusion, our presented model offers a robust approach to detecting depression by leveraging audio, video, and lexical indicators. By implementing a sentence-level model with highway layers as a gating mechanism, our results indicate that this approach outperforms alternative models. Incorporating a hybrid fusion technique, combining early and late fusion, enhances the interpretability of each modality, contributing to the overall effectiveness of our model. There is potential for further refinement in feature extraction, focusing on additional audio parameters such as response time, pauses, and silence rate. Exploring the interaction between bodily action sequences from motion capture data and verbal behaviour could provide a more comprehensive understanding of depressive symptoms. Our model lays the foundation for future advancements, suggesting avenues for in-depth exploration and improvement in depression detection methodologies.