MMASD+: A Novel Dataset for Privacy-Preserving Behavior Analysis of Children with Autism Spectrum Disorder

This is the repository for MMASD+, an enhanced version of the original MMASD dataset. MMASD+ is a groundbreaking dataset and framework designed for the privacy-preserving analysis of behaviors in children with Autism Spectrum Disorder (ASD) during play therapy sessions. Unlike the original MMASD dataset, which lacked the ability to distinguish between therapists and children, MMASD+ addresses this limitation by providing clearer labeling and improved privacy measures, making it ideal for analyzing behaviors in children with ASD.

Dataset Overview

The MMASD+ dataset offers three types of data:

3D Skeleton Data: Joint coordinates captured in 3D space using the MediaPipe framework.
3D Body Mesh Data: Detailed 3D body mesh reconstructions using the Regression of Multiple People (ROMP) method.
Optical Flow Data: Motion information derived from video sequences using the Farneback algorithm.

The MMASD+ dataset covers a diverse array of 11 human actions. A detailed list of these action types, along with the number of distinct videos for each, is provided in Table 1.

🔢 S.NO	🎬 Action Class	🎥 No. Videos
1	Arm Swing	105
2	Body Swing	119
3	Chest Expansion	114
4	Drumming	168
5	Sing and Clap	113
6	Twist Pose	120
7	Tree Pose	129
8	Frog Pose	113
9	Squat Pose	101
10	Maracas Forward Shaking	103
11	Maracas Shaking	130

Table 1: The different actions performed during therapy sessions along with the number of videos available for each action class.

Privacy Features Exraction:

Before extracting privacy features, the raw video recordings undergo several pre-processing steps. First, individual persons are isolated from the video using a combination of YOLOv8 and Deep SORT algorithms. Subsequently, privacy features for each person, which preserve the essential movement characteristics, are extracted. The detailed steps of this feature extraction process are illustrated in Figure 1.

Flowchart of Data Processing (don't forget to change ROMP image)

Figure 1: Flowchart of Data Processing

Multimodal Tranformer Framework

To assess the effectiveness of the MMASD+ dataset, we evaluated it on two independent tasks: action classification and ASD identification. For these tasks, we tested various combinations of datasets and frameworks, as outlined in Table 2. The results for both action classification and ASD identification are presented in Table 3.

S.NO	Data Combination	Name	ML Frameworks
1	SCD	A1	LSTM
2	RVD	A2	CNN, ViViT
3	OFD	A3	CNN, ViViT
4	SCD + RV	A4	LSTM + CNN
5	SCD + OF	A5	LSTM + CNN
6	RVD + OFD	A6	CNN + CNN, ViViT + CNN
7	SCD + RVD + OFD	A7	LSTM + CNN + CNN, LSTM + ViViT + CNN

Table 2: Data Modalities and ML Frameworks

Notes:

SCD: 3D Skeleton Coordinate Data
RVD: ROMP Video Data
OFD: Optical Flow Data

S.No	Data-Model Combination	Action Classification Accuracy	Action Classification F1	ASD Classification Accuracy	ASD Classification F1
1	A1: SCD → LSTM	0.8237	0.81	0.8872	0.8870
2	A2: RVD → 3D-CNN	0.8247	0.8245	0.8709	0.8707
3	A2: RVD → ViViT	0.8976	0.8836	0.9431	0.93
4	A3: OFD → 3D-CNN	0.86	0.8558	0.9271	0.9269
5	A3: OFD → ViViT	0.9024	0.8924	0.9340	0.94
6	A4: SCD → LSTM, RVD → CNN	0.937	0.94	0.9434	0.9431
7	A5: SCD → LSTM, OFD → 3D-CNN	0.9678	0.956	0.9487	0.9484
8	A6: RVD → 3D-CNN, OFD → CNN	0.9671	0.96	0.9210	0.9220
9	A6: RVD → ViViT, OFD → CNN	0.9689	0.9687	0.9502	0.9511
10	A6: RVD → 3D-CNN, OFD → ViViT	0.9505	0.9502	0.9189	0.9185
11	A7: RVD → ViViT, OFD → 3D-CNN, SCD → LSTM	0.968	0.97	0.9531	0.9523
12	A7: RVD → 3D-CNN, OFD → ViViT, SCD → LSTM	0.9564	0.95	0.9364	0.9362

Table 3: Action Classification and ASD identification Results of Various Data-Model Combinations

From Table 3, we observed that the framework RVD → ViViT, OFD → CNN, SCD → LSTM achieved higher accuracy values for both tasks. Consequently, we employed this framework to perform action classification and ASD classification simultaneously within the same combination. Figure 2 illustrates the detailed architecture of the Multimodal Transformer Framework, which demonstrated performance exceeding 95% accuracy for both tasks. This highlights the robustness of the multimodal framework, along with the strength of the privacy-preserving dataset in capturing subtle body movement information.

Figure 2: Multimodal Transformer Framework

Code and Data

The complete code for each data-model combination is available in the code folder, and a dataset is provided in the google drive link https://drive.google.com/drive/folders/1_riwTPPj4UxNqEGdyXqpXzjU7EhlC8N1?usp=sharing.

The data folder is organized into three subfolders:

3D-Skeleton: Includes 3D joint coordinates in CSV format. Each row represents a frame, and each column corresponds to a joint.
ROMP: Contains 3D body mesh coordinates stored in .npy files.
Optical Flow: Contains dense optical flow data.

This is for example for checking

References