DALTON-Dataset

We present spatiotemporal measurements of air quality from 30 indoor sites over six months during the summer and winter seasons (89.1M samples, totaling 13646 hours of air quality data and 3957 activity annotations from 24 participants among 46 occupants). The sites are geographically located across four regions of type: rural, suburban, and urban, covering the typical low to middle-income population in India. The dataset contains various indoor environments (e.g., studio apartments, classrooms, research laboratories, food canteens, and residential households). Fig. 1 shows the overview of the data collection setup in a typical indoor environment. Our dataset provides the basis for data-driven learning model research aimed at coping with unique pollution patterns in developing countries.

Fig.1: Overview of the field study and data collection with multiple air quality monitors in a typical indoor setup.

Installation

To install the required packages in your python(>=3.11) environment you need to run the below commands:

git clone https://github.com/prasenjit52282/dalton-dataset.git
sudo apt-get update
sudo apt-get install make
cd dalton-dataset
pip install -r requirements.txt

Attributes

We have given comprehensive metadata for all the sensors and their placemant in Metadata folder. The collected air quality and other necessary attributes from each sensor is as shown below.

Parameters	Description
`ts`	Timestamp `yyyy/mm/dd HH:MM:SS` from the ESP32 MCU after reading sensor values
`T`	Temperature reading of the indoor environment in celsius at time ts
`H`	Humidity reading of the indoor environment in percentage at time ts
`PMS1`	Less than 1 micron dust particle readings in parts per million (ppm) at time ts
`PMS2_5`	Less than 2.5 micron dust particle readings in ppm at time ts
`PMS10`	Less than 10 micron dust particle readings in ppm at time ts
`CO2`	Carbon dioxide concentration in ppm at time ts
`NO2`	Nitrogen dioxide concentration in ppm at time ts
`CO`	Carbon monoxide concentration in ppm at time ts
`VoC`	Volatile organic compounds concentration in parts per billion (ppb) at time ts
`C2H5OH`	Ethyl alcohol concentration in ppb at time ts
`ID`	Unique identifier of the deployed sensor (e.g., `41`, `42`, etc.)
`Loc`	Location of the sensor in the indoor environment (e.g., `Kitchen`, `Bedroom`, etc.)
`Customer`	Participant name of the measurement site, replaced with `SiteID` to preserve privacy (`H1`-`H13`, `A1`-`A8`, `R1`-`R5`, `F1`, `F2`, `C1`, `C2`) as per Metadata/Site_wise_details.csv
`Ph`	Phone number of the customer for urgent contact, replaced with `XXXX` to preserve privacy

The raw activities and events (in total 3957 annotations) are stored in the Annotations.csv file of Metadata folder. As annotation may come from different occupants from the same site, we have given unique identifier to each participant (P1 - P46). Each annotation is comprised of the following values.

Parameters	Description
`ts`	Starting timestamp `yyyy/mm/dd HH:MM:SS` of the indoor event or activity
`Label`	Activity or event label (e.g., `Frying fish`, `AC off`, etc.) with detailed description (if possible)
`Site`	`SiteID` of the measurement site that matches with `Customer` in the sensor attributed table
`Customer`	Unique participant identifier (`P1`-`P46`) as per Metadata/Occupants.csv

The annotations can be associated with the sensor readings of any site to analyse the impact of indoor events and activities on the air pollution dynamics.

Dataset Preparation

Python scripts

Execute the following commands to preprocess the air quality measurements from raw csv files to the organised and cleaned dataset:

Merge Replicas for a Measurement Site

python merge_replicas.py --customer {SiteID}
Clean and Preprocess for a Measurement Site

python preprocess_data.py --customer {SiteID} --workers #cpus
Mark BreakPoints in the Data for a Measurement Site

python mark_breakpoints.py --customer {SiteID} --workers #cpus [--plot]

For convinence, we have provided the Makefile with the below commands to process the dataset from raw csvs (./Data folder) to processed csvs (./Processed folder). The repository contains all the processed files. However, the raw csvs can be downloaded and placed in the ./Data folder from Raw Data Files if needed.

make preprocess

Preprocessing Steps

Fig.2: Data preprocessing pipeline.

The dataset is cleaned and organised with the above proprocessing pipeline in Fig. 2. Three new columns are computed from the sensor readings as shown in the figure. The utility of the derived columns are as follows:

Valid : A binary (1/0) column that represents whether all the pollutant readings are within measurement range of the sensors and no sensor is faulty.
Valid_CO2 : A binary (1/0) column that represents whether CO2 sensor is working properly, as it frequently get impacted due to electrical surges in the indoor sites.
bkps : A binary column (1/0) that marks change-points in the data. The change-points (or also know as breakpoints) are computed with the Kernel change point detection (KLCPD) algorithm from the ruptures python package.

Each raw file is processed with the above pipeline and stored in the ./Processed folder. Note that the missing segments (> 15 mins) are replaced with zero values according to step(3 & 4) of the pipeline.

Fig.3: Annotation processing pipeline.

The raw annotation file Annotations.csv is cleaned and processed according to the pipeline shown in Fig. 3. The steps perform generic data cleaning and reformatting, anonymization, segregation of combined annotations, and spelling corrections to ensure the correctness and usability of the annotations. The cleaned annotations are available in the Annotations_cleaned.csv file of Metadata folder.

Note: Annotated food items are in local languages in some cases, based on the mother tongue of the annotator. Some english translations are {'bhindi':'ladies finger','dal':'lentils','posto':'poppy seeds','potol':'pointed gourd','roti':'flat bread','sag':'leafy vegetables', ...}

File Structure

The compressed file structure by combining similar file paths with placeholders (i.e., [Site],[ID_Loc], etc.) is shown below. To see the complete file structure please refer to the file_structure.txt file.

.
├── ./Assets
│   ├── ./Assets/Preprocess.png
│   ├── ./Assets/Preprocess_annot.png
│   └── ./Assets/system_diagram.png
├── ./Data                                                               /* Raw Dataset
│   ├── ./Data/A1
│   │   └── ./Data/A1/101_Study_Desk.csv
│   ├── ./Data/H1
│   │   ├── ./Data/H1/41_Kitchen.csv
│   │   ├── ./Data/H1/[ID_Loc].csv                                       /* Files
│   │   └── ./Data/H1/45_Parent_room.csv
│   └── ./Data/[Site]                                                    /* Directories
│       └── ./Data/[Site]/[ID_Loc].csv
├── ./Merged
│   ├── ./Merged/data_A1.csv
│   └── ./Merged/data_[Site].csv                                         
├── ./Processed                                                          /* Processed Dataset
│   ├── ./Processed/A1
│   │   ├── ./Processed/A1/2023_06_10
│   │   │   └── ./Processed/A1/2023_06_10/101_Study_Desk.csv
│   │   ├── ./Processed/A1/[Date]
│   │   │   └── ./Processed/A1/[Date]/[ID_Loc].csv                       
│   │   └── ./Processed/A1/2023_06_16
│   │       └── ./Processed/A1/2023_06_16/101_Study_Desk.csv
│   └── ./Processed/[Site]                                               
│       └── ./Processed/[Site]/[Date]
│           └── ./Processed/[Site]/[Date]/[ID_Loc].csv
├── ./Metadata                                                           /* Metadata
│   ├── ./Metadata/Annotations.csv
│   ├── ./Metadata/Annotations_cleaned.csv
│   ├── ./Metadata/Occupants.csv
│   └── ./Metadata/Site_wise_details.csv
├── ./library
│   ├── ./library/base_metrics.py
│   ├── ./library/breakpoints.py
│   ├── ./library/constants.py
│   ├── ./library/feat.py
│   ├── ./library/__init__.py
│   └── ./library/preprocess.py
├── ./merge_replicas.py
├── ./preprocess_data.py
├── ./mark_breakpoints.py
├── ./compute_feat.py
├── ./file_structure.txt
├── ./merge.sh
├── ./preprocess.sh
├── ./breakpoint.sh
├── ./features.sh
├── ./Makefile
├── ./LICENSE
├── ./README.md
└── ./requirements.txt

565 directories, 1458 files

Dataset Diversity

Site ID	#Dev	Site Area (sqft)	Floor Plan	#F/ #M	Duration (Hrs)	#Samples	Annot	Participants
H1	5	1100	✔️	1/1	772	11402870	✔️	P1 P2
H2	7	1100	✔️	2/2	469	8333689	✔️	P3 P4 P5 P6
H3	3	1000	✔️	1/1	463	4041058	✔️	P7 P8
H4	5	1200	✔️	1/1	2635	24021924	➖	P9 P10
H5	2	1200	✔️	1/1	2634	7395189	➖	P11 P12
H6	5	400	✔️	1/1	218	3188644	✔️	P13 P14
H7	2	400	➖	1/1	366	2306882	✔️	P15 P16
H8	5	1100	➖	2/1	570	8676832	✔️	P1 P17 P18
H9	2	300	➖	1/1	768	3894082	✔️	P19 P20
H10	2	600	➖	2/2	25	70554	➖	P21 P22 P23 P24
H11	2	600	➖	1/2	86	60098	➖	P25 P26 P27
H12	2	216	➖	1/1	178	1054696	✔️	P19 P20
H13	2	216	➖	1/1	127	269824	✔️	P19 P20
A1	1	150	➖	1/0	146	226888	✔️	P28
A2	1	150	➖	0/1	289	193557	➖	P29
A3	1	180	➖	0/1	344	1098827	✔️	P30
A4	1	150	➖	1/0	125	384975	➖	P31
A5	1	150	➖	1/0	1	77	✔️	P32
A6	1	100	➖	0/1	51	154398	✔️	P33
A7	1	150	➖	0/1	55	54741	✔️	P34
A8	1	150	➖	0/1	60	189141	➖	P35
R1	4	522	✔️	1/6	834	6203065	✔️	P36 P37 P38 P39 P40 P41 P42
R2	1	320	✔️	2/2	367	1161570	✔️	P43
R3	1	616	✔️	0/1	243	750745	✔️	P44
R4	4	522	✔️	➖	371	387195	➖	➖
R5	3	600	✔️	➖	179	1583750	➖	➖
F1	1	150	✔️	2/0	450	631193	➖	P46
F2	1	150	✔️	➖	450	631193	➖	➖
C1	1	500	➖	➖	333	590272	➖	➖
C2	1	500	➖	➖	53	158256	➖	➖

The above table summarizes the overall deployment, user participation, and data collection scale across 30 diverse sites spread across four geographic regions in India. The processed dataset is stored in the ./Processed folder. The corresponding activity annotations and metadata are stored in the ./Metadata folder of the repository. Notably, the raw data files can be downloaded from here.

License and Consent

The dataset is free to download and can be used with GNU Affero General Public License for non-commercial purposes. All participants signed forms consenting to the use of collected pollutant measurements and activity labels for non-commercial research purposes. The institute's ethical review committee has approved the field study (Order No: IIT/SRIC/DEAN/2023, Dated July 31, 2023). Moreover, we have made significant efforts to anonymize the participants to preserve privacy while providing the necessary information to encourage future research with the dataset.

Reference

To refer the DALTON-dataset, please cite the following work.

BibTex Reference:

@article{karmakar2024indoor,
  title={Indoor Air Quality Dataset with Activities of Daily Living in Low to Middle-income Communities},
  author={Karmakar, Prasenjit and Pradhan, Swadhin and Chakraborty, Sandip},
  journal={arXiv preprint arXiv:2407.14501},
  year={2024}
}

For questions and general feedback, contact Prasenjit Karmakar.