The amount of available data is crucial for data-driven machine learning methods. Data is always valuable, but in some tasks it is almost like gold: those where data is scarce or very expensive to obtain, such as predictive maintenance, where faults are rare. In this context, a mechanism to generate synthetic data is very useful. While fields such as Computer Vision or Natural Language Processing have extensively explored synthetic data generation with promising results, other domains like time-series have received less attention. This work studies and analyzes different techniques for data augmentation in time-series for classification and regression problems. The proposed approach uses diffusion probabilistic models, which have recently achieved successful results in image processing, for data augmentation in time-series. It also proposes a set of meta-attributes to condition the data augmentation process. The results highlight the high utility of this methodology for creating synthetic data to train classification and regression models. To validate the results, six datasets from diverse domains were used, covering a range of input sizes and output types.
This repository contains all the source code needed to reproduce the experiments or review the results obtained (results folder).
After cloning this repository, create a virtual environment with conda or virtualenv and install the requirements:
pip install -r requirements
You are ready to start executing commands.
The structure of this repository is organized as follows:
d3a-ts/
│
├─ data/                        # Links to datasets used in this work
│
├─ results/                     # Results files, logs, and configuration of the experiments
│
└─ src/
    │
    ├─ d3a
    │   │
    │   ├─ meta.py              # Meta attribute generation
    │   │
    │   ├─ net.py               # Implementation of the diffusion and autoencoders for data augmentation
    │   │
    │   ├─ search.py            # Command to execute the experiments
    │   │
    │   └─ notebooks
    │       │
    │       ├─ Figures.ipynb          # Creation of some figures used in the paper
    │       │
    │       ├─ Results review.ipynb   # Creation of graph results included in the paper
    │       │
    │       └─ bayesiantests.py       # Code to execute the Bayesian tests
    │
    ├─ data                     # Code to load and create the data generations of each dataset
    │
    └─ nets                     # Network architectures used in the paper
To run the experiments, use the search.py command:
usage: search.py [-h] -d DIR -g {0,1}
optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory where found the params file
  -g {0,1}, --gpu {0,1}
                        GPU to use [0, 1]
The command requires a directory path where the params.json file is located, defining the parameters for the experiments. For instance:
python d3a/search.py -d ../results/pronostia/128/nr_cond_bilstm_experiment_1 -g 0
In this example, the command uses GPU 0 and expects to find the params.json file in the directory ../results/pronostia/128/nr_cond_bilstm_experiment_1.
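For readers who want to script around the command, the interface shown in the usage text can be approximated with `argparse`. This is a minimal sketch reconstructed only from the help output above, not the actual contents of search.py; the function name `build_parser` is an assumption made for illustration.

```python
import argparse


def build_parser():
    """Approximate the search.py command line from its printed usage text."""
    parser = argparse.ArgumentParser(prog="search.py")
    # Both options appear without brackets in the usage line, so they are required.
    parser.add_argument("-d", "--dir", required=True,
                        help="Directory where found the params file")
    parser.add_argument("-g", "--gpu", required=True, type=int, choices=[0, 1],
                        help="GPU to use [0, 1]")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"params directory: {args.dir}, GPU: {args.gpu}")
```

Restricting `--gpu` with `choices` mirrors the `{0,1}` shown in the help text, so any other value is rejected before an experiment starts.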
If the directory already contains result files, it will load the results without training the corresponding model. To train new models, create a new directory containing only the params.json file.
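Because existing result files cause the command to skip training, it can help to create each new experiment directory programmatically. The helper below is a sketch under that assumption; `new_experiment_dir` is a hypothetical name and is not part of this repository.

```python
import json
from pathlib import Path


def new_experiment_dir(path, params):
    """Create a fresh directory that contains only params.json.

    Raises FileExistsError if the directory already exists, so results
    from an earlier run are never reused by accident.
    """
    target = Path(path)
    target.mkdir(parents=True, exist_ok=False)
    params_file = target / "params.json"
    params_file.write_text(json.dumps(params, indent=4))
    return params_file
```

The returned path's parent directory can then be passed to search.py through the `-d` option.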
The params.json file follows the structure outlined below:
{
    "model": "[bilstm|mscnn]",
    "learning_rate": [float],
    "name": "[experiment name]",
    "package": "data.[data package name]",
    "generator": "[generator class name]",
    "save_memory": [true|false],
    "net_config": {
        "net": [dict],
        "noise_rates": [list]
    },
    "denoising_net": [dict],
    "s2a_net": [dict]
}
This file defines the architecture used to train the final classifier or regressor (model and net_config), as well as how to read the data (package and generator). It specifies the denoising model architecture (denoising_net) and the architecture of the network to estimate the attributes (s2a_net).
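A quick structural check before launching a long run can catch a misspelled or missing key. The sketch below takes its key names from the template above and assumes every key in that template is required, which this README does not state explicitly. The function `missing_keys` is hypothetical and does not exist in the repository.

```python
# Key names copied from the params.json template; treating all of them
# as mandatory is an assumption.
REQUIRED_KEYS = {
    "model", "learning_rate", "name", "package", "generator",
    "save_memory", "net_config", "denoising_net", "s2a_net",
}
NET_CONFIG_KEYS = {"net", "noise_rates"}


def missing_keys(params):
    """Return the template keys that are absent from a parsed params dict."""
    missing = sorted(REQUIRED_KEYS - params.keys())
    net_config = params.get("net_config", {})
    if not isinstance(net_config, dict):
        net_config = {}
    # Nested keys are reported with a dotted prefix.
    missing += sorted(f"net_config.{key}" for key in NET_CONFIG_KEYS - net_config.keys())
    return missing
```

An empty list means the file has the same shape as the template. The check does not validate the values themselves, such as the contents of the `net` dictionary.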
This work involves generating meta-attributes
To manage the complexity, the meta-attributes vectors
The model $\mathcal{A}_{\psi}$ is utilized during the training of the denoising model $\mathcal{H}_{\phi}$ to remove noise from a noisy sample. The architecture of
In the final step, the model
The graph illustrates the training process. $\mathcal{A}_{\psi}$ represents the network used to predict the meta-attributes vector $\overline{a}$ from the training raw data. $\mathcal{M}$ denotes the process that introduces normal noise to the raw samples, while $\mathcal{H}_{\phi}$ represents the denoising network responsible for generating the synthetic samples. The model
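As an illustration of the noising process $\mathcal{M}$, the sketch below adds zero-mean normal noise to a raw sample. The exact noising scheme is not specified in this README, so the additive form, the function name `add_normal_noise`, and the link between its `noise_rate` argument and the `noise_rates` entry of params.json are all assumptions.

```python
import numpy as np


def add_normal_noise(x, noise_rate, rng):
    """Return x plus zero-mean normal noise with standard deviation noise_rate."""
    noise = rng.normal(loc=0.0, scale=1.0, size=x.shape)
    return x + noise_rate * noise


rng = np.random.default_rng(0)
# Toy series used only for illustration; it does not come from any dataset in this work.
raw = np.sin(np.linspace(0.0, 2.0 * np.pi, 128))
noisy = add_normal_noise(raw, noise_rate=0.1, rng=rng)
```

In this simplified picture, a denoising network would be trained to recover `raw` from `noisy`, conditioned on the meta-attributes vector.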
The table compares the best mean results obtained by applying denoising conditioned data augmentation with the mean performance of models trained using raw data. The number in brackets refers to the number of denoising steps applied.
Dataset | Net | Raw | AE | DPM |
---|---|---|---|---|
ecg5k | bilstm | 0.5211 ± 0.1178 | 0.3609 ± 0.0157 [1] | 0.3235 ± 0.0085 [1] |
ecg5k | mscnn | 0.7591 ± 0.3981 | 0.5204 ± 0.0325 [3] | 0.4272 ± 0.0205 [1] |
human_activity | bilstm | 1.2370 ± 0.2486 | 1.1023 ± 0.0016 [1] | 1.0897 ± 0.0011 [2] |
human_activity | mscnn | 1.2764 ± 0.1213 | 1.2015 ± 0.0314 [1] | 1.2758 ± 0.0120 [2] |
ncmapss | bilstm | 252.9270 ± 13.9745 | 242.1542 ± 0.0000 [2] | 246.7824 ± 16.9870 [2] |
ncmapss | mscnn | 459.5337 ± 165.5178 | 324.1614 ± 16.9442 [2] | 266.0854 ± 13.5774 [1] |
pronostia | bilstm | 0.0720 ± 0.0063 | 0.0648 ± 0.0030 [3] | 0.0662 ± 0.0007 [1] |
pronostia | mscnn | 0.0614 ± 0.0044 | 0.0487 ± 0.0030 [1] | 0.0522 ± 0.0029 [3] |
shares | bilstm | 0.3480 ± 0.0266 | 0.3435 ± 0.0142 [3] | 0.2947 ± 0.0495 [1] |
shares | mscnn | 1.2505 ± 0.8296 | 0.3113 ± 0.0157 [3] | 0.2153 ± 0.0175 [3] |
wine | bilstm | 0.4097 ± 0.0203 | 0.3850 ± 0.0074 [3] | 0.3432 ± 0.0043 [1] |
wine | mscnn | 1.3288 ± 0.5098 | 0.3298 ± 0.0269 [2] | 0.4529 ± 0.0000 [2] |
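To put the table in relative terms, the sketch below computes how much the DPM column changes the mean score with respect to the raw baseline for three rows. It assumes the metric is an error-style score where lower is better, which the table itself does not state. The function name `relative_reduction` is chosen for illustration.

```python
# (dataset, net) -> (raw mean, DPM mean), copied from the table above.
ROWS = {
    ("ecg5k", "bilstm"): (0.5211, 0.3235),
    ("shares", "mscnn"): (1.2505, 0.2153),
    ("human_activity", "mscnn"): (1.2764, 1.2758),
}


def relative_reduction(raw, augmented):
    """Fraction by which the augmented score is lower than the raw score."""
    return (raw - augmented) / raw


for (dataset, net), (raw, dpm) in ROWS.items():
    print(f"{dataset:15s} {net:7s} {relative_reduction(raw, dpm):.1%}")
```

Under that assumption, the reduction is roughly 38% for ecg5k with bilstm and roughly 83% for shares with mscnn, while human_activity with mscnn is essentially unchanged.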
The Bayesian signed-rank test confirms that using Diffusion Probabilistic Models (DPM) conditioned on the meta-attributes proposed in this study is highly beneficial for data augmentation in time-series, for both classification and regression:
This work has been supported by Grant PID2019-109152GBI00/AEI/10.13039/501100011033 (Agencia Estatal de Investigacion), Spain, and by the Ministry of Science and Education of Spain through the national program "Ayudas para contratos para la formacion de investigadores en empresas (DIN2019)" of the State Programme of Science Research and Innovations 2017-2020.