PAACDA-Comprehensive-Data-Corruption-Detection-Algorithm

Link to the journal paper for this codebase: https://ieeexplore.ieee.org/document/10058962

Primary language: Jupyter Notebook



This repository was developed as part of the research for a journal paper on the importance of identifying corrupted data in datasets for effective data analysis and processing with machine learning algorithms. The authors introduce a new algorithm, PAACDA (Proximity based Adamic Adar Corruption Detection Algorithm), for detecting corrupted data in linear and clustered datasets. According to the paper, it outperforms the other benchmarked methods with high accuracy. The paper also highlights the limitations of current techniques and suggests avenues for future research in this area.
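As background for the algorithm's name, the classical Adamic-Adar index from link prediction scores a pair of nodes by summing 1/log(degree) over their common neighbours. The sketch below shows only that classical index on a toy graph; it is not the PAACDA implementation, and how PAACDA adapts the idea to proximity between data points is described in the paper. The function names and the example edges are illustrative assumptions.

```python
import math


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def adamic_adar(graph, u, v):
    """Classical Adamic-Adar index: sum of 1 / log(degree(z)) over common neighbours z."""
    common = graph.get(u, set()) & graph.get(v, set())
    # A common neighbour of two distinct nodes has degree >= 2, so log(degree) > 0.
    # The guard only protects against degenerate input such as u == v.
    return sum(1.0 / math.log(len(graph[z])) for z in common if len(graph[z]) > 1)


# Toy graph (hypothetical): "c" and "d" are both neighbours of "a" and "b".
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("d", "e")]
graph = build_graph(edges)
score = adamic_adar(graph, "a", "b")  # 1/log(2) + 1/log(3)
```

Common neighbours with few connections contribute more to the score than highly connected ones, which is the weighting idea the index is known for.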

Eighteen datasets of varying sizes and corruption levels were used to demonstrate the impact of corruption on 16 different models alongside our proposed Proximity based Adamic Adar Corruption Detection Algorithm (PAACDA). Some key results comparing PAACDA with other state-of-the-art algorithms are presented below.

Clustered Data

Synthetically generated dataset link:

https://www.dropbox.com/s/w9i9lqev4xqn5tj/FinalCorruptedDataset.csv?dl=0 
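To inspect the dataset from a script rather than a browser, something like the following may work. It is a minimal sketch that assumes the shared link is still live, that Dropbox serves the raw file when the `dl` query parameter is set to `1`, and that the CSV is UTF-8 encoded. The helper names `direct_download_url` and `load_rows` are hypothetical, and no column names are assumed.

```python
import csv
import io
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def direct_download_url(share_url):
    """Return the share URL with its dl query parameter set to 1 (direct download)."""
    parts = urlsplit(share_url)
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


def load_rows(share_url, limit=5):
    """Fetch the CSV and return its first `limit` rows as lists of strings."""
    with urllib.request.urlopen(direct_download_url(share_url)) as resp:
        # Assumption: the file is UTF-8 text.
        text = io.TextIOWrapper(resp, encoding="utf-8", newline="")
        reader = csv.reader(text)
        return [row for _, row in zip(range(limit), reader)]


CLUSTERED_URL = "https://www.dropbox.com/s/w9i9lqev4xqn5tj/FinalCorruptedDataset.csv?dl=0"
download_url = direct_download_url(CLUSTERED_URL)
# rows = load_rows(CLUSTERED_URL)  # requires network access
```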

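1) Small: corruption 20%, dataset size 10,000

[Figure: results for the small clustered dataset]

2) Medium: corruption 40%, dataset size 40,000

[Figure: results for the medium clustered dataset]

3) Medium-Large: corruption 60%, dataset size 75,000

[Figure: results for the medium-large clustered dataset]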

Linear Data

Synthetically generated dataset link:

https://www.dropbox.com/s/209o3gq0w134w1f/FinalCorruptedDataset2.csv?dl=0

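1) Small: corruption 20%, dataset size 10,000

[Figure: results for the small linear dataset]

2) Medium: corruption 40%, dataset size 40,000

[Figure: results for the medium linear dataset]

3) Medium-Large: corruption 60%, dataset size 75,000

[Figure: results for the medium-large linear dataset]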
