/De_Duplicate

Data Deduplication for Data Cleaning

Primary LanguageC++GNU General Public License v3.0GPL-3.0

De_Duplicate: De-Duplicate the Same Objects in Datasets

Introduction

This package provides an algorithm de_duplicate to deduplicate the datasets where there are multiple exactly same data objects in it. Here we implement perfect hashing (a two-level universal hashing) for de-duplication. Now this package suppports two types of data: integer and real values.

Compilation

The package requires at least g++ with c++11 support. To download and compile the code, type:

$ git clone https://github.com/HuangQiang/De_Duplicate.git
$ cd De_Duplicate
$ g++ -std=c++11 -w -O3 -o dedup de_duplicate.cc 

Run Example

We provide a script run.sh to deduplicate datasets. A quick example of OptDigits.data is shown as follows:

./dedup 3823 64 OptDigits.data OptDigits

where 3823 is the cardinality of OptDigits.data and 64 is dimensionality of OptDigits.data.