
Duplicate Detection with GenAI

Resources and notebooks to accompany the 'Duplicate Detection with GenAI' paper available here.

Musicbrainz Datasets

The examples here use the publicly available Musicbrainz 200K dataset, which can be found on the Leipzig University benchmark datasets for entity resolution webpage.

The dataset itself can be downloaded using the following links:

Requirements

To run the example notebooks you will need to install the packages listed in the requirements.txt file, as well as the FAISS package. If you're using conda, you can install FAISS like this:

To install the GPU version:

conda install -c conda-forge faiss-gpu

To install the CPU version:

conda install -c conda-forge faiss-cpu
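
Once installed, a quick check along these lines (a minimal sketch, with purely illustrative index parameters) confirms that FAISS imports and searches correctly:

import faiss
import numpy as np

# Build a small exact (flat) L2 index over random vectors to confirm FAISS works.
dim = 64
vectors = np.random.random((1000, dim)).astype("float32")
index = faiss.IndexFlatL2(dim)
index.add(vectors)

# Query the first 5 vectors against the index; each should return itself as the top hit.
distances, neighbours = index.search(vectors[:5], 3)
print(neighbours)

# With faiss-gpu installed, this reports the number of visible GPUs.
# print(faiss.get_num_gpus())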

Example Notebooks

There are a number of notebooks covering the following:

  • Steps of the proposed method
  • Visualisation of embeddings
  • Visualisation of match groups
  • Evaluation of experimental results
  • Presentation of experimental results

The proposed method consists of the following steps:

  1. Create Embedding Vectors using "Match Sentences"
  2. Create Faiss Index and identify potential duplicate candidate cluster groups

Step 1. Create Embedding Vectors using Match Sentences

This is in the following notebook:
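
As a rough illustration of what this step involves, the sketch below builds a 'match sentence' for each record and embeds it with sentence-transformers. The file name, the column names, and the embedding model are illustrative assumptions; the notebook documents the exact choices used in the paper.

import pandas as pd
from sentence_transformers import SentenceTransformer

# Hypothetical file and column names; the Musicbrainz 200K schema differs in detail.
df = pd.read_csv("musicbrainz_200k.csv")

def to_match_sentence(row):
    # Concatenate the descriptive fields into a single natural-language style string.
    return f"{row['title']} by {row['artist']} from the album {row['album']}, released in {row['year']}"

sentences = df.apply(to_match_sentence, axis=1).tolist()

# Encode every match sentence into a dense embedding vector.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not necessarily the paper's
embeddings = model.encode(sentences, show_progress_bar=True)
print(embeddings.shape)  # (number of records, embedding dimension)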

Step 2. Create Faiss Index and Clusters

Use the following notebook to create a Faiss Index and identify potential duplicate candidate cluster groups.
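
In outline, the indexing and candidate search look something like the sketch below, which reuses the embeddings from the Step 1 sketch. The index type, the value of k, and the similarity threshold are illustrative assumptions rather than the settings used in the paper.

import faiss
import numpy as np

# 'embeddings' is the array produced in the Step 1 sketch (num_records x dim).
embeddings = np.asarray(embeddings, dtype="float32")
dim = embeddings.shape[1]

# Normalise so that inner-product search corresponds to cosine similarity.
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(dim)
index.add(embeddings)

# For each record, retrieve its k nearest neighbours as duplicate candidates.
k = 10
similarities, neighbours = index.search(embeddings, k)

# Keep neighbour pairs above a similarity threshold as candidate matches;
# linking these pairs (e.g. via connected components) yields the candidate cluster groups.
threshold = 0.9  # illustrative value, not tuned
candidate_pairs = [
    (i, int(j))
    for i, (sims, nbrs) in enumerate(zip(similarities, neighbours))
    for sim, j in zip(sims, nbrs)
    if j != i and sim >= threshold
]
print(len(candidate_pairs))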

Visualise Embeddings

This is in the following notebook:
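
The notebook contains the actual plots; purely as an illustration, a 2D projection of the embeddings can be produced along these lines, using scikit-learn's t-SNE on a sample (an assumption, not necessarily the projection used in the notebook):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project a sample of the embeddings down to 2D for plotting
# (t-SNE over the full 200K records would be slow).
sample = embeddings[:5000]
coords = TSNE(n_components=2, random_state=42).fit_transform(sample)

plt.figure(figsize=(8, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=2, alpha=0.5)
plt.title("2D projection of match-sentence embeddings")
plt.show()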

Visualise Match Groups

Only the match groups for the Musicbrainz 200K dataset are visualised, since that is the dataset we ran our main experiments on.
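
One way to picture match groups, sketched below under the assumption that the candidate pairs from the Step 2 sketch are available, is to treat them as edges of a graph and plot the connected components:

import networkx as nx
import matplotlib.pyplot as plt

# Edges are the candidate pairs found in Step 2; connected components are the match groups.
graph = nx.Graph()
graph.add_edges_from(candidate_pairs)

# Plot a handful of the larger groups.
groups = sorted(nx.connected_components(graph), key=len, reverse=True)[:5]
subgraph = graph.subgraph(set().union(*groups))
nx.draw(subgraph, node_size=20, with_labels=False)
plt.show()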

Evaluate Experimental Results

This is in the following notebook:
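
The notebook performs the full evaluation; the essence is a pairwise comparison against the gold standard, roughly as in the sketch below. The gold-standard file and column names are assumptions, and it is assumed the predicted candidate pairs have already been mapped back to the same record identifiers.

import pandas as pd

# Hypothetical file and column names for the gold standard of true duplicate pairs.
gold = pd.read_csv("musicbrainz_200k_goldstandard.csv")
true_pairs = {tuple(sorted(p)) for p in zip(gold["id1"], gold["id2"])}
predicted_pairs = {tuple(sorted(p)) for p in candidate_pairs}

# Standard pairwise precision, recall and F1.
tp = len(predicted_pairs & true_pairs)
precision = tp / len(predicted_pairs) if predicted_pairs else 0.0
recall = tp / len(true_pairs) if true_pairs else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")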

Show Experimental Results

This is in the following notebook:

DuDe Resources

The Duplicate Detection (DuDe) toolkit is a great resource for testing different duplicate detection strategies across datasets:

https://hpi.de/naumann/projects/data-integration-data-quality-and-data-cleansing/dude.html

To get DuDe working with the Musicbrainz 200K dataset I had to transform the data into a format it could process. I have provided two notebooks that do this: they create a formatted CSV file to use as the dataset input, together with a 'goldstandard' dataset that DuDe uses to generate statistics showing how well your strategies are working:
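
In outline, the transformation looks like the sketch below. The file names are placeholders, and the 'TID' (record id) and 'CID' (cluster id) column names follow the Leipzig benchmark layout, so check them against your copy of the dataset; the notebooks contain the exact steps I used.

import pandas as pd
from itertools import combinations

df = pd.read_csv("musicbrainz_200k.csv")

# Write the records out as a flat CSV that DuDe can read as its dataset input.
df.to_csv("musicbrainz_200k_dude.csv", index=False)

# Build the gold standard: every pair of records sharing a CID refers to the same entity.
pairs = []
for _, group in df.groupby("CID"):
    pairs.extend(combinations(group["TID"], 2))

pd.DataFrame(pairs, columns=["id1", "id2"]).to_csv(
    "musicbrainz_200k_goldstandard.csv", index=False
)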