qcri/DeepBlocker

When I reproduce the AutoEncoder model using main.py on the structured Amazon-Google dataset, I get a recall of 94.4%. How do I get the 97.1% reported in the Amazon-Google row, DL column, of Table 6 in the paper?

SovereignLin opened this issue · 0 comments

    I ran main.py from https://github.com/saravanan-thirumuruganathan/DeepBlocker on the structured Amazon-Google dataset downloaded from https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md#fodors-zagats. With the AutoEncoder model I got a recall of 94.4%, but the paper reports 97.1% (Amazon-Google row, DL column, Table 6).
    I used 'wiki.en.bin' from https://fasttext.cc/docs/en/pretrained-vectors.html, and I changed the activation function from ReLU to Tanh, as described for the AutoEncoder in Section 3.4 of the paper.
    The configuration is:

FASTTEXT_EMBEDDIG_PATH = "embedding/wiki.en.bin"
#Dimension of the word embeddings.
EMB_DIMENSION_SIZE = 300
#Embedding size of AutoEncoder embedding
AE_EMB_DIMENSION_SIZE = 150
NUM_EPOCHS = 100
BATCH_SIZE = 256
RANDOM_SEED = 1234
LEARNING_RATE = 1e-3
K=50
The aggregator was SIF.
With this setup I cannot reproduce the 97.1% recall on the structured Amazon-Google dataset.
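For reference, here is a minimal sketch of how I understand the recall figure to be computed: the fraction of gold matching pairs that appear in the top-K candidate set. The function names (`top_k_candidates`, `blocking_recall`) and the toy IDs are hypothetical and are not taken from the DeepBlocker code. Which table is queried against which, and which gold match file is used, are assumptions that could by themselves change the number.

```python
def top_k_candidates(neighbors, k):
    """Build the candidate set from ranked neighbor lists.

    neighbors maps each left-table id to a list of right-table ids,
    sorted by decreasing similarity (hypothetical input format).
    """
    return {(left, right)
            for left, ranked in neighbors.items()
            for right in ranked[:k]}


def blocking_recall(candidate_pairs, gold_matches):
    """Fraction of gold matching pairs that survive blocking."""
    gold = set(gold_matches)
    if not gold:
        raise ValueError("gold_matches must not be empty")
    return len(gold & set(candidate_pairs)) / len(gold)


# Toy example: two left records, three right records.
neighbors = {0: [10, 11, 12], 1: [12, 10, 11]}
gold = [(0, 10), (1, 11)]

for k in (2, 3):
    recall = blocking_recall(top_k_candidates(neighbors, k), gold)
    print(f"K={k}: recall={recall:.3f}")
```

In my run K is 50, so a gold pair counts as recalled only if the right record is among the 50 nearest neighbors of the left record.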
In conclusion, I have a few questions:
1. Is the structured Amazon-Google dataset the raw, unprocessed dataset (4 attributes: title, description, manufacturer, price) or the processed dataset (3 attributes: title, manufacturer, price) at https://github.com/saravanan-thirumuruganathan/DeepBlocker? I assume a structured dataset cannot include the description attribute, but in that case the structured Amazon-Google dataset has only 3 attributes, whereas Table 4 in Section 5 of the paper lists 4.
2. In the AutoEncoder model, is the encoder's two-layer feed-forward network 300-300-150 and the decoder's two-layer feed-forward network 150-300-300?
3. Does anything else in the code need to change to reach the 97.1% recall on the structured Amazon-Google dataset?
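To make question 2 concrete, the sketch below spells out the layer shapes I have in mind when I write 300-300-150 and 150-300-300, using the two dimension constants from my configuration. The helper names (`linear_layer_shapes`, `parameter_count`) are hypothetical, and this is only my reading of the architecture, not something I have confirmed in the paper or the code.

```python
EMB_DIMENSION_SIZE = 300      # fastText word embedding size (from my config)
AE_EMB_DIMENSION_SIZE = 150   # AutoEncoder embedding size (from my config)


def linear_layer_shapes(dims):
    """Return (in_features, out_features) for each consecutive pair in dims."""
    return list(zip(dims[:-1], dims[1:]))


def parameter_count(dims):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in linear_layer_shapes(dims))


# Assumed reading: two linear layers each, with Tanh between them.
encoder_dims = [EMB_DIMENSION_SIZE, EMB_DIMENSION_SIZE, AE_EMB_DIMENSION_SIZE]
decoder_dims = [AE_EMB_DIMENSION_SIZE, EMB_DIMENSION_SIZE, EMB_DIMENSION_SIZE]

print("encoder layers:", linear_layer_shapes(encoder_dims))
print("decoder layers:", linear_layer_shapes(decoder_dims))
print("encoder parameters:", parameter_count(encoder_dims))
print("decoder parameters:", parameter_count(decoder_dims))
```

If the intended hidden width is different from 300, the second entry of `encoder_dims` and of `decoder_dims` is the only thing that would change.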