allegro/allRank

Results on MSNWEB30K

CosimoRulli opened this issue · 8 comments

Hello developers,
I am trying to reproduce your results with the self-attentive model on WEB30K. In particular, I am interested in the NDCG@10 of 0.5431 that you obtain with the ordinal loss (Table 3 of your paper).
This is my config.json, based on the hyperparameters specified in the article:

    "model": {
        "fc_model": {
            "sizes": [144],
            "input_norm": "True",
            "activation": null,
            "dropout": 0.0
        },
        "transformer": {
            "N": 4,
            "d_ff": 512,
            "h": 2,
            "positional_encoding": null,
            "dropout": 0.4
        },
        "post_model": {
            "output_activation": "Sigmoid",
            "d_output": 4
        }
    },
    "data": {
        "path": "./Fold1/",
        "validation_ds_role": "vali",
        "num_workers": 16,
        "batch_size": 64,
        "slate_length": 240
    },
    "optimizer": {
        "name": "Adam",
        "args": {
            "lr": 0.001
        }
    },
    "lr_scheduler": {
        "name": "StepLR",
        "args": {
            "step_size": 50
        }
    },
    "training": {
        "epochs": 100,
        "gradient_clipping_norm": [],
        "early_stopping_patience": 10
    },
    "metrics": ["ndcg_10"],
    "loss": {
        "name": "ordinal",
        "args": {
            "n": 4
        }
    },
    "val_metric": "ndcg_10",
    "detect_anomaly": "True"
}

The best result I can get is an NDCG@10 of 0.4307 on the validation set. What am I missing?

Hello,
Have you tried standardising the input features first?

Hi, thank you for the answer and sorry for the late reply. By standardizing the features I got closer to your results, with an NDCG@10 of 0.5388 on the test set. The results in the paper are still slightly higher. Maybe my architecture differs from yours. Did you use layer normalization in the input layer?

You should be able to replicate the result if you turn off the normalization at the input layer and turn off early stopping (e.g. by setting the patience to 100). If there are any further problems, we will investigate.

Also, please remember that the reported results are the average (± std. dev.) over 5 folds.
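
Relative to the config posted above, those two changes would look roughly like the excerpt below (assuming the config parser accepts a plain JSON boolean for input_norm rather than the string "True" used above):

    "fc_model": {
        "sizes": [144],
        "input_norm": false,
        "activation": null,
        "dropout": 0.0
    },

    "training": {
        "epochs": 100,
        "gradient_clipping_norm": [],
        "early_stopping_patience": 100
    },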

Hi, I finally got an NDCG@10 of 0.5208 on Fold1 of WEB30K by removing the early stopping criterion. I think this result matches the one in your paper.

There was one more issue with reproducibility. In our experiments we used an internal allRank fork and missed one detail in the GitHub version: the "filler" NDCG value used when a list contains no relevant items. LightGBM and XGBoost (AFAIK) use 1.0, so we used this value in our experiments. However, the released code contained a filler NDCG of 0.0.

The filler NDCG has been changed to 1.0, as of version 1.4.1. We are also working on a WEB30K reproducibility guide for both papers.
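
To illustrate what the filler controls, here is a minimal sketch (not the actual allRank code): when every document in a list has label 0, the ideal DCG is 0, NDCG is undefined, and a fixed filler value is reported instead.

    import numpy as np

    def dcg_at_k(labels, k):
        # Discounted cumulative gain over the top-k positions of an already-ranked list.
        rel = np.asarray(labels, dtype=float)[:k]
        gains = 2.0 ** rel - 1.0
        discounts = np.log2(np.arange(2, rel.size + 2))
        return float(np.sum(gains / discounts))

    def ndcg_at_k(labels_in_ranked_order, k, filler=1.0):
        # When all labels are 0 the ideal DCG is 0 and NDCG is undefined;
        # LightGBM/XGBoost report 1.0 here, while older allRank releases used 0.0.
        ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
        if ideal == 0.0:
            return filler
        return dcg_at_k(labels_in_ranked_order, k) / ideal

    # A slate with no relevant documents:
    print(ndcg_at_k([0, 0, 0, 0], k=10, filler=1.0))  # 1.0
    print(ndcg_at_k([0, 0, 0, 0], k=10, filler=0.0))  # 0.0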

Thank you for your answer. In any case, I did change the filler NDCG to 1.0 during my experiments, so my results should be comparable with the ones in your paper. Can you share your NDCG@10 score on Fold1 of WEB30K?

Excuse me, do I need to implement the feature standardization myself?

I think so, or at least that is what I did: I standardized the features offline before launching the training script.
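
For anyone else wondering, here is a minimal sketch of that offline step, assuming the usual libsvm-format train/vali/test files for a fold (the file names are just placeholders; this is not part of allRank itself):

    import numpy as np
    from sklearn.datasets import load_svmlight_file, dump_svmlight_file
    from sklearn.preprocessing import StandardScaler

    def standardize_split(path, scaler, fit=False):
        # query_id=True keeps the qid column that listwise training needs.
        X, y, qid = load_svmlight_file(path, query_id=True)
        # WEB30K has 136 features; densifying one split takes a few GB of RAM.
        X = np.asarray(X.todense())
        X = scaler.fit_transform(X) if fit else scaler.transform(X)
        dump_svmlight_file(X, y, path + ".std", query_id=qid)

    scaler = StandardScaler()
    # One reasonable choice: fit the scaler on the training split only,
    # then reuse its statistics for the validation and test splits.
    standardize_split("Fold1/train.txt", scaler, fit=True)
    standardize_split("Fold1/vali.txt", scaler)
    standardize_split("Fold1/test.txt", scaler)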