mimbres/neural-audio-fp

Reported train, val, test split sizes vs actual FMA size

raraz15 opened this issue · 3 comments

First of all, thank you very much for this work; having access to the repository gave me a jump start in the field.

After reading the paper, downloading the dataset from IEEE DataPort, and running some experiments, I noticed an issue.

  • The paper reports that the train, val, test_query, and test_dummy sets are disjoint, with respective sizes of 10,000, 500, 500, and 100,000. Taking the union of these disjoint sets gives a total of 111,000 tracks. If the NAFP dataset is a subset of the FMA dataset, this cannot be true, since the FMA dataset (fma_full) contains only 106,574 tracks. You can verify this at the FMA GitHub repository.
  • In fact, the neural-audio-fp-dataset/music/test-dummy-db-100k-full/fma_full directory downloaded from IEEE DataPort contains 93,458 wav tracks. This can be verified by running find . -type f -name "*.wav" | wc -l from that directory.
  • Consequently, running the checksum does not match the value provided in this repo.
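For reference, the count in the second bullet can also be reproduced in Python. This is just a sketch equivalent to the find command above; the directory path shown is the IEEE DataPort layout and should be adjusted to your local download:

```python
from pathlib import Path

def count_wavs(root):
    """Count .wav files under root, equivalent to:
    find . -type f -name "*.wav" | wc -l
    """
    return sum(1 for p in Path(root).rglob("*.wav") if p.is_file())

# e.g. count_wavs("neural-audio-fp-dataset/music/test-dummy-db-100k-full/fma_full")
# was reported above to give 93,458, i.e. 6,542 short of 100,000
```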

Could you clarify this, please?

I think this may be the reason why the metrics reported in the paper and the metrics we obtain by training an Adam N=120 model, or by evaluating the provided 640_lamb model, are inconsistent. See the related issue.

I did some experiments to find these 6,542 tracks and will report them in the corresponding issue.
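One way to run that kind of experiment is to list the track IDs present on disk and diff them against the IDs expected from the FMA metadata (fma_full file names are zero-padded track IDs). The helper below is a hypothetical sketch of this approach, not the exact procedure used:

```python
from pathlib import Path

def missing_track_ids(dataset_dir, expected_ids):
    """Return expected track IDs that have no matching .wav under dataset_dir.

    Assumes files are named '<track_id>.wav' (e.g. '000002.wav'),
    possibly nested in subdirectories, as in the FMA layout.
    """
    present = {p.stem for p in Path(dataset_dir).rglob("*.wav")}
    return sorted(set(expected_ids) - present)
```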

Hi @raraz15, thank you for your thorough inspection. Firstly, the name test-dummy-db-100k refers to roughly 100k tracks, not exactly 100k. As you mentioned, while FMA is composed of about 100K tracks, it falls short of 100K once the test/validation hold-out is excluded. Honestly, I just called it 100k for ease of reference; sorry for the confusion!

Also, I haven't been able to identify the main cause of the performance difference. This repo is a reconstruction of the code I used when writing the paper, and there may have been bugs in the data configuration back then. In general, a difference of 6K songs in DB size would not change performance much, but if those songs happened to partially overlap with the test set, performance would drop. So your inference seems plausible.

Hello, author. Thank you for your remarkable contributions. I would like to ask whether you have figured out why the results of the provided project are better than those reported in your paper.

@EnthusiasticcitizenYe Unfortunately, so much time has passed that it is now difficult to reproduce the paper results identically, or to clearly identify the cause of the performance improvement. So if you cite this work for benchmarking, I recommend reporting the official re-implementation results alongside the result table in the paper. Sorry for the inconvenience.