sophos/SOREL-20M

Binaries directory doesn't contain legitimate files

Closed this issue · 2 comments

As you mentioned, there are ~20M hashes in meta.db with their classification, but the binaries directory on AWS contains only ~10M files, all of them malware.
It would be great if you could also upload the legitimate binaries to another directory so we could have a balanced data set.

Thanks in advance for your awesome job!

Thank you for your feedback!

Unfortunately, due to potential intellectual property issues, we cannot release the binaries for the 10M "benign" files. We realize that this is a drawback to the data set, but we haven't been able to find a good workaround for it yet.

All files, including the benign ones, can be obtained from ReversingLabs using the sha256 values in meta.db. We have heard (but not confirmed) that the majority of the benign files are also available from VirusTotal. The best solution we can offer at the moment is obtaining benign files directly from one of these two sources.

We're actively working on trying to find another source of benign files that we can share without this problem so that we can offer a balanced data set, and hope to include this in a future release.
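
For anyone who wants to follow the sha256 route described above, here is a minimal sketch of pulling the benign hashes out of meta.db so they can be looked up at ReversingLabs or VirusTotal. It assumes meta.db is a SQLite file with a table named `meta` and columns `sha256` and `is_malware` (0 for benign); those names are assumptions, so check them with `.schema` in the `sqlite3` shell first. The function names `iter_benign_sha256` and `write_hash_list` are hypothetical helpers, not part of the SOREL-20M code.

```python
import sqlite3
from typing import Iterator


def iter_benign_sha256(
    db_path: str,
    table: str = "meta",            # assumed table name
    hash_col: str = "sha256",       # assumed hash column
    label_col: str = "is_malware",  # assumed label column, 0 = benign
) -> Iterator[str]:
    """Yield the sha256 of every row whose label column equals 0."""
    # Identifiers cannot be bound as SQL parameters, so they are
    # interpolated here; only pass trusted table/column names.
    query = f"SELECT {hash_col} FROM {table} WHERE {label_col} = 0"
    conn = sqlite3.connect(db_path)
    try:
        # Iterating the cursor streams rows instead of loading ~10M at once.
        for (sha256,) in conn.execute(query):
            yield sha256
    finally:
        conn.close()


def write_hash_list(db_path: str, out_path: str) -> int:
    """Write one benign sha256 per line to out_path and return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for sha256 in iter_benign_sha256(db_path):
            fh.write(sha256 + "\n")
            count += 1
    return count
```

Calling `write_hash_list("meta.db", "benign_sha256.txt")` would produce a plain text list of hashes. How those hashes are then submitted to ReversingLabs or VirusTotal depends on the access you have with each service, so that step is left out here.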

Fine, looking forward to it!
But are there any suggestions or principles I should follow if I want to collect some benign binaries myself for testing this work, so that the results I get are more objective?