rdkit/benchmarking_platform

Where the score is the same, actives will rank higher than inactives


While using this benchmark, I noticed a small error in the following line:

scores[fp].append(sorted(single_score[fp], reverse=True))

...which occurs in several similarly-named Python scripts.

Each entry of single_score[fp] is a tuple of (simscore, id, active/inactive), so the sort does rank by similarity first, but ties are then broken by Id. The actives have Ids with 'A' in them, where the decoys have 'D', and so the actives rank higher whenever the similarity is the same.

Sorting by similarity alone is not enough to avoid the problem, however. Python's sort is stable, and the actives are added to the list first, so among entries with equal similarity they will always come ahead of the decoys. In other words, a random shuffle is needed before the sort. Here is a potential fix:

# at the top of the file: import random; random.seed(1)
random.shuffle(single_score[fp])
scores[fp].append(sorted(single_score[fp], reverse=True, key=lambda x: x[0]))
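To illustrate the stable-sort part of the problem, here is a minimal sketch on made-up data. The Ids, scores, and the list itself are hypothetical and only mimic the (simscore, id, active/inactive) layout described above; they are not taken from the benchmark.

```python
import random

# Hypothetical toy data: (simscore, id, active flag), actives appended first.
single_score = [
    (0.5, "A1", 1),
    (0.5, "A2", 1),
    (0.5, "D1", 0),
    (0.5, "D2", 0),
    (0.9, "D3", 0),
]

# Sorting on similarity only: the sort is stable, so the four tied
# entries keep their insertion order and the actives stay in front.
by_score = sorted(single_score, key=lambda x: x[0], reverse=True)
tied_ids = [entry[1] for entry in by_score[1:]]

# Shuffling a copy first means ties are broken at random instead of
# by insertion order. A seeded generator keeps the run reproducible.
rng = random.Random(1)
shuffled = list(single_score)
rng.shuffle(shuffled)
fixed = sorted(shuffled, key=lambda x: x[0], reverse=True)

print(tied_ids)
print([entry[1] for entry in fixed])
```

Without the shuffle, tied_ids always lists both actives before both decoys. With the shuffle, the order within the tie depends only on the seed, so averaged over seeds neither class is favoured.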