theochem/Selector

Implementing Gini coefficient in `metric.py`

FanwangM opened this issue · 4 comments


The related paper is Journal of Computational Chemistry 2016, 37, 2091–2097.

This relates to #4.

Just as a warning: @PaulWAyers and I tried to implement this and concluded that the equation is wrong.

I'll look at it again and figure out the right equation and put it here.

The equation given is correct only for the case where the data is uniformly distributed. Then Eq. (1) in this paper is identical to the second expression in the "alternative expressions" list on Wikipedia.

  • bitstrings
  • lists/arrays of descriptors where the data is centered and normalized.

Then, for each molecule, we have a feature vector (or bitstring) of length L. Let count(i), for i = 1, 2, ..., L, be the sum of the feature values for feature i over all the molecules, i.e.
count(i) = sum(m in molecules) features(m, i)
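
As a minimal sketch of the count vector, assuming the features are stored as a 2D array with one row per molecule and one column per feature (the name `feature_counts` and the array layout are illustrative assumptions, not something already in `metric.py`):

```python
import numpy as np


def feature_counts(features):
    """Sum each feature column over all molecules.

    `features` is assumed to have shape (n_molecules, L): one row per
    molecule, one column per feature (or bit). This layout is an
    assumption made for illustration.
    """
    features = np.asarray(features, dtype=float)
    # axis=0 sums over molecules, leaving one total per feature.
    return features.sum(axis=0)


# Three molecules described by 4-bit strings.
bits = np.array([[1, 0, 1, 1],
                 [0, 0, 1, 1],
                 [1, 0, 0, 1]])
count = feature_counts(bits)  # array([2., 0., 2., 3.])
```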

  1. sort the count vector in increasing order. This is just np.sort(count).
  2. evaluate Eq. (1) with the sorted vectors.
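
A rough sketch of these two steps, using the standard sorted-data form of the Gini coefficient, G = 2 * sum_i(i * y_i) / (n * sum_i(y_i)) - (n + 1) / n with y sorted in increasing order and i = 1..n. That this matches Eq. (1) of the paper term for term is an assumption, and the function name `gini_sorted` is hypothetical:

```python
import numpy as np


def gini_sorted(count):
    """Gini coefficient from a count vector, via the presorted formula.

    Assumes non-negative counts with a non-zero total; the formula is
    the standard sorted-data expression, taken here as a stand-in for
    Eq. (1) of the paper.
    """
    y = np.sort(np.asarray(count, dtype=float))  # step 1: increasing order
    n = y.size
    total = y.sum()
    if n == 0 or total == 0:
        raise ValueError("Gini coefficient is undefined for empty or all-zero counts.")
    ranks = np.arange(1, n + 1)  # 1-based ranks of the sorted values
    # step 2: evaluate the sorted-data expression
    return 2.0 * np.sum(ranks * y) / (n * total) - (n + 1.0) / n


print(gini_sorted([1, 1, 1, 1]))  # perfectly even counts give 0.0
print(gini_sorted([0, 0, 0, 1]))  # one feature holds everything: (n - 1) / n = 0.75
```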

I think it is nice to avoid sorting. Then you can use:
sum(i,j) |count(i) - count(j)| / (2 * L**2 * mean(count))

This is the third equation in the first line of https://en.wikipedia.org/wiki/Gini_coefficient#Definition. It might be a bit slower than the presorted version, but it does save the work (and code complexity) of the sort.
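
A minimal sketch of the sort-free version, using NumPy broadcasting to form all pairwise differences. The name `gini_pairwise` is hypothetical, and the code assumes non-negative counts with a non-zero mean:

```python
import numpy as np


def gini_pairwise(count):
    """Gini coefficient as the mean absolute difference over 2 * mean.

    Computes sum_{i,j} |count(i) - count(j)| / (2 * L**2 * mean(count)).
    No sorting is needed, but the intermediate L x L matrix costs
    O(L**2) time and memory.
    """
    y = np.asarray(count, dtype=float)
    n = y.size
    if n == 0:
        raise ValueError("Gini coefficient is undefined for an empty count vector.")
    mean = y.mean()
    if mean == 0:
        raise ValueError("Gini coefficient is undefined when the mean count is zero.")
    # y[:, None] - y[None, :] broadcasts to the full matrix of differences.
    diffs = np.abs(y[:, None] - y[None, :])
    return diffs.sum() / (2.0 * n**2 * mean)


print(gini_pairwise([1, 1, 1, 1]))  # 0.0
print(gini_pairwise([0, 0, 0, 1]))  # 0.75
```

For long feature vectors the L x L intermediate can become large, which is one practical argument for the presorted expression.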

Just using Eq. (1) is fine though.