flowersteam/explauto

Explaining the logic behind uniform_sensor_testcases?

Opened this issue · 3 comments

I am now looking into the test case generation. I find the method for generating uniform test cases over the sensory space very appealing. The code is here: https://github.com/flowersteam/explauto/blob/master/explauto/environment/testcase.py

My understanding is that a grid of a given resolution is projected on the sensory space, and that each cell is associated with only one observation from within that cell. My questions concern the resolution parameter:

  • Is it the number of cuts per dimension?
  • What is the logic behind the automatic calculation of the resolution: resolution = max(2, int((1.3*n)**(1.0/len(robot.s_feats)))) ?
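To make the second question concrete, here is the formula evaluated for a few values. This is only a sketch: `auto_resolution` and `n_dims` are names introduced for illustration (with `n_dims` standing in for `len(robot.s_feats)`), and the reading that the 1.3 factor aims for roughly 30% more cells than requested test cases is my assumption, not something the code states.

```python
def auto_resolution(n, n_dims):
    """Resolution formula quoted above; n_dims stands in for len(robot.s_feats)."""
    return max(2, int((1.3 * n) ** (1.0 / n_dims)))


# (n requested test cases, number of sensory dimensions)
for n, n_dims in [(20, 2), (100, 2), (100, 3)]:
    res = auto_resolution(n, n_dims)
    # res ** n_dims is the total number of grid cells if res is "cells per dimension"
    print(n, n_dims, res, res ** n_dims)
# prints:
# 20 2 5 25
# 100 2 11 121
# 100 3 5 125
```

If resolution is the number of cells per dimension, the total cell count stays close to 1.3*n before truncation, which would leave some slack for cells that end up empty.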

I also noticed this: # TODO : change obs only if nearer from center of coo.

From what I understand, the observation kept for each cell is the last one encountered during the _populate process. Is the TODO about keeping the observation closest to the center of the cell instead?
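A minimal sketch of how I read that TODO, assuming a regular grid over known bounds. This is not the explauto implementation: `grid_testcases` and its `bounds` argument are hypothetical names used only for illustration.

```python
import numpy as np


def grid_testcases(observations, resolution, bounds):
    """Keep, for each occupied grid cell, the observation closest to the cell center.

    bounds is a (mins, maxs) pair, one value per sensory dimension (hypothetical API).
    """
    obs = np.asarray(observations, dtype=float)
    mins, maxs = np.asarray(bounds, dtype=float)
    width = (maxs - mins) / resolution
    # Integer cell coordinates of each observation, clipped onto the grid.
    coords = np.clip(((obs - mins) / width).astype(int), 0, resolution - 1)
    cell_centers = mins + (coords + 0.5) * width
    dist_to_center = np.linalg.norm(obs - cell_centers, axis=1)
    best = {}  # cell coordinates -> index of the closest observation seen so far
    for i, coo in enumerate(map(tuple, coords)):
        # Replace the stored observation only if the new one is nearer the center.
        if coo not in best or dist_to_center[i] < dist_to_center[best[coo]]:
            best[coo] = i
    return obs[sorted(best.values())]
```

Empty cells simply produce no test case, so the number of returned points can still be lower than the number of cells.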

One effect of the grid system is that you do not always have an observation in each cell, so when you ask for 100 test_cases, you often end up with fewer.

Another method could be to use k-means, with k = number of test_cases, to find cluster centers, and then to select the observation closest to each cluster center. This ensures you get 100 test_cases if you ask for 100.
However, this is not really uniform, although it is a good approximation for k<<n_samples. And the code already generates 100 times more samples than test cases: observations = uniform_motor_testcases(robot, 100*n).
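Here is a rough sketch of that idea, assuming plain Lloyd iterations written with NumPy only so that it has no extra dependency. `kmeans_testcases` is a hypothetical helper, not part of explauto, and skipping observations that were already selected is my addition so that exactly k distinct points come back.

```python
import numpy as np


def kmeans_testcases(observations, k, n_iter=50, seed=0):
    """Run a basic k-means, then pick one distinct observation per cluster center."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(observations, dtype=float)
    # Initialise the centers on k distinct observations.
    centers = obs[rng.choice(len(obs), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest center.
        dists = np.linalg.norm(obs[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = np.array([
            obs[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # For each center, take the nearest observation that has not been selected yet.
    dists = np.linalg.norm(obs[:, None, :] - centers[None, :, :], axis=2)
    selected = []
    for j in range(k):
        order = np.argsort(dists[:, j])
        selected.append(int(next(i for i in order if i not in selected)))
    return obs[selected], centers
```

Usage would look like `testcases, centers = kmeans_testcases(observations, 20)`, which returns 20 observations as long as there are at least 20 to choose from.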

Below is a small example: data in blue (1000 points), k-means centers in red (20 points), selected observations in green (20 points).
screenshot from 2016-08-09 16 23 15

Here is a comparison between the two methods:

Dataset 1000 points.

Grid: ask for 20 points, got 18.
Selected in magenta (18 points)
screenshot from 2016-08-09 16 45 37

Kmeans: ask for 20 points, got 20.
K-means centers in red (20 points), selected in green (20 points)
screenshot from 2016-08-09 16 45 55

There is a pool of points in the bottom-left corner corresponding to failed experiments, so it is normal that a sample is selected there.

The resolution was computed automatically with the formula from the first post, which gave 5 here. So I guess this is a 5x5 grid, i.e. 25 cells, of which only 18 were populated.
Kmeans does look less uniform.

I think I will stick with k-means because it guarantees n points. But it is not optimal.

What we really want here is a kind of SOM with a constraint that the edges between neighboring units should be of similar length (after scaling the data between 0 and 1 in each dimension beforehand).
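The scaling step on its own is straightforward. A small sketch, where `scale_unit_range` is a hypothetical helper name and leaving constant dimensions at 0 is my own choice:

```python
import numpy as np


def scale_unit_range(observations):
    """Min-max scale each sensory dimension to [0, 1]."""
    obs = np.asarray(observations, dtype=float)
    mins = obs.min(axis=0)
    span = obs.max(axis=0) - mins
    # Avoid dividing by zero when a dimension is constant; it then maps to 0.
    span[span == 0] = 1.0
    return (obs - mins) / span
```

The SOM constraint itself is not sketched here.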