Eye-Robot

An experiment with image recognition and evolutionary algorithms

What is it?

A program for recognising shapes (be they polygons, printed letters or numbers, handwriting, or any other kind of symbol) in a monochrome image.

Why?

For me to get a feel for the sort of techniques an artificially intelligent application might make use of (particularly evolutionary algorithms), and indeed to explore whether this idea is even viable.

How does it work?

The software recognises symbols by sampling a subset of pixels distributed throughout the input image and checking whether each is set or unset (i.e. black or white). Each set pixel adds a fixed amount to a running score, and each unset pixel subtracts a fixed amount. The total decides how likely it is that a given image contains a particular symbol. To recognise a symbol reliably, then, we want to find the specific subset of pixels that maximises this score.
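To make the scoring concrete, here is a minimal sketch of the idea in C#. The names (`SampledScorer`, `HitScore`, `MissPenalty`) and the point values are my own illustration, not necessarily the repo's actual Scorer class:

```csharp
using System.Collections.Generic;
using System.Drawing;

// Minimal sketch of the scoring idea. A scorer owns a fixed subset of
// sample points; set pixels add points, unset pixels subtract them.
class SampledScorer
{
    private readonly IReadOnlyList<Point> samplePoints;
    private const int HitScore = 10;     // gained when a sampled pixel is set (black)
    private const int MissPenalty = 10;  // lost when a sampled pixel is unset (white)

    public SampledScorer(IReadOnlyList<Point> samplePoints)
    {
        this.samplePoints = samplePoints;
    }

    // image[x, y] == true means the pixel is set (black).
    public int Score(bool[,] image)
    {
        int total = 0;
        foreach (var p in samplePoints)
            total += image[p.X, p.Y] ? HitScore : -MissPenalty;
        return total;
    }
}
```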

Basically, the software is trying to answer the following question: for a given shape, which pixels are "significant"? The hope is that if trained to recognise the letter "A" in both serif and sans-serif fonts, the pixels that make up the serifs would be deemed irrelevant in determining whether the shape was actually an "A" or not.

For example, here is a 32×32px image of the letter 'a':

And here it is magnified:

We'll pick a number of random pixels to look at (highlighted in orange):

And if the pixels we look at are inside the shape, yay, we score some points. If they're outside the shape, boo, we lose some points.

Training the software

This is the process the software uses to decide which subset of pixels is relevant, and it happens before the software can recognise a symbol.

Initially, the way I was choosing this subset was by picking a set of pixels at random, as above, and scoring it against the training data (a sample image). This was repeated many, many times over, producing many permutations of pixels from within the image. Whichever permutation gave the highest score against the training data was chosen as the model.
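As a sketch, that generate-and-score loop might look like the following, reusing the hypothetical `SampledScorer` from above (the real training code is structured differently; this just shows the shape of the algorithm):

```csharp
using System;
using System.Drawing;
using System.Linq;

// Brute-force training sketch: generate many completely random pixel
// subsets, score each against the training image, and keep the best.
static class RandomSearchTrainer
{
    public static SampledScorer Train(bool[,] trainingImage, int rounds, int pixelsPerScorer)
    {
        int width = trainingImage.GetLength(0);
        int height = trainingImage.GetLength(1);
        var rng = new Random();

        SampledScorer best = null;
        int bestScore = int.MinValue;

        for (int round = 0; round < rounds; round++)
        {
            // Each candidate is a fresh, fully random subset of pixels.
            var points = Enumerable.Range(0, pixelsPerScorer)
                .Select(_ => new Point(rng.Next(width), rng.Next(height)))
                .ToList();

            var candidate = new SampledScorer(points);
            int score = candidate.Score(trainingImage);
            if (score > bestScore)
            {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}
```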

The idea was that this:

Would converge to this:

What did I learn?

I had expected that, through sheer brute-forced randomness, the training algorithm would eventually converge on a set of pixels tightly clustered on the shape being trained (as described above).

This did not happen, although given enough repetitions it surely would have. The examples below were generated by the software in this way, each using 200,000 rounds, and to me they look like random noise. Perhaps 200,000 tries was just not enough? More investigation needs to be done here. Despite this, these pixel distributions were better than chance at predicting the contents of a different image (though admittedly, those images didn't look vastly different from the training data).

The lesson I took away from that discovery is that, as with biological evolution, the process adapts towards whatever works, not necessarily towards whatever is optimal.

Where next?

When training the model, selecting the sampled pixels entirely at random did not work as well as I'd hoped.

During training, because each selected subset is completely random, each one is totally different from the last. If we have a high-scoring set of pixels, rather than trying something entirely different, it makes sense to try something only a little bit different, to see if we can get closer to a solution. I hypothesised that making sweeping changes from one generation to the next was not helping us converge on a solution.

From this, I've lifted two more ideas from biological processes to make changes more gradual. The first is mutation: in nature, mutations are small, not a total re-write from generation to generation as with our purely random pixel distributions. The second is breeding: mixing up aspects of two pixel-sets, in an attempt to combine the best bits of both and produce an overall higher-scoring child (except that rather than taking half its genes from each parent, the child takes half of its sampled pixels from each parent). The idea is that some of the child pixel distributions will score higher than the parents they were derived from, and those are the ones we select to take forward.

There is a third biological analogue in the algorithm, which might be considered something like a "plague" - that is, killing off the majority of the population and letting only the strongest survive into the next generation. I baked this idea into the algorithm by accident (it was done because of the memory constraints of holding a large population): I only take the top n scorers from each round into the mutation phase.
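To make those three ideas concrete, here is a compressed sketch of how they might look in code. The method names and parameters are my own illustration rather than the repo's implementation; the culling step uses PLINQ since, as mentioned below, scoring is already parallelised that way and each scorer computes its value independently:

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Sketch of the three biological analogues: small mutations, breeding
// (crossover), and the "plague" (truncation selection).
static class EvolutionSketch
{
    static readonly Random Rng = new Random();

    // Mutation: nudge a few sampled pixels rather than regenerating the whole set.
    public static List<Point> Mutate(List<Point> parent, int width, int height, int pixelsToMove)
    {
        var child = new List<Point>(parent);
        for (int i = 0; i < pixelsToMove; i++)
            child[Rng.Next(child.Count)] = new Point(Rng.Next(width), Rng.Next(height));
        return child;
    }

    // Breeding: the child takes half of its sampled pixels from each parent.
    public static List<Point> Breed(List<Point> mum, List<Point> dad)
    {
        int half = mum.Count / 2;
        return mum.Take(half).Concat(dad.Skip(half)).ToList();
    }

    // Plague: score everything (in parallel) and keep only the top n.
    public static List<List<Point>> Cull(IEnumerable<List<Point>> population, bool[,] trainingImage, int n)
    {
        return population
            .AsParallel()
            .OrderByDescending(p => new SampledScorer(p).Score(trainingImage))
            .Take(n)
            .ToList();
    }
}
```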

The first feature, mutation, I have implemented and analysed below. The second feature, breeding, is not implemented yet. Stay tuned!

Also coming up, some other things I'd like to look at:

  • So far, I'm getting reasonable (but not exceptional) results at recognising the image the software was trained on. I'd like to build up a wider library of samples (see /sample-data/ - there are a few different 'a' variants, but they are all quite similar). I'd also like to get some handwriting samples being recognised.
  • Performance
    • In principle, the training algorithm is extremely parallelisable as every scorer can compute its value independently of all others. (I have this working now via PLINQ on a single machine).
      • Now that the training is operating in parallel over PLINQ, I'd like to see if I could get this to operate over a cluster of computers, perhaps using something like DryadLINQ or Apache Spark.
    • I could convert the input bitmap to a 2-dimensional boolean array. This would save calling IsPixelSet repeatedly, avoiding the repeated brightness calculation - approximately a 4% saving, according to Visual Studio's profiler. (Done, but I still need to benchmark it to be sure; I was working in a coffee shop with a dwindling battery and didn't want to stress the laptop too much.) A sketch of this conversion follows the list. The training algorithm is also a memory pig.
  • Originally, TuningParams was a static bundle of constants. I'd like at some point to randomise these as well and pass varying sets of tuning options to different parts of the program, in an attempt to settle on a better training model (you can already see some of this in the code). This is partly because I chose the constant values by trial and error, so they may not be particularly good. Mutation could apply to these params too, instead of just the sampled pixels.
  • I'd like to add the ability to save a trained model to disk and reload it.
  • I'd like to add image normalisation (i.e. correcting input images for skew, sizing, spacing, etc.)
    • I began investigating how I might detect skew or rotation. I found this. Looks quite complicated....
  • Further into the future, I'm still pondering how I'd handle training more than one example image for a given symbol
    • Currently, scoring is quite naive: every hit pixel is worth the same points as any other. Perhaps we could give each pixel a weighting, derived as some sort of probability of that pixel being set or not, based upon the samples seen during training (a sketch of this idea follows the list).
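Here is a sketch of the bitmap-to-boolean-array conversion mentioned in the performance bullet above, so that the brightness calculation runs once per pixel instead of once per IsPixelSet call. The 0.5 brightness threshold is an assumption; the repo's IsPixelSet may use a different cut-off:

```csharp
using System.Drawing;

static class PixelGrid
{
    // Convert a monochrome bitmap into a 2-dimensional boolean array,
    // paying the per-pixel brightness cost exactly once.
    public static bool[,] FromBitmap(Bitmap bitmap)
    {
        var grid = new bool[bitmap.Width, bitmap.Height];
        for (int x = 0; x < bitmap.Width; x++)
            for (int y = 0; y < bitmap.Height; y++)
                grid[x, y] = bitmap.GetPixel(x, y).GetBrightness() < 0.5f; // dark = set
        return grid;
    }
}
```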
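And here is one way the per-pixel weighting could look: each pixel's weight is derived from how often it is set across the training samples for a symbol, so a pixel set in every sample is strong positive evidence, one set in none is strong negative evidence, and one set in half the samples is uninformative. This is entirely a sketch of the idea, not code from the repo:

```csharp
using System.Collections.Generic;
using System.Drawing;

static class WeightedScoring
{
    // For each pixel, estimate P(set) across the samples and map it
    // to a weight in [-1, +1]: -1 = never set, +1 = always set.
    public static double[,] BuildWeights(IReadOnlyList<bool[,]> samples)
    {
        int width = samples[0].GetLength(0);
        int height = samples[0].GetLength(1);
        var weights = new double[width, height];

        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
            {
                int setCount = 0;
                foreach (var sample in samples)
                    if (sample[x, y]) setCount++;
                weights[x, y] = 2.0 * setCount / samples.Count - 1.0;
            }
        return weights;
    }

    // A set pixel scores its weight; an unset pixel scores the negation,
    // so matching a usually-unset pixel also counts in the symbol's favour.
    public static double Score(bool[,] image, IReadOnlyList<Point> samplePoints, double[,] weights)
    {
        double total = 0;
        foreach (var p in samplePoints)
            total += image[p.X, p.Y] ? weights[p.X, p.Y] : -weights[p.X, p.Y];
        return total;
    }
}
```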

Mutation

I've implemented mutation as described above: the already highly-scoring pixel sets are selected, and a new generation of pixel sets is created by varying each original only slightly. This mutation can be seen being applied here.

Here is a sample of the results captured during training after mutation was added (generated with 25,000 rounds to create the initial generation):

| Letter | Best improvement | Average improvement | Worst regression | Average regression |
| --- | --- | --- | --- | --- |
| a | +350 / +25.93% | +231 / +17.01% | -630 / -44.06% | -466 / -34.18% |
| b | +370 / +41.57% | +228 / +24.02% | -600 / -59.34% | -432 / -45.29% |
| c | +370 / +48.53% | +216 / +29.09% | -630 / -77.14% | -411 / -55.27% |
| d | +340 / +30.93% | +216 / +20.66% | -610 / -61.00% | -438 / -41.93% |
| e | +450 / +54.88% | +223 / +25.34% | -630 / -72.41% | -430 / -48.45% |
| f | +370 / +115.63% | +199 / +66.13% | -500 / -200.00% | -367 / -121.40% |
| Aggregated average | +375 / +52.91% | +218.84 / +30.375% | -600 / -85.67% | -424 / -57.75% |

So, the mutations are able to produce better scores than the Scorer they are derived from. The question is: does this produce a better result than simply applying more rounds of the random brute-forcing method first implemented to get the initial generation? More analysis to follow!

Despite that open question, a quick glance at the table above shows that the average improvement in this case is +200 points or thereabouts, and the average regression is around -400 or so. These results are fairly repeatable using the sample images checked into the repo. This does seem to suggest that when small mutations are applied to already-high scorers, the good results improve more than the bad ones devolve.

Known or expected weaknesses

Feeling relatively happy with the way the software was recognising a handful of letters, I decided to step it up a bit and add more letters, and also some simple polygons.

The first thing that gave me pause was the fact that some letters, such as the 'm' below, are quite chunky and fill the frame:

Whereas something like this 'j' is thin and spindly:

I'm pondering if this might affect accuracy. More investigation and analysis to follow....

The second "uh-oh" moment I spotted was when adding shapes to the sample data. Here we have a circle, a pentagon, and a triangle.

So far so good. Here we have a square:

Hmm. Every pixel is set. No matter what distribution of pixels our algorithm selects to sample, it will get a 100% hit rate.

The third problem, which I noticed after adding numerous samples, is that training is taking longer. I predicted this earlier; as noted then, the algorithm is trivially parallelisable, and I'd like to experiment with scaling it out to a cluster of machines at some point.

One last problem: the algorithm is getting pretty good at recognising the images it was trained on. Getting it to recognise similar images is the next step (I've already done this with the letter 'a' in the sample data, but more needs to be done).