Add random forest image matcher to utilize different image features
KilianB opened this issue · 2 comments
If we have labeled test data, we can do better than directly comparing distances to guess whether two images are duplicates.
With different hashing algorithms focusing on different criteria such as color, gradient, and frequency, we might get better results by combining them with a simple technique like a random forest.
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
A quick implementation will be added shortly.
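
To make the idea concrete, here is a minimal sketch (not the project's implementation) of how several hash distances could be turned into one feature vector per image pair and then combined by a vote. The function names, the 8-bit hashes, and the hand-set thresholds are all hypothetical; in a real forest each voter would be a trained tree rather than a fixed threshold stump.

```python
def normalized_hamming(a: int, b: int, bits: int) -> float:
    """Fraction of differing bits between two hashes of equal length."""
    return bin(a ^ b).count("1") / bits


def feature_vector(hashes_a, hashes_b, bit_lengths):
    """One normalized distance per hashing algorithm (e.g. color, gradient, frequency)."""
    return [
        normalized_hamming(a, b, n)
        for a, b, n in zip(hashes_a, hashes_b, bit_lengths)
    ]


def forest_vote(features, stumps):
    """Majority vote over (feature_index, threshold) stumps.

    Each stump votes 'duplicate' when its distance is at or below the threshold.
    These stumps stand in for trained trees; the thresholds are assumptions.
    """
    votes = sum(1 for index, threshold in stumps if features[index] <= threshold)
    return votes * 2 > len(stumps)


# Hypothetical 8-bit hashes from three different algorithms for two images.
bit_lengths = [8, 8, 8]
image_a = [0b10110010, 0b11110000, 0b00001111]
image_b = [0b10110011, 0b11110000, 0b11110000]

features = feature_vector(image_a, image_b, bit_lengths)
stumps = [(0, 0.2), (1, 0.2), (2, 0.2)]
is_duplicate = forest_vote(features, stumps)
```

Here two of the three hashes agree closely while the third disagrees completely, so a single distance threshold on the third hash would reject the pair while the vote accepts it.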
Which metric do we want to optimize? True positives? Gini impurity does not work in its bare form because of the way test cases are generated from labeled images: we end up with highly unbalanced classes.
F1 looks promising at the moment.
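
A small sketch of the imbalance, assuming test cases are built by pairing every labeled image with every other one (the exact generation scheme is an assumption, as are the helper names). With 10 groups of 2 duplicates each, only 10 of the 190 pairs are positives, so the Gini impurity of the root node is already close to zero, and a classifier that never predicts "duplicate" looks accurate while scoring an F1 of 0.

```python
from itertools import combinations


def labeled_pairs(groups):
    """Yield (image_a, image_b, is_duplicate) for every unordered image pair."""
    images = [(img, gid) for gid, group in enumerate(groups) for img in group]
    for (a, group_a), (b, group_b) in combinations(images, 2):
        yield a, b, group_a == group_b


def gini_impurity(labels):
    """Binary Gini impurity: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for the 'duplicate' class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data set: 10 groups, each holding 2 images of the same motif.
groups = [[f"img{g}_{i}" for i in range(2)] for g in range(10)]
labels = [is_dup for _, _, is_dup in labeled_pairs(groups)]

root_gini = gini_impurity(labels)            # close to 0.1, far from the 0.5 maximum
always_distinct = [False] * len(labels)      # never predicts "duplicate"
accuracy = sum(p == a for p, a in zip(always_distinct, labels)) / len(labels)
f1 = f1_score(always_distinct, labels)
```

Because F1 ignores true negatives, it is not inflated by the large number of non-duplicate pairs, which is one plausible reason it behaves better here.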
Are there any slim random forest implementations available (preferably supporting the C4.5 algorithm)? Everything I have found so far will lead to an explosion of the dependency tree. ...