Tuning Content Similarity Parameters
For the MovieLens graph on the discovery VM, even with a score of 0.75, these pairs are undetected:
[('movies2actors', 'actorid', 'actors', 'actorid'),
 ('movies2actors', 'movieid', 'movies', 'movieid'),
 ('movies2directors', 'movieid', 'movies', 'movieid'),
 ('movies2directors', 'directorid', 'directors', 'directorid'),
 ('u2base', 'movieid', 'movies', 'movieid'),
 ('u2base', 'userid', 'users', 'userid')]
I have tried these sets of parameters:
LABEL_SIM_THRESHOLD = 0.75
BOOLEAN_SIM_THRESHOLD = 0.75
EMBEDDING_SIM_THRESHOLD = 1.0
and
LABEL_SIM_THRESHOLD = 0.75
BOOLEAN_SIM_THRESHOLD = 0.75
EMBEDDING_SIM_THRESHOLD = 0.75
Do you have any suggestions for how I could tune this to get better results?
For now, the embedding similarity threshold is (1) not normalized to the range 0 to 1, and (2) a distance threshold rather than a similarity threshold, i.e. higher values cause more column pairs to be treated as similar. Try increasing the threshold to 1.25, 1.5, ... and let us know what you find.
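To illustrate why a larger value admits more matches when the threshold is a distance cutoff, here is a minimal sketch. It is not the project's implementation: the Euclidean metric, the helper `matches_under_threshold`, and the 2-D vectors in `toy_embeddings` are all assumptions made up for illustration, since the thread does not say how the embedding distance is computed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matches_under_threshold(embeddings, threshold):
    """Return column pairs whose embedding distance is <= threshold.

    Hypothetical helper: a pair counts as a match when its distance is
    at or below the cutoff, so raising the cutoff can only add pairs.
    """
    names = sorted(embeddings)
    return [
        (left, right)
        for i, left in enumerate(names)
        for right in names[i + 1:]
        if euclidean(embeddings[left], embeddings[right]) <= threshold
    ]


# Made-up 2-D vectors, purely illustrative (not real MovieLens embeddings).
toy_embeddings = {
    "movies.movieid": (0.0, 0.0),
    "u2base.movieid": (0.9, 0.0),
    "users.userid": (0.0, 1.4),
}

# Sweep the same values discussed in the thread.
for threshold in (0.75, 1.0, 1.25, 1.5):
    print(threshold, matches_under_threshold(toy_embeddings, threshold))
```

In this toy setup no pair is matched at 0.75, and the set of matched pairs only grows as the cutoff rises. The same sweep also shows the trade-off: a cutoff that is too generous starts to admit pairs that are not real joins, so it is worth checking the results at each step.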