Why use DBSCAN?
Your work is really awesome!
EM is used in LearnSPN, but DBSCAN is used in this project. I wonder why. Does it have better performance?
Thank you.
There isn't a strong motive behind this. Our first choice was k-means, since it was the easiest to implement. But because k-means is restricted to a fixed number of clusters, I decided to experiment with clustering algorithms that find the number of clusters dynamically, and DBSCAN seemed the easiest to implement at the time. I do plan on implementing EM, and it would be interesting to compare results from both clustering algorithms. I suspect EM will yield better results, as DBSCAN may wrongly classify instances as noise under poor parameter choices.

As for complexity, there isn't much difference: both DBSCAN and EM should run quite fast. The real bottleneck in LearnSPN is the variable independence step, since finding pairwise independence takes a lot of time. GoSPN alleviates this with a Union-Find heuristic, but it can still take quite a while (if I'm not mistaken, the last time I measured it, variable independence took 90% of GoSPN's run time, versus 7% for DBSCAN). For reference, a subset of Caltech-101 with only 300 images normalized to 150x65 pixels, run as a classification job, can take up to an hour. By that measurement, that's roughly 4 minutes of DBSCAN versus 54 minutes of independence testing! That said, I don't think this should change much when switching to EM.
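To make the Union-Find idea concrete, here is a rough sketch (not GoSPN's actual code) of how dependent variables get merged into components. The pairwise test `dep` is a placeholder for the real independence test:

```go
// Sketch: partitioning variables into independent groups with union-find.
// Not GoSPN's actual implementation; dep is a placeholder for the real
// pairwise (e.g. chi-square) dependence test.
package main

import "fmt"

type unionFind struct{ parent []int }

func newUnionFind(n int) *unionFind {
	p := make([]int, n)
	for i := range p {
		p[i] = i
	}
	return &unionFind{parent: p}
}

func (u *unionFind) find(x int) int {
	for u.parent[x] != x {
		u.parent[x] = u.parent[u.parent[x]] // path halving
		x = u.parent[x]
	}
	return x
}

func (u *unionFind) union(a, b int) { u.parent[u.find(a)] = u.find(b) }

// independentGroups partitions variables 0..n-1 into connected
// components of the dependence graph. dep reports whether variables
// i and j fail the independence test (i.e., are dependent).
func independentGroups(n int, dep func(i, j int) bool) [][]int {
	u := newUnionFind(n)
	for i := 0; i < n; i++ {
		for j := i + 1; j < n; j++ {
			if u.find(i) == u.find(j) {
				continue // already linked; skip the expensive test
			}
			if dep(i, j) {
				u.union(i, j)
			}
		}
	}
	byRoot := make(map[int][]int)
	for i := 0; i < n; i++ {
		r := u.find(i)
		byRoot[r] = append(byRoot[r], i)
	}
	groups := make([][]int, 0, len(byRoot))
	for _, g := range byRoot {
		groups = append(groups, g)
	}
	return groups
}

func main() {
	// Toy dependence relation: variables of the same parity depend on
	// each other, so we expect two groups.
	dep := func(i, j int) bool { return i%2 == j%2 }
	fmt.Println(independentGroups(6, dep))
}
```

The saving comes from skipping the pairwise test whenever two variables already share a component.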
I agree with you. Comparing results from both clustering algorithms would be interesting.
As for time complexity, you may want to look at the project spyn. It implements LearnSPN using NumPy, so I guess it may be faster.
I think testing log-likelihood on the twenty real-world datasets is a good way to check whether your implementation is correct, and whether it performs better or worse.
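For example, a minimal sketch of that comparison, assuming a hypothetical Model interface with a LogProb method (not your actual API):

```go
// Sketch: average test log-likelihood, the usual per-instance metric
// for comparing SPN learners. Model and LogProb are hypothetical
// stand-ins, not GoSPN's real types.
package main

import (
	"fmt"
	"math"
)

type Model interface {
	LogProb(instance []int) float64
}

// uniform is a toy stand-in assigning equal probability to every
// joint assignment of `vars` binary variables.
type uniform struct{ vars int }

func (u uniform) LogProb(_ []int) float64 {
	return -float64(u.vars) * math.Log(2)
}

// avgLogLikelihood averages per-instance log-probabilities over a
// held-out test set; higher is better.
func avgLogLikelihood(m Model, test [][]int) float64 {
	var sum float64
	for _, inst := range test {
		sum += m.LogProb(inst)
	}
	return sum / float64(len(test))
}

func main() {
	test := [][]int{{0, 1, 0}, {1, 1, 0}}
	fmt.Printf("avg log-likelihood: %.4f\n", avgLogLikelihood(uniform{vars: 3}, test))
}
```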
Spyn looks nice. I have my reservations about Python, though. I chose Go for its native concurrency and nice syntax, but mainly because of its performance. GoSPN uses GNU GSL (which is written in C) for computing chi-square, so I expect the real cost to be iterating through the graph and summing the frequencies. There may be a better solution than the one I implemented, though. Any suggestions and ideas are always welcome, of course.
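To illustrate what I mean by summing the frequencies, here's a rough sketch of the contingency-table step for two discrete variables. The chi-square statistic is computed in plain Go here; the p-value lookup is the part that gets delegated to GSL, so I've left it out:

```go
// Sketch: chi-square statistic for a pairwise independence test.
// Builds a contingency table by summing frequencies, then compares
// observed against expected counts. The p-value lookup is omitted.
package main

import "fmt"

// chiSquare computes the chi-square statistic for samples of two
// discrete variables, where x[k] takes values in [0, rx) and y[k]
// in [0, ry).
func chiSquare(x, y []int, rx, ry int) float64 {
	n := len(x)
	obs := make([][]float64, rx)
	for i := range obs {
		obs[i] = make([]float64, ry)
	}
	rowSum := make([]float64, rx)
	colSum := make([]float64, ry)
	for k := 0; k < n; k++ {
		obs[x[k]][y[k]]++
		rowSum[x[k]]++
		colSum[y[k]]++
	}
	var chi float64
	for i := 0; i < rx; i++ {
		for j := 0; j < ry; j++ {
			// Expected count under independence.
			exp := rowSum[i] * colSum[j] / float64(n)
			if exp > 0 {
				d := obs[i][j] - exp
				chi += d * d / exp
			}
		}
	}
	return chi
}

func main() {
	x := []int{0, 0, 1, 1, 0, 1}
	y := []int{0, 0, 1, 1, 1, 0}
	fmt.Printf("chi-square = %.3f\n", chiSquare(x, y, 2, 2))
}
```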
I'm still working on testing everything. I haven't gotten around to testing on real-world datasets yet. For now I'm testing on the Caltech and Olivetti datasets, plus a custom hand-drawn digit dataset.