Time complexity
Closed this issue · 5 comments
I have question about time complexity. I made som experiments and I got O(n log n) with using DBSCAN for clustering and O(n^2) with using Quality Threshold algorithm. I tried the first method on Jain, Flame, Compound and t4.8k and second method on MNIST. Can you explain O(n log n) complexity? Because I think that time complexity for SOM is O(n^2). Thanks
i have same questions, and sps.run_colorsExample() trained slowly~
Hi @lipino0111, could you give me more information on the experiments you ran? Thanks.
Just to be clear, you run them using SOM followed by DBSCAN and QTC to evaluate the time complexity, but you used two different datasets? If that's the case I suggest you be consistent with the datasets when testing different methods or they may not be easily comparable. Also, did you run multiple experiments and get O(nlogn) as average/median or was it just from one case?
The O(n^2) (with n number of samples) complexity is on the SOM part only (see this paper) and still represents the worst case scenario.
The complexity on the clustering part depends on the method you choose, the latest sklearn implementation of DBSCAN is O(nd) where d is the number of neighbours (see their API page).
Regarding the slowness, @fangxu622 this library has a basic implementation of SOM and is not optimized for speed yet. I know there are a number of areas where it could be improved, and it could also be partially parallelized, but I haven't come around to do that yet. You are more than welcome to contribute if you'd like.
I tried SOM followed by DBSCAN on data set Jain, Compound and t4.8k. I run every data set multiple time and I got O(n^2) on Jain and Compound and O(n log n) on Compound. I used average time from all experiments.
I think that it depands on net size because when I used net size 30x30 for Jain which has 373 objects I got O(n log n) and when I used net size 8x8 I got O(n^2).
What happens if you use more datapoints? The O(n^2) term comes from looping over the samples, it's possible that it dominates other components (looping over the nodes or epochs) only when samples >> nodes.