eren-ck/st_dbscan

Usage of squareform

Closed this issue · 1 comments

Hello, thanks for a nice and straightforward implementation of the ST-DBSCAN algorithm!

I ended up looking at your source and tried to understand it myself. What you essentially did is that you feed a distance matrix that sets the distances that do not meet the temporal eps to doubled the spatial eps, and then call sklearn's DBSCAN on the distance matrix.

For this block of code in st_dbscan.py:

time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric=self.metric))
euc_dist = squareform(pdist(X[:, 1:], metric=self.metric))

# filter the euc_dist matrix using the time_dist
dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

db = DBSCAN(eps=self.eps1, min_samples=self.min_samples, metric='precomputed')
db.fit(dist)

You called squareform twice to form two square matrices for computed time and spatial distances. And then dist will be a third square matrix that has the same dimension as time_dist and euc_dist. This means you will have three matrices with relatively large size in terms of memory usage. This of course depends on the data size. For my data, they all have > 50000 rows (that results approx. 50000 by 50000 matrices, my data is float64, so each matrix is over 16GB of memory) so the algorithm breaks without processing the data in chunks.

What I would do is this:

time_dist = pdist(X[:, 0].reshape(n, 1), metric=self.metric)
euc_dist = pdist(X[:, 1:], metric=self.metric)

# filter the euc_dist matrix using the time_dist
dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

db = DBSCAN(eps=self.eps1, min_samples=self.min_samples, metric='precomputed')
db.fit(squareform(dist))

In this case, only one square matrix is needed.

Hello Wang,
thanks for the input. You're absolutely right. I have adapted the code and updated the library on PyPi.

Cheers,
Eren