Suggestion - refactor(Binseg): use priority queue
Lucas-Prates opened this issue · 2 comments
Hello again!
I was looking at the implementation of the Binseg class, more specifically the fitting part of the `_seg` method:
```python
bkps = [self.n_samples]
stop = False
while not stop:
    stop = True
    # re-search every current segment for its best single breakpoint;
    # pairwise (from ruptures.utils) yields consecutive (start, end) pairs
    new_bkps = [
        self.single_bkp(start, end) for start, end in pairwise([0] + bkps)
    ]
    bkp, gain = max(new_bkps, key=lambda x: x[1])
    if bkp is None:  # all possible configurations have been explored
        break
    if n_bkps is not None:
        if len(bkps) - 1 < n_bkps:
            stop = False
    elif pen is not None:
        if gain > pen:
            stop = False
    elif epsilon is not None:
        error = self.cost.sum_of_costs(bkps)
        if error > epsilon:
            stop = False
    if not stop:
        bkps.append(bkp)
        bkps.sort()
```
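(For context, `pairwise` just yields consecutive pairs, so each pass of the loop re-searches every current segment. A quick illustration, with made-up breakpoint values:)

```python
from itertools import pairwise  # Python 3.10+; ruptures ships its own equivalent

bkps = [120, 250, 400]
print(list(pairwise([0] + bkps)))
# [(0, 120), (120, 250), (250, 400)] -- one single_bkp search per segment, every iteration
```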
At first glance, it looked like it could be improved a bit: for instance, avoiding recreating `new_bkps`, avoiding `pairwise`, avoiding re-sorting `bkps`, and avoiding recomputing `self.single_bkp` on already-computed indices. For the last point, however, the `@lru_cache` decorator already seems to do a great job of avoiding recomputation.
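To make the caching point concrete, here is a minimal, self-contained sketch of that memoization pattern. The `Segmenter` class and its squared-error cost are invented for illustration; only the `(start, end) -> (bkp, gain)` caching mirrors what the package does:

```python
from functools import lru_cache

class Segmenter:
    """Toy stand-in for Binseg, only to illustrate the caching pattern."""

    def __init__(self, signal):
        self.signal = signal
        # Wrap the expensive search so repeated (start, end) queries are free,
        # mirroring how the package memoizes single_bkp.
        self.single_bkp = lru_cache(maxsize=None)(self._single_bkp)

    def _single_bkp(self, start, end):
        # Naive best-split search by squared-error gain (invented cost;
        # the point here is the memoization, not the cost model).
        def sse(lo, hi):
            seg = self.signal[lo:hi]
            mean = sum(seg) / len(seg)
            return sum((x - mean) ** 2 for x in seg)

        best_bkp, best_gain = None, 0.0
        for mid in range(start + 1, end):
            gain = sse(start, end) - sse(start, mid) - sse(mid, end)
            if gain > best_gain:
                best_bkp, best_gain = mid, gain
        return best_bkp, best_gain

seg = Segmenter([0.0] * 50 + [5.0] * 50)
print(seg.single_bkp(0, 100))  # computed once
print(seg.single_bkp(0, 100))  # served from the cache
```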
I tried to address those points using a priority queue (a min-heap via the `heapq` module), as follows:
```python
# assumes `import heapq` at module level
bkps = [self.n_samples]
start, end = 0, self.n_samples
bkp, gain = self.single_bkp(start, end)
node = (-gain, start, bkp, end)
bkp_queue = [node]
heapq.heapify(bkp_queue)  # min-heap; gains are stored negated
stop = False
while not stop:
    stop = True
    gain, start, bkp, end = heapq.heappop(bkp_queue)
    gain = -gain
    if bkp is None:  # all possible configurations have been explored
        break
    if n_bkps is not None:
        if len(bkps) - 1 < n_bkps:
            stop = False
    elif pen is not None:
        if gain > pen:
            stop = False
    elif epsilon is not None:
        # bkps is only sorted once at the end here, so sort before summing
        error = self.cost.sum_of_costs(sorted(bkps))
        if error > epsilon:
            stop = False
    if not stop:
        bkps.append(bkp)
        # searches (start, bkp) and pushes to queue
        node_bkp, node_gain = self.single_bkp(start, bkp)
        node = (-node_gain, start, node_bkp, bkp)
        heapq.heappush(bkp_queue, node)
        # searches (bkp, end) and pushes to queue
        node_bkp, node_gain = self.single_bkp(bkp, end)
        node = (-node_gain, bkp, node_bkp, end)
        heapq.heappush(bkp_queue, node)
bkps.sort()
```
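One detail worth spelling out: `heapq` only implements a min-heap, so gains are pushed negated and flipped back on pop, which makes each pop return the highest-gain candidate. In isolation (with made-up tuples):

```python
import heapq

# Min-heap on (-gain, start, bkp, end): popping yields the largest gain first.
queue = [(-0.7, 0, 4, 10), (-2.3, 10, 13, 20), (-1.1, 20, 25, 30)]
heapq.heapify(queue)

neg_gain, start, bkp, end = heapq.heappop(queue)
print(-neg_gain, (start, bkp, end))  # 2.3 (10, 13, 20): the best split candidate
```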
This implementation also avoids recomputation, without using `@lru_cache`. I tested it, and the output is exactly the same as the previous implementation's. However, there was almost no performance gain. I ran a simulation study with small sample sizes, also varying the number of breaks (20%, 40%, 60%, 80% of the sample size), and the figure below shows the results.
Hence, I did not provide a PR, since there is no clear advantage to this implementation and it concerns a very important part of the package. Maybe there is an advantage for large datasets, since cache misses might become more common. However, I know little about the subject, and my PC does not allow me to run simulations with huge datasets.
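For anyone with the hardware to extend the study to larger signals, a rough harness along these lines could serve as a starting point. `pw_constant` and the `Binseg` fit/predict calls are ruptures' public API; the grid of sizes, break counts, and the repetition count are arbitrary choices for this sketch:

```python
import timeit

import ruptures as rpt

# Rough harness: time Binseg fit+predict over a grid of sample sizes and
# breakpoint counts. All grid values are arbitrary illustration choices.
for n_samples in (1_000, 5_000, 20_000):
    for n_bkps in (5, 20, 50):
        signal, _ = rpt.pw_constant(n_samples, n_features=1, n_bkps=n_bkps, noise_std=1.0)
        runtime = timeit.timeit(
            lambda: rpt.Binseg(model="l2").fit(signal).predict(n_bkps=n_bkps),
            number=3,
        )
        print(f"n={n_samples:>6}  n_bkps={n_bkps:>3}  {runtime / 3:.3f}s per run")
```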
Thank you!
Thanks again!
This seems like a good idea. We'll look into it shortly.
Closing for now.