deepcharles/ruptures

Suggestion - refactor(Binseg): use priority queue

Lucas-Prates opened this issue · 2 comments

Hello again!

I was looking at the implementation of the Binseg class, more specifically at the fitting part of the _seg method:

bkps = [self.n_samples]
stop = False
while not stop:
    stop = True
    new_bkps = [
        self.single_bkp(start, end) for start, end in pairwise([0] + bkps)
    ]
    bkp, gain = max(new_bkps, key=lambda x: x[1])

    if bkp is None:  # all possible configurations have been explored.
        break

    if n_bkps is not None:
        if len(bkps) - 1 < n_bkps:
            stop = False
    elif pen is not None:
        if gain > pen:
            stop = False
    elif epsilon is not None:
        error = self.cost.sum_of_costs(bkps)
        if error > epsilon:
            stop = False

    if not stop:
        bkps.append(bkp)
        bkps.sort()
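
For context, pairwise yields consecutive pairs, so pairwise([0] + bkps) enumerates the current segments as (start, end) intervals. A quick illustration (using itertools.pairwise from Python 3.10+; ruptures ships an equivalent helper in ruptures.utils):

from itertools import pairwise

bkps = [120, 250, 500]
print(list(pairwise([0] + bkps)))  # [(0, 120), (120, 250), (250, 500)]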

At first glance, it looked like it could be improved a bit: for instance, by avoiding recreating new_bkps on every iteration, avoiding pairwise, avoiding re-sorting bkps, and avoiding recomputing self.single_bkp on already-explored intervals. For the last point, however, the @lru_cache decorator already seems to do a great job of avoiding recomputation.
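
As a minimal sketch of why @lru_cache already handles the recomputation (the real single_bkp is a method on the estimator; this standalone function is only illustrative):

from functools import lru_cache

@lru_cache(maxsize=None)
def single_bkp(start, end):
    # stand-in for the expensive best-split search on the interval [start, end)
    print(f"computing ({start}, {end})")  # printed only on a cache miss
    return (start + end) // 2, float(end - start)  # dummy (bkp, gain)

single_bkp(0, 100)  # computed
single_bkp(0, 100)  # served from the cache; the body is not re-run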

I tried to address those points using a priority queue (a min-heap from the heapq module), as follows:

import heapq

bkps = [self.n_samples]
start, end = 0, self.n_samples
bkp, gain = self.single_bkp(start, end)

# negate the gain so heapq's min-heap pops the largest gain first
node = (-gain, start, bkp, end)
bkp_queue = [node]
heapq.heapify(bkp_queue)
stop = False
while not stop:
    stop = True
    gain, start, bkp, end = heapq.heappop(bkp_queue)
    gain = -gain  # restore the sign flipped for the min-heap

    if bkp is None:  # all possible configurations have been explored.
        break

    if n_bkps is not None:
        if len(bkps) - 1 < n_bkps:
            stop = False
    elif pen is not None:
        if gain > pen:
            stop = False
    elif epsilon is not None:
        error = self.cost.sum_of_costs(bkps)
        if error > epsilon:
            stop = False

    if not stop:
        bkps.append(bkp)
        # searches (start, bkp) and pushes to queue
        node_bkp, node_gain = self.single_bkp(start, bkp)
        node = (-node_gain, start, node_bkp, bkp)
        heapq.heappush(bkp_queue, node)

        # searches (bkp, end) and pushes to queue
        node_bkp, node_gain = self.single_bkp(bkp, end)
        node = (-node_gain, bkp, node_bkp, end)
        heapq.heappush(bkp_queue, node)
bkps.sort()
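
For readers unfamiliar with the negation trick: heapq only provides a min-heap, so storing -gain as the first tuple element makes the split with the largest gain pop first. A tiny self-contained example (values made up):

import heapq

queue = [(-5.0, "split A"), (-12.5, "split B"), (-1.2, "split C")]
heapq.heapify(queue)

neg_gain, label = heapq.heappop(queue)
print(-neg_gain, label)  # 12.5 split B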

This implementation also avoids recomputation, without relying on @lru_cache. I tested it, and the output is exactly the same as with the previous implementation. However, there was almost no performance gain. I ran a simulation study with small sample sizes, also varying the number of breakpoints (20%, 40%, 60%, 80% of the sample size), and the figure below shows the results.

[figure: runtime_comparison — runtime of both implementations across the simulated settings]

Hence, I did not open a PR, since there is no clear advantage to this implementation and it concerns a very important part of the package. Maybe for large datasets there would be an advantage, since cache misses might start to become more common. However, I know little about the subject, and my PC does not allow me to run simulations with huge datasets.
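
For anyone who wants to probe larger signals, a minimal timing harness along these lines could be used (the synthetic signal comes from ruptures' pw_constant generator; all parameter values are illustrative):

import time

import ruptures as rpt

# illustrative sizes only; larger values probe the cache-miss regime
n_samples, n_bkps = 5000, 10
signal, _ = rpt.pw_constant(n_samples, n_features=1, n_bkps=n_bkps, noise_std=1.0)

algo = rpt.Binseg(model="l2").fit(signal)
tic = time.perf_counter()
algo.predict(n_bkps=n_bkps)
print(f"Binseg predict took {time.perf_counter() - tic:.3f} s")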

Thank you!

Thanks again!
This seems like a good idea. We'll look into it shortly.

Closing for now.