hongzimao/decima-sim

a hidden bug in the function named update_frontier_nodes

Closed this issue · 3 comments

I think there is a bug in the following function, defined in spark_env/job_dag.py:

def update_frontier_nodes(self, node):
    frontier_nodes_changed = False
    for child in node.child_nodes:
        if child.is_schedulable():
            if child.idx not in self.frontier_nodes:   # bug is here
                self.frontier_nodes.add(child)
                frontier_nodes_changed = True
    return frontier_nodes_changed

self.frontier_nodes stores the nodes themselves, not their indices, so the membership check should compare against the node rather than child.idx. That said, this did not have a significant effect on the training results.

Very nice catch! It's indeed a mistake. Fortunately self.frontier_nodes is a set, so child won't be duplicated. The effect of this bug is that frontier_nodes_changed becomes True more often than intended.
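For reference, a minimal sketch of the corrected check, assuming frontier_nodes stays a plain set of node objects and everything else is left as in the original:

def update_frontier_nodes(self, node):
    frontier_nodes_changed = False
    for child in node.child_nodes:
        if child.is_schedulable():
            if child not in self.frontier_nodes:  # compare the node, not its index
                self.frontier_nodes.add(child)
                frontier_nodes_changed = True
    return frontier_nodes_changed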

I guess the underlying problem behind these bugs was that we didn't properly unit test all the modules. Even though this was research code, I think this level of complexity already calls for proper testing. It was an oversight and we should prevent it in future projects.
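As an illustration, here is a rough sketch of the kind of unit test that would have flagged the original membership check. FakeNode and FakeDAG are hypothetical lightweight stand-ins, not the real spark_env classes (which carry much more state); the method body is the fixed version shown above.

# Hypothetical stand-ins for illustration only; not the real Node/JobDAG classes.
class FakeNode:
    def __init__(self, idx, schedulable=True):
        self.idx = idx
        self.child_nodes = []
        self._schedulable = schedulable

    def is_schedulable(self):
        return self._schedulable


class FakeDAG:
    def __init__(self):
        self.frontier_nodes = set()

    # Same body as the fixed update_frontier_nodes sketched above.
    def update_frontier_nodes(self, node):
        frontier_nodes_changed = False
        for child in node.child_nodes:
            if child.is_schedulable():
                if child not in self.frontier_nodes:
                    self.frontier_nodes.add(child)
                    frontier_nodes_changed = True
        return frontier_nodes_changed


def test_second_call_reports_no_change():
    dag, parent, child = FakeDAG(), FakeNode(0), FakeNode(1)
    parent.child_nodes = [child]
    assert dag.update_frontier_nodes(parent) is True   # first call adds the child
    assert dag.update_frontier_nodes(parent) is False  # the buggy version returned True here


test_second_call_reports_no_change()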

Thanks for your reply!

Also, the training procedure takes too much time. I'm running on a server with 4 12-core CPUs, 256 GB of memory, and 2 Tesla P40 GPUs (48 GB of GPU memory in total). However, one epoch takes approximately 40 seconds with only one worker (it takes less time without the GPU). Currently I'm optimizing the code to improve efficiency.

If a more efficient version of Decima is released, please tell me!

We would suggest training on a smaller problem first, e.g., try reducing num_stream_dags. We didn't optimize the implementation much at the time; we just trained the final model for a few days. Here is a trained model for comparison: #12.

Let us know if you find a more efficient implementation or figure out which parts of the code are the bottleneck. Feel free to submit pull requests too. Thanks a lot!
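If it helps to locate the bottleneck, one lightweight option is Python's built-in cProfile. In the sketch below, run_one_episode is a hypothetical placeholder for whatever function drives a single training epoch in your setup:

import cProfile
import pstats


def run_one_episode():
    # Hypothetical placeholder: call whatever performs one training epoch here,
    # e.g. one environment rollout plus one gradient update.
    pass


cProfile.run("run_one_episode()", "epoch.prof")
stats = pstats.Stats("epoch.prof")
stats.sort_stats("cumulative").print_stats(20)  # top 20 functions by cumulative time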