rwightman/efficientdet-pytorch

CocoEvaluator fails when two training jobs are run at the same time

andravin opened this issue · 0 comments

Describe the bug

If two training jobs are run at the same time, they will eventually attempt to evaluate results simultaneously. This causes a crash because CocoEvaluator uses a hard-coded temporary file name, ./temp.json.

Writing ./temp.json would also fail if the user launched the training script from a read-only filesystem.
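As an aside, Python's tempfile module already handles the writable-location concern: it picks its directory from the TMPDIR/TEMP/TMP environment variables before falling back to platform defaults, so a user running from a read-only working directory can redirect temp files without code changes. A minimal sketch (the /tmp path is just an example of a writable location):

```python
import os
import tempfile

# tempfile consults TMPDIR, then TEMP, then TMP, then platform defaults,
# so pointing TMPDIR at any writable directory is enough.
os.environ['TMPDIR'] = '/tmp'   # hypothetical writable location
tempfile.tempdir = None         # clear the cached choice so gettempdir() re-scans
print(tempfile.gettempdir())
```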

To Reproduce
Steps to reproduce the behavior:

  1. Start two training jobs on the same machine in the same directory.
  2. Wait
  3. Crash

Expected behavior
No crash.

A simple fix is to use a unique temporary file per process, so concurrent jobs never conflict. Here is a patch:

From 6ff05c5028657a84b89a86e548258bc9a94bbf74 Mon Sep 17 00:00:00 2001
From: Andrew Lavin <andrew@subdivision.ai>
Date: Sat, 23 Oct 2021 10:34:03 -0700
Subject: [PATCH] Modified CocoEvaluator to dump coco predictions to a unique
 temporary file.

---
 effdet/evaluator.py | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/effdet/evaluator.py b/effdet/evaluator.py
index b923655..366b4e4 100644
--- a/effdet/evaluator.py
+++ b/effdet/evaluator.py
@@ -8,6 +8,8 @@ import numpy as np
 
 from .distributed import synchronize, is_main_process, all_gather_container
 from pycocotools.cocoeval import COCOeval
+from tempfile import NamedTemporaryFile
+import os
 
 # FIXME experimenting with speedups for OpenImages eval, it's slow
 #import pyximport; py_importer, pyx_importer = pyximport.install(pyimport=True)
@@ -100,8 +102,10 @@ class CocoEvaluator(Evaluator):
         if not self.distributed or dist.get_rank() == 0:
             assert len(self.predictions)
             coco_predictions, coco_ids = self._coco_predictions()
-            json.dump(coco_predictions, open('./temp.json', 'w'), indent=4)
-            results = self.coco_api.loadRes('./temp.json')
+            with NamedTemporaryFile(prefix='coco_', suffix='.json', delete=False, mode='w') as tmpfile:
+                json.dump(coco_predictions, tmpfile, indent=4)
+            results = self.coco_api.loadRes(tmpfile.name)
+            os.unlink(tmpfile.name)
             coco_eval = COCOeval(self.coco_api, results, 'bbox')
             coco_eval.params.imgIds = coco_ids  # score only ids we've used
             coco_eval.evaluate()
-- 
2.17.1
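The write-then-read pattern in the patch can be demonstrated in isolation (a minimal sketch with made-up prediction data, independent of effdet; delete=False keeps the file on disk after the context manager closes and flushes it, so a separate consumer can open it by name before we unlink it):

```python
import json
import os
from tempfile import NamedTemporaryFile

# Hypothetical stand-in for the COCO-format predictions the evaluator dumps.
predictions = [{"image_id": 1, "category_id": 3, "bbox": [0, 0, 10, 10], "score": 0.9}]

# NamedTemporaryFile generates a unique name per call, so two processes
# writing at the same time can never collide on the same path.
with NamedTemporaryFile(prefix='coco_', suffix='.json', delete=False, mode='w') as tmpfile:
    json.dump(predictions, tmpfile, indent=4)

# Re-open by name, as coco_api.loadRes() would.
with open(tmpfile.name) as f:
    loaded = json.load(f)
assert loaded == predictions

os.unlink(tmpfile.name)  # clean up once the consumer has read the file
```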