SeungjunNah/DeepDeblur-PyTorch

Running Demo (MultiSaver) on Windows

mj9 opened this issue · 6 comments

mj9 commented

I spent many hours trying to get this to work under Windows. I managed to get it to work now, so this is probably useful to others.

Setup

The first obstacle is the readline Python package, which seems to be default on Unix systems, but not on Windows. For this, simply install the pyreadline package, which is a Windows port of readline.
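A quick way to check whether the module is importable before launching the program; this is only a sketch, and `readline_available` is a hypothetical helper name, not part of the project:

```python
import importlib.util
import sys


def readline_available():
    """Return True if a module named readline can be imported."""
    return importlib.util.find_spec('readline') is not None


if not readline_available() and sys.platform == 'win32':
    # pyreadline is the Windows port mentioned above
    print('readline is missing; try: pip install pyreadline')
```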

Understanding the command-line

Example command: python main.py --save_dir REDS_L1 --demo_input_dir d:/datasets/motion47set/noise_only --demo_output_dir ../results/motion47set.
Explanation: specifying --demo_input_dir (or --demo true) will run an evaluation, using a pretrained model as specified in --save_dir. Every image of my motion47set will be evaluated. The results will be saved alongside the folders src and experiments at the project root, in a folder results/motion47set.
Note that even getting this far is not very intuitive, as others have already pointed out. Usually there is a separate Python script just for evaluation/testing/inference. The term demo is also a bit unusual; at first I was expecting an interactive demonstration of some kind. I also initially used save_dir for what demo_output_dir actually does.
Another word of caution: if the output path is given without any ., the results somehow end up in d:/results/motion47set, i.e. at the root of the drive the project is located on, which again took me a while to figure out. I suggest printing the absolute output directory (via os.path.abspath) to the user at some point, for clarity.
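A minimal sketch of that suggestion, assuming the output directory arrives as a plain string; `resolve_output_dir` is a hypothetical helper name, not something from the project:

```python
import os


def resolve_output_dir(path):
    """Return the absolute form of path and tell the user where results go."""
    # abspath resolves relative paths against the current working directory
    absolute = os.path.abspath(path)
    print('results are saved in {}'.format(absolute))
    return absolute


resolve_output_dir('../results/motion47set')
```

Printing the resolved path makes it obvious when a relative path lands somewhere unexpected.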

Bug

Running the above command will produce the following output:

===> Loading demo dataset: Demo
Loading model from ../experiment\REDS_L1\models\model-200.pt
Loading optimizer from ../experiment\REDS_L1\optim\optim-200.pt
Loss function: 1*L1
Metrics: PSNR,SSIM
Loading loss record from ../experiment\REDS_L1\loss.pt
===> Initializing trainer
results are saved in ../results/motion47set
|                                                        | 0/90 [00:00<?, ?it/s]Can't pickle local object 'MultiSaver.begin_background.<locals>.t'
|██▏                                             | 4/90 [00:06<02:14,  1.56s/it]Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "d:\Program Files\Anaconda3\envs\torch gpu\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "d:\Program Files\Anaconda3\envs\torch gpu\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
|█████▋                                         | 11/90 [00:10<01:12,  1.09it/s]forrtl: error (200): program aborting due to control-C event

Also note that ctrl+c takes a really long time to terminate for me, and even slows down my entire machine for several seconds.

This is difficult to debug because there is no fatal exception: everything seems to run normally despite the errors, which for all we know might just be warnings. I did not realize for a while that MultiSaver is part of this project, which is why there is not much help online regarding this error/warning. Second, the only thing that gives a slightly stronger hint that this is an error and not a warning is the EOFError, and I still don't know why or where it happens. A large part of my debugging time went into assuming these were just warnings and trying to fix the command-line arguments instead, since those are easy to get wrong.

What is actually happening is that the MultiSaver code runs cleanly on the main thread, but each spawned thread/process then fails without the main thread being aware of it. As a result, the program runs through and attempts to save the output images, which does nothing since the threads/processes have already died. I'm not sure how to achieve this, but it would be nice if the program stopped running when it is unable to save output images (at least in demo mode, where that's about the only purpose).
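One possible way to get that behaviour, sketched under the assumption that the start-up error is currently caught and only printed (the one-line "Can't pickle local object" message without a traceback hints at this). `SaverStartupError` and `start_or_fail` are hypothetical names, not part of the project:

```python
class SaverStartupError(RuntimeError):
    """Raised when the background image writers could not be started."""


def start_or_fail(start_fn):
    """Call start_fn and turn any failure into a fatal, clearly labeled error."""
    try:
        start_fn()
    except Exception as e:
        # Re-raise instead of printing, so the run stops before any time
        # is spent producing images that can never be written.
        raise SaverStartupError('cannot start image writers: {}'.format(e)) from e
```

In demo mode this could be called as start_or_fail(saver.begin_background), where saver stands for a MultiSaver instance.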

The keywords to locate the actual issue here are pickle and multiprocessing. Going into utils.py and looking at the class MultiSaver shows us a method begin_background, with a method-local variable t (another method). Defining that method works, however (under Windows) that variable has to be pickled/serialized to hand it over to the mp.Process, which will run it in a different thread/process. This fails because pickle does not support local objects.
I tried various ways to change the scope of t:

  • put global t before the definition of t (no change)
  • move t to the outermost scope of the file utils.py, i.e. the same level as MultiSaver (the method can be pickled, but it later fails at a different point)
  • the solution that works is putting t directly inside the class body of MultiSaver and decorating it with @staticmethod. The decorator prevents the first parameter from being treated as self.
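The difference between the failing and the working variant can be reproduced with pickle alone, without starting any process. This is a sketch with hypothetical names (`Saver`, `make_local_worker`, `can_pickle`); it only illustrates the pickling rule and is not the project's code:

```python
import pickle


class Saver:
    @staticmethod
    def worker(queue):
        """Class-level function, reachable by its qualified name Saver.worker."""

    def make_local_worker(self):
        def t(queue):
            """Function-local, exists only inside this call."""
        return t


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == '__main__':
    print(can_pickle(Saver.worker))                 # class-level static method
    print(can_pickle(Saver().make_local_worker()))  # function defined inside a method
```

Functions are pickled by reference to their qualified name, so the receiving process has to be able to look that name up again. A name that contains <locals> cannot be looked up that way.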

So my modification looks like this

class MultiSaver():
...

    @staticmethod
    def t(queue):
        ...

    def begin_background(self):
        self.queue = mp.Queue()
        
        worker = lambda: mp.Process(target=MultiSaver.t, args=(self.queue,), daemon=False)
    ...
...

After this change, everything works as expected. I haven't tested it, but I suspect this will still work under Unix as well.
I'm not sure if this will work if multiple instances of MultiSaver are created, and maybe this would give the same result as putting t to the outermost scope, i.e. fail again.

Thanks for your comment!

May I ask you a question? I'm wondering if I can use the trained result like YOLO, e.g.

import python.darknet as dn

dn.set_gpu(0)
net = dn.load_net(str.encode("cfg/tiny-yolo.cfg"),
str.encode("weights/tiny-yolo.weights"), 0)
meta = dn.load_meta(str.encode("cfg/coco.data"))
r = dn.detect(net, meta, str.encode("data/dog.jpg"))
print(r)

@mj9 Thanks for the hard work and time spent. Your solution works like a charm on my Windows machine.

Hi. I have been trying to run the demo on Windows following your guideline and when I run the program I get this error:

===> Loading demo dataset: Demo
Loss function: 1*L1
Metrics: PSNR,SSIM
===> Initializing trainer
results are saved in ../results
|                                                      | 0/1000 [00:00<?, ?it/s]name 't' is not defined
|█▉                                           | 42/1000 [01:31<34:40,  2.17s/it]

The time increases in the console but I am not getting any results in the output folder, it only creates empty folders.
It seems that it is related to the t() function in the MultiSaver class. It is not working anymore following your guideline.
Can you look into this? Thank you

mj9 commented

The time increases in the console but I am not getting any results in the output folder, it only creates empty folders. It seems that it is related to the t() function in the MultiSaver class. It is not working anymore following your guideline. Can you look into this? Thank you

I don't think the code in this repository changed much, so I'm guessing my approach still works. Are you sure you implemented it correctly? You could post the code of your MultiSaver implementation

@mj9 Hi, can you share your utils.py with us? Please post its contents directly in this thread.
I've been struggling with this problem on Windows 10 for days. Thank you.
My code looks like this:

class MultiSaver():
    def __init__(self, result_dir=None):
        self.queue = None
        self.process = None
        self.result_dir = result_dir

    def begin_background(self):
        self.queue = mp.Queue()

        @staticmethod
        def t(queue):
            while True:
                if queue.empty():
                    continue
                img, name = queue.get()
                if name:
                    try:
                        basename, ext = os.path.splitext(name)
                        if ext != '.png':
                            name = '{}.png'.format(basename)
                        imageio.imwrite(name, img)
                    except Exception as e:
                        print(e)
                else:
                    return

        worker = lambda: mp.Process(target=MultiSaver.t, args=(self.queue,), daemon=False)
        cpu_count = min(8, mp.cpu_count() - 1)
        self.process = [worker() for _ in range(cpu_count)]
        for p in self.process:
            p.start()

    def end_background(self):
        if self.queue is None:
            return

        for _ in self.process:
            self.queue.put((None, None))

    def join_background(self):
        if self.queue is None:
            return

        while not self.queue.empty():
            time.sleep(0.5)

        for p in self.process:
            p.join()

        self.queue = None

    def save_image(self, output, save_names, result_dir=None):
        result_dir = result_dir if self.result_dir is None else self.result_dir
        if result_dir is None:
            raise Exception('no result dir specified!')

        if self.queue is None:
            try:
                self.begin_background()
            except Exception as e:
                print(e)
                return

        # assume NCHW format
        if output.ndim == 2:
            output = output.expand([1, 1] + list(output.shape))
        elif output.ndim == 3:
            output = output.expand([1] + list(output.shape))

        for output_img, save_name in zip(output, save_names):
            # assume image range [0, 255]
            output_img = output_img.add_(0.5).clamp_(0, 255).permute(1, 2, 0).to('cpu', torch.uint8).numpy()

            save_name = os.path.join(result_dir, save_name)
            save_dir = os.path.dirname(save_name)
            os.makedirs(save_dir, exist_ok=True)

            self.queue.put((output_img, save_name))

        return

And my console output is:
[screenshot of the console output, not preserved]

mj9 commented

@mj9 Hi, Can you share your utils.py with us?

I don't have access to the code currently, but note how your method t is inside another function. You have to put it directly under MultiSaver
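A sketch of that placement, reusing the worker body from the code posted above. It is untested on Windows here, and two details are assumptions that differ from the posted code: the worker blocks on queue.get() instead of polling queue.empty(), and at least one worker is always started. The remaining methods (end_background, join_background, save_image) would stay as posted.

```python
import multiprocessing as mp
import os


class MultiSaver:
    def __init__(self, result_dir=None):
        self.queue = None
        self.process = None
        self.result_dir = result_dir

    # Defined in the class body, not inside begin_background, so that
    # the target can be found by name (MultiSaver.t) when it is pickled.
    @staticmethod
    def t(queue):
        while True:
            img, name = queue.get()  # assumption: block instead of polling
            if not name:
                return  # (None, None) is the stop signal
            try:
                import imageio  # third-party; only needed once an image arrives
                basename, ext = os.path.splitext(name)
                if ext != '.png':
                    name = '{}.png'.format(basename)
                imageio.imwrite(name, img)
            except Exception as e:
                print(e)

    def begin_background(self):
        self.queue = mp.Queue()
        # assumption: never drop to zero workers on a single-core machine
        cpu_count = max(1, min(8, mp.cpu_count() - 1))
        self.process = [
            mp.Process(target=MultiSaver.t, args=(self.queue,), daemon=False)
            for _ in range(cpu_count)
        ]
        for p in self.process:
            p.start()
```

Note that @staticmethod only has an effect in a class body; applied to a function nested inside begin_background, it does not make that function an attribute of MultiSaver.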