Step size is multiplicative

Question

okomarov opened this issue 8 years ago · 7 comments

There is a problem with the step size:

N   = 13000;
ppm = ParforProgMon('some title   ', N, floor(N/100),500, 55);
parfor f = 1:N
    pause(0.01)
    ppm.increment();
end

and it keeps going until it reaches 13000%.

Answer 1 · 2017-02-21T12:12:18.000Z

Thanks for the bug report, @okomarov. Now fixed in the master branch.

Answer 2 · 2017-02-21T20:26:20.000Z

It doesn't seem to be fixed. Moreover:

Either the Java or the Matlab implementation should take care of that, not the user in the parfor loop.
I don't get the step size < 0; when would you want to use that?
Also, I find that pctRunOnAll() works better than attaching the file.

If interested, you can check my master repo for the fixes to the above.

Answer 3 · 2017-02-22T07:36:10.000Z

Hi @okomarov,

It now works for me; the progress window disappears after 100%. What issue do you still have?
The client should take care of the step size. The reason is that every call to ppm.increment() makes a network request. Your example makes 13000 network requests in very rapid succession, which may or may not be what a user wants if the workers are not local.
The description of the stepSize parameter wasn't great in the documentation. I've now made it clearer that it's the progress added to the bar on each call to ppm.increment().
I don't understand your comment about step size < 0. Where is it suggested to use a step size < 0?
pctRunOnAll() doesn't work at all. What if you don't have login access to the worker, so you can't install parforProgMon? What if it's on the machine, but in a different directory than on the server machine?
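
To make the stepSize semantics described above concrete, here is a minimal sketch in plain MATLAB that models only the counting, without ParforProgMon or a parallel pool. It assumes, as stated above, that each call adds stepSize to the bar. The variable names and the mod() guard are illustrative assumptions, not code taken from the library.

```matlab
% Model of the progress count only; no ParforProgMon, no parallel pool.
N        = 13000;
stepSize = floor(N/100);            % 130, as in the original report

% Case 1: one call per iteration, each call adds stepSize to the bar.
progressEveryIter = 0;
callsEveryIter    = 0;              % stands in for the number of network requests
for f = 1:N
    progressEveryIter = progressEveryIter + stepSize;
    callsEveryIter    = callsEveryIter + 1;
end
pctEveryIter = 100 * progressEveryIter / N;

% Case 2 (assumed client-side guard): call only on every stepSize-th iteration.
progressGuarded = 0;
callsGuarded    = 0;
for f = 1:N
    if mod(f, stepSize) == 0
        progressGuarded = progressGuarded + stepSize;
        callsGuarded    = callsGuarded + 1;
    end
end
pctGuarded = 100 * progressGuarded / N;

fprintf('every iteration: %d calls, %g%%\n', callsEveryIter, pctEveryIter);
fprintf('guarded:         %d calls, %g%%\n', callsGuarded, pctGuarded);
```

In this model the first case makes 13000 calls and ends at 13000%, which matches the report, while the guarded case makes 100 calls and ends at 100%.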

Answer 4 · 2017-02-22T14:47:34.000Z

To follow up:

It closes the window but the pool is still working. Try adding a pause(0.1) to see it more clearly. In principle, this newVal is still multiplied by the step size, which does not make sense.
The pool does not process indices in order, so you can get processing at 300 and immediately afterward at 400. It is plain wrong. You can see that by putting a disp(f) inside the parfor.
pctRunOnAll() works very well on my local pool, and it is the solution adopted by versions 1 and 2 of the same project. I do not have the means to test a pool on a network, although I see no reason why communication between workers should be barred. How would the parfor run in the first place then?

Answer 5 · 2017-02-22T14:55:10.000Z

As I said, I updated the documentation to make the use of `stepSize` more clear. It indicates that a single call to `ppm.increment()` should increment the progress by `stepSize`; not that the progress bar should wait for `stepSize` steps before updating. You’re correct that the order may end up being incorrect. So two increments could be called with an indeterminate number of steps in between. You have the option of using a `stepSize` of 1, which won’t have that problem, with the downside of increased network bandwidth. I know that `pctRunOnAll` works on a local pool. So does my solution, which also works on distributed pools. The issue is that `pctRunOnAll` assumes that the absolute path to the java classes is identical on the worker and client machines. That is certainly not guaranteed for a distributed pool; the worker machines may not even be running the same OS.

Answer 6 · 2017-02-22T15:03:44.000Z

The docs are still quite unclear.

I might see what you mean now, but honestly, you could explain that better. Still, since there is no guarantee of order, the bar can reach its final value and close while the pool is still working.

An example:

4 workers for 10,000 iterations. Usually Matlab splits the indices into 4 sequential groups, i.e. worker 1 will have indices 1:2500, worker 2 indices 2501:5000, and so forth. Moreover, they usually start working in reverse!

A disp(f) inside the parfor:

parfor f = 1:10000
    pause(0.1)
    disp(f)
end

should show something like

Now, if your bar is terminated when value == bar.maximum(), and it's counting in reverse, you can just get a random close when in fact there are still thousands of iterations to go. This happens in my case.

There is no way to get a correct count without the network handshakes.
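
The early close can be reproduced in a simplified model. The sketch below is plain MATLAB with no pool and no ParforProgMon. It assumes 4 workers that each own a contiguous block and advance in lockstep, a stepSize of 100, an increment only when mod(f, stepSize) == 0, and a bar that closes once the accumulated progress reaches N. All of these are illustrative assumptions: real parfor scheduling is not guaranteed to look like this, and the library may compute the bar value differently.

```matlab
% Simplified model: 4 workers, contiguous blocks, advancing in lockstep.
N        = 10000;
nWorkers = 4;
stepSize = 100;                     % hypothetical step size for this model
blockLen = N / nWorkers;            % 2500 indices per worker

remainingAtClose = zeros(1, 2);     % [forward order, reverse order]
for order = 1:2
    progress = 0;                   % value shown on the modelled bar
    done     = 0;                   % iterations actually finished
    closed   = false;
    for t = 1:blockLen              % t-th iteration of every worker
        for w = 1:nWorkers
            if order == 1
                f = (w-1)*blockLen + t;     % worker counts up through its block
            else
                f = w*blockLen - t + 1;     % worker counts down through its block
            end
            done = done + 1;
            if mod(f, stepSize) == 0
                progress = progress + stepSize;
            end
            if ~closed && progress >= N
                closed = true;              % the modelled bar closes here
                remainingAtClose(order) = N - done;
            end
        end
    end
end

fprintf('forward order: %d iterations left at close\n', remainingAtClose(1));
fprintf('reverse order: %d iterations left at close\n', remainingAtClose(2));
```

Under these assumptions the forward order closes with nothing left, while the reverse order closes with 396 iterations still to run, because each worker hits its last multiple of stepSize 99 iterations before the end of its block. The size of the gap in a real pool depends on the actual scheduling, so this only illustrates the mechanism.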

Answer 7 · 2017-02-22T15:21:05.000Z

You’re right. If it’s a problem to have an early close, use a `stepSize` of 1 and pay the network cost.