
Minimizing Communications


@e-kayrakli (Mentioning so you are aware of a new thread)

Lack of privatization has a dire impact not only on performance, but on development outright. I thought you said your distributions are automatically privatized, but I suppose I misheard you: the field itself is not privatized and incurs a communication cost, which caused performance to slow to a crawl (to the point that I honestly thought I had a deadlock, until I added a println showing it was working, just very slowly). It also turns out that config variables inside a module cannot be accessed without a communication cost. To fix this I have to expose the internal per-locale queue, because paying a remote access on every operation just is not viable. Privatization is the only way to make this viable and convenient. In fact, I'm not even sure I eliminated everything; the sketch below shows the kind of workaround I mean.
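Roughly, the workaround looks like this (getLocalQueue is a placeholder name for the accessor I exposed, not a final API):

// Instead of paying a GET on every call just to reach the queue's fields through `this`:
//   for i in 1..1000000 do queue.enqueue(i);
// grab a handle to this locale's internal queue once and operate on it locally.
var myQueue = queue.getLocalQueue();   // placeholder accessor exposing the per-locale queue
for i in 1..1000000 do myQueue.enqueue(i);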

Furthermore, I'd like to know whether it's possible that performance results vary based on how active the work queue is. I got some really good results, like the kind that makes you gleeful that you've found the perfect solution... only for it to drop back down to scales-worse-than-before-even-though-there-is-no-communication levels.

| NumLocales | Op/Sec   |
|-----------:|---------:|
| 1          | 1689350  |
| 2          | 3012000  |
| 4          | 6033020  |
| 8          | 11998700 |

The above is phenomenal and shows the scaling it should achieve, since each locale does its enqueue locally (note: this is enqueue only, so no communication is needed).

After attempting to get 16 locales, I waited for half an hour, then started at 8 again to double-check, and I got results as low as 744515... this is below the baseline! It makes no sense, but I'm assuming it's because the machine is busy. Is there a way to go by CPU time instead?

Speaking of which, have you ever tried the chplvis tool?

I have seen some unexplainable scalability issues with network synchronization in the past, so it is possible that it may be something internal.

I am not well-versed in chplvis. I recommend reading the extensive documentation about it.

In its place I use CommDiagnostics (which shares the same comm callback backbone as chplvis).

I definitely recommend either of these to understand your communication issues.
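For example, a minimal way to wrap a suspect operation in CommDiagnostics (queue.enqueue here is just a placeholder for whatever you are profiling):

use CommDiagnostics;

// Count GETs/PUTs/on-statements around the operation under suspicion.
startCommDiagnostics();
queue.enqueue(42);               // placeholder for the operation being profiled
stopCommDiagnostics();

// One diagnostics record per locale.
for (loc, diag) in zip(Locales, getCommDiagnostics()) do
  writeln(loc, ": ", diag);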

As for privatization:

use BlockDist;

class Foo {
  var domField = {1..10} dmapped Block({1..10}); // block-distributed domain field
  var arrField: [domField] int;                  // array declared over that domain
}

In this case instances of Foo will not be privatized. However, domField and arrField will be privatized. Otherwise, I would be inclined to say it is a bug or an unimplemented feature. That is very unlikely, though.
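To make the distinction concrete, roughly (building on the Foo class above; the comments describe expected behavior, not verified output):

var f = new Foo();
on Locales[numLocales-1] {
  // Reading the arrField field off of `f` needs a GET from the locale where `f`
  // was allocated, because instances of Foo are not privatized.
  // The Block domain/array machinery itself is privatized, though, so once the
  // field is in hand, locating element 1 needs no further communication
  // (the element write may still be a PUT if element 1 lives on another locale).
  f.arrField[1] = 10;
}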

> In this case instances of Foo will not be privatized. However, domField and arrField will be privatized.

Hence, accesses to the fields themselves, not their contents, would incur a communication cost, right? When I ran CommDiagnostics, it showed the communication at the same location every time: DistributedQueue.chpl:48.

Each access incurred a get, and after breaking the expression down into multiple assignments...

From...

descriptors[getLocaleDescriptorIndex()].localQueue.enqueue(elem);

To...

var idx = getLocaleDescriptorIndex();
var descr = descriptors[idx];        // CommDiagnostics attributes the GET to this line
var localQueue = descr.localQueue;
localQueue.enqueue(elem);

It showed that descriptors[idx] was the culprit. Just to make sure I'm following you correctly: is that what you mean by "instances of Foo will not be privatized"? This penalty was a major bottleneck.

It's also making me a bit suspicious of the bad performance of the old strictly-FIFO queue, as it was full of little communications like that (I'm thinking of bringing it back).

Sorry for not replying to this...

var descr = descriptors[idx];

Here, if descriptors is (or was) a field of a privatizable type such as a block-distributed array, the communication you are seeing is not from accessing the descriptor element but from reading the field off of this. I think that was my point. But maybe this is a moot discussion...

Hm, for the test we won't have more than one instance of it, right? What if we kept a static distribution that maintains the pointer offset to each field? It would require some compiler __primitive hacking, but I've already done something similar in Chapel for local pointers. Just to provide a more idealized environment (say, after class instance privatization is done). What say you?

Edit: Maybe not pointer offsets, but just cloning it statically. I realized that (this + offset) suffers from the same issue: the dereference.

That won't solve the problem. The compiler will still read this...

The only solution would be using a module-scope variable for descriptors instead of a field of a class. It breaks all the nice software design principles, though. One thing that may help you with that is Chapel's limited support for information hiding: I think you can have private module-level variables. (I may be mistaken and this may be limited to module functions.) A rough sketch is below.
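Something like this, roughly (DistributedQueueSketch, LocaleDescriptor, and localQueue below are placeholders for whatever your queue actually uses, and I am not certain that private works on module-level variables):

module DistributedQueueSketch {
  use CyclicDist;

  // Stand-ins for the real per-locale machinery.
  class LocalQueue {
    var count : atomic int;
    proc enqueue(elem : int) { count.add(1); } // placeholder for the real local enqueue
  }
  class LocaleDescriptor {
    var localQueue = new LocalQueue();
  }

  // Module-scope descriptors, one per locale, distributed so that each locale's
  // descriptor lives on that locale. Reaching this variable never dereferences a
  // (possibly remote) `this`, and a module-scope distributed array is privatized.
  private const descriptorSpace = LocaleSpace dmapped Cyclic(startIdx=0);
  private var descriptors : [descriptorSpace] LocaleDescriptor;

  proc initDescriptors() {
    coforall loc in Locales do on loc do
      descriptors[here.id] = new LocaleDescriptor();
  }

  class DistQueue {
    proc enqueue(elem : int) {
      descriptors[here.id].localQueue.enqueue(elem); // fully local on every locale
    }
  }
}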

To build on the above idea...

use CyclicDist;

var privatizeGuard$ : sync bool;
var privatizeDom = {1..numLocales};
// Can be resized... but for now monotonically increasing (never reuse an existing index)
var privatizePerQueueDom = {1..1};
var privatizeDomMap = privatizeDom dmapped Cyclic(startIdx=1);
var privatizeDomArr : [privatizeDomMap] [privatizePerQueueDom] ClonedFields;

Then when you add a new queue, you can grow privatizePerQueueDom and assign the pointers to the fields of interest, roughly as sketched below.
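The registration might look like this (registerQueue, ClonedFields, and the locking discipline are assumptions at this point, not settled design):

// Hypothetical registration of a new queue into the table above.
proc registerQueue(newQueue) : int {
  privatizeGuard$.writeEF(true);              // acquire: one registration at a time
  const newIdx = privatizePerQueueDom.high + 1;
  privatizePerQueueDom = {1..newIdx};         // grow monotonically; never reuse an index
  coforall loc in Locales do on loc {
    // Clone the fields of interest onto every locale so later accesses stay local.
    privatizeDomArr[here.id + 1][newIdx] = new ClonedFields(newQueue);
  }
  privatizeGuard$.readFE();                   // release
  return newIdx;
}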

I agree it breaks part of Chapel's design principles, but the current state of Chapel interferes with progress, and this would show the ideal performance of the queues (and be applicable to other data structures). At the very least, privatization would be given a higher priority if we could show, say, a 10-100x performance gain from eliminating any and all excess communication.

I think for now let's stick with what Chapel has to offer. Optimizations like this should come in the last stages of development.

At the same time you can still enforce privatization in an ugly way (but arguably nicer than this) for your queues:

Initialization: https://github.com/e-kayrakli/distributed_list/blob/master/DistributedUList.chpl#L62
Use: https://github.com/e-kayrakli/distributed_list/blob/master/DistributedUList.chpl#L323

Never mind, that requires DSI. I do see some interesting snippets of code inside it though; I'll make sure to study it for other optimizations. I guess I won't worry about it too much for now. One thing I can confirm, however, is that just by returning the local queues directly, performance (and scalability) increased a lot from removing one or two extra communications per operation.