NVlabs/timeloop

Convergence time

tanvisharma opened this issue · 19 comments

I am opening this issue to ask whether it is normal for Timeloop to run for many hours, sometimes seemingly forever (more than 90 hours in one case). I could see that two threads had not terminated their search, so I sent a SIGINT signal. It took around 20-30 minutes for the run to move ahead and dump the output files. I tried this on multiple machines, so it should not be a hardware issue.

I understand that changing the mapping and constraint files would affect the search time, but creating a separate constraint file for each problem space sounds cumbersome. I could also track the live status of the threads, but then the run takes longer to finish because of the print statements. Is there a workaround other than SIGINT when it does not terminate for hours on end? Has anyone else faced this issue? How can I track down the root cause?

Unconstrained mapspaces can be insanely large, so it's certainly possible for the mapper to be running for very long durations. But before we start discussing constraints, we should rule out other issues. The fact that quiescing the ongoing evaluations after the SIGINT took 20-30 minutes is unusual.

A few questions:

  • How many levels are there in your hardware topology, and what's the parallelism at each level?
  • Could you double-check that the environment variables to disable extrapolation are not set?
  • How many threads are you running in the mapper?

And a suggestion:

  • Try running timeloop-model on the opt-mapping that the mapper discovered during your prior run. It should have dumped out a timeloop-mapper.map.yaml file; simply feed that to timeloop-model along with the same arch and problem YAMLs. The runtime should tell you whether there's some weird bug or whether modeling a single mapping is indeed taking 20-30 minutes.

Thank you for the suggestion! timeloop-model is very fast and takes only a few seconds to finish for the other topology, but for the one where I observed a long convergence time, it took 24m32.614s. Why would the model take such a long time to do the energy evaluation?

To answer your questions,

  • I have multiple hardware topologies. One of them has 5 memory levels with parallelism on the order of 1, 2^7, 2^2 and 2^16 respectively (the 4th level is a dummy). The last memory level is associated with an equal number of MAC units, with meshX and meshY on the order of 2^15 and 2^10.
  • I am sorry but I did not understand your question. What are the environment variables that disable extrapolation?
  • I am running 16 threads. I was not able to get results initially, so I set a very high timeout value. Could the long runtime be because of that?
    mapper:
      optimization-metrics: [ delay, energy ]
      live-status: False
      num-threads: 16
      timeout: 1200000
      victory-condition: 1000
      algorithm: [linear-pruned]
      max-permutations-per-if-visit: 100
      diagnostics: False

I had tried to put in as many constraints as possible, and that is probably why it is not able to finish the search when running timeloop-mapper. For most of the problem shapes it runs fine, but for M=1 it might not have been able to find enough mappings that satisfy the constraints. To give an idea of the search space, this is what timeloop prints on the terminal:
[screenshot of the timeloop-mapper terminal output showing the mapspace statistics]

On a side note (other users in my lab have observed this as well), timeloop-model cannot read the timeloop-mapper.map.yaml file if it is not renamed. For example, if I rename the same mapping to map.yaml and run timeloop-model, it runs fine; otherwise it does not recognize the mapping file and gives a key-not-found error.

With those sizes (2^25) I'm not surprised at the runtimes. The bottleneck here is the multicast analysis, which scales quadratically with instance count. I'm sorry -- I don't have an easy solution right now. We are in the process of upgrading Timeloop's internal nest analysis to use a polyhedral library, which may or may not improve things. Maybe it's possible to do an approximate multicast analysis but I'll have to think deeper about its feasibility. Could you perhaps model a smaller hardware slice and then extrapolate the numbers to your final size?

Incidentally, I don't recommend setting num-threads manually unless you have a specific intent in mind. Just leave it unset and the mapper will spawn as many threads as the number of available hardware contexts on the host machine.

timeloop-model ignoring timeloop-mapper.map.yaml is intentional. The idea is that if you're running both mapper and model experiments in the same directory then we don't want the model to accidentally pick up the mapping YAML emitted by the mapper and potentially override a user-specified mapping. It could lead to a ton of user confusion.

Ahh, I see. I was trying to simulate the whole system because I was not sure how the bandwidth of DRAM and the other levels gets divided among the individual instances. Even if I do that, it would eliminate the multicast mappings from the search space, which may or may not have resulted in a more energy-efficient mapping. Do you have any suggestions for addressing this issue?

This was a great help, though; I now understand why it is taking such a long time. Thanks for the tips on threads as well.

I appreciate your explanation about timeloop-model ignoring timeloop-mapper.map.yaml. I can see how this approach helps avoid confusion. However, I was wondering how the model would accidentally pick up the mapping YAML if the user has not specified it.

timeloop-model *.yaml

Been there, done that :-)

Ahh, I see. I was trying to simulate the whole system because I was not sure how the bandwidth of DRAM and the other levels gets divided among the individual instances. Even if I do that, it would eliminate the multicast mappings from the search space, which may or may not have resulted in a more energy-efficient mapping. Do you have any suggestions for addressing this issue?

You're absolutely right. You may be able to look at the emitted stats (access counts, tile sizes, etc.) from a smaller run and do some post-processing to figure out how the larger run would have behaved. BTW, if you scale your hardware down for the run, I recommend scaling the benchmark down proportionately as well, because otherwise Timeloop will do a spatial->temporal transposition of some tiles, which may not fit in your buffers.

Constraining the mapspace will also help. It won't change your 24-minute eval cost per mapping, but it will reduce the total number of mappings in the mapspace, increasing the likelihood that the mapper finds something good within a given set of hyperparameters. As an example, if you are running uncompressed (i.e., dense) workloads then a spatial dot-product unit at the innermost level wrapped in a weight-stationary temporal dataflow (as in NVDLA and Simba) is almost universally recognized as a solid building block.

And lastly, any ideas you have on reducing the multicast computation cost (the code is all in a single function), e.g., by sampling and interpolation, may be helpful and in fact could be a solid contribution. The caveat here is that, as I said earlier, we are moving to a polyhedral-based NestAnalysis module, which is going to shake things up, so perhaps it's best to wait until the move happens before spending time on this.

Thank you for the detailed comment. I agree that moving to a smaller run and extrapolating would be useful, hopefully without much loss of generality. Constraining the mapspace more seems like a good plan too.

Regarding the reduction of multicast computation cost, like you said, maybe it's better to wait for the Polyhedral-based NestAnalysis module before diving in. I'll definitely keep an eye on the progress and look forward to checking out possible improvements when the new module is ready. A quick question before closing this: Will this move change the flow of Timeloop infrastructure substantially or only the way it interacts with the NestAnalysis module?

We intend to keep everything fully backwards compatible.

The polyhedral NestAnalysis will set the stage for new capabilities (einsum graphs and fused mappings) that we will add progressively, but existing configurations will continue to work as-is.

In that case, if I contribute a reduction of the multicast computation cost to the current code, it should work for the newer version as well.
I will switch to the lighter architecture version for now, but will get back to you to discuss contributions after a month or so.

Thanks again!

Hi, I am hoping to pick up the multicast issue this week and see if there is a way to reduce the overhead. I wanted to check first whether the switch to the polyhedral NestAnalysis library is going to happen soon. Although there have been multiple commits over the summer, I am not sure whether they change the NestAnalysis implementation. @angshuman-parashar

It’s in progress as we speak. There’s an open PR already (you can check out the code). We’re pushing through some prior features before we can pull in the PR, as well as ironing out some corner-case issues with the polyhedral code.

Thank you for directing me to the open PR. Looking forward to the completion of fused-layer analysis integration.

TL;DR: The runtimes I mentioned earlier were measured with an -O0 (unoptimized) build. Apologies for that.

Since I did not see direct changes to the ComputeAccurateMulticastedAccesses function in the open PR, I profiled one of my long-running configurations with the old code. While digging into the function, I realized I had made a silly mistake while capturing the runtime: I had built the model with the --d option, and when I changed it, the runtime numbers improved drastically with -O3 optimization (example below for reference). The total time is still dominated by the time spent in the multicast function, but the overall time is no longer large. Do you still think it will be a significant contribution to look into reducing the multicast computation cost?

-O0 build output

		Time taken for ComputeAccurateMulticastedAccesses: 105.445 s
		Time taken for ComputeAccurateMulticastedAccesses: 111.99 s
		Time taken for ComputeAccurateMulticastedAccesses: 33.1802 s
		Time taken for ComputeAccurateMulticastedAccesses: 35.9836 s
		Time taken for ComputeAccurateMulticastedAccesses: 119.602 s
		Time taken for ComputeAccurateMulticastedAccesses: 125.544 s
		Time taken for ComputeAccurateMulticastedAccesses: 38.1954 s
		Time taken for ComputeAccurateMulticastedAccesses: 39.3478 s
		
		real    10m14.852s
		user    10m14.523s
		sys     0m0.199s

-O3 build output

		Time taken for ComputeAccurateMulticastedAccesses: 15.0611 s
		Time taken for ComputeAccurateMulticastedAccesses: 17.888 s
		Time taken for ComputeAccurateMulticastedAccesses: 4.45297 s
		Time taken for ComputeAccurateMulticastedAccesses: 5.83293 s
		Time taken for ComputeAccurateMulticastedAccesses: 20.6225 s
		Time taken for ComputeAccurateMulticastedAccesses: 23.3771 s
		Time taken for ComputeAccurateMulticastedAccesses: 6.75814 s
		Time taken for ComputeAccurateMulticastedAccesses: 7.75921 s
		
		real    1m43.320s
		user    1m42.788s
		sys     0m0.130s

Hi Tanvi, thanks very much for the update.

Since I did not see direct changes to the ComputeAccurateMulticastedAccesses function in the open PR,

I think this may just be dormant code that @gilbertmike has temporarily left behind for debugging/backwards-compatibility purposes. I believe the polyhedral path uses a completely different approach.

Do you still think it will be a significant contribution to look into reducing the multicast computation cost?

At this very moment I don't think it makes sense to work on it, because (1) it doesn't seem to be a major bottleneck for you any more and (2) we are cautiously betting the bank on the polyhedral path. However, we are seeing some corner cases with complex mappings where the polyhedral analysis is slowing down. So I'm not completely trashing the legacy path immediately, and if it turns out we need to use that path for some corner cases, optimizing multicast analysis may once again become a priority. But it's premature to do that right now.

Hello @angshuman-parashar, you mentioned that the multicast computation scales quadratically with the number of PEs, but are there any known properties of mappings that affect the computation cost of ComputeAccurateMulticastedAccesses for a constant number of PEs? As I understand it, it is a recursive routine that works on all the factors of a mapping, so possibly the number of distinct factors in a mapping could be a reason? I realize that the combinatorial mapping spaces can be incredibly large, and it may be hard to point out one such property, or even a set of them.

I suppose the computation cost could be different based on mapping properties.

Looking at the method's implementation in v3.0.2, the critical code is between lines 1529 and 1579 (just 51 lines of code!). The logic looks at the temporal delta at each spatial ID and compares it to the deltas at all other IDs to find "match sets". As an extreme example, if the mapping expresses a full broadcast, the algorithm should complete in O(2N). At the other extreme, i.e., when the mapping expresses a full distribution (scatter), it will end up performing N + (N-1) + (N-2) + ... comparisons, which is O(N^2).
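
For concreteness, here is a minimal, self-contained C++ sketch of the comparison pattern described above. It is not the Timeloop code: the Delta type, the function name FindMatchSets, and the driver in main() are invented for illustration, and in Timeloop the deltas are point sets produced by the nest analysis rather than plain integer sets.

    // Simplified illustration of the pairwise match-set construction described
    // above -- not the actual Timeloop implementation.
    #include <cstddef>
    #include <iostream>
    #include <set>
    #include <vector>

    using Delta = std::set<int>;  // stand-in for a temporal delta at a spatial ID

    // Each not-yet-matched spatial ID is compared against every other
    // not-yet-matched ID. A full broadcast (all deltas equal) finishes in one
    // pass (~2N operations), while a full scatter (all deltas distinct) performs
    // N + (N-1) + (N-2) + ... comparisons, i.e. O(N^2).
    std::vector<std::vector<std::size_t>> FindMatchSets(const std::vector<Delta>& deltas)
    {
      std::vector<std::vector<std::size_t>> match_sets;
      std::vector<bool> matched(deltas.size(), false);
      for (std::size_t i = 0; i < deltas.size(); i++)
      {
        if (matched[i]) continue;
        std::vector<std::size_t> match_set = { i };
        for (std::size_t j = i + 1; j < deltas.size(); j++)
        {
          if (!matched[j] && deltas[j] == deltas[i])
          {
            match_set.push_back(j);
            matched[j] = true;
          }
        }
        match_sets.push_back(match_set);
      }
      return match_sets;
    }

    int main()
    {
      // 4 spatial instances: IDs 0 and 2 request the same data (multicast
      // factor 2), while IDs 1 and 3 each request distinct data.
      std::vector<Delta> deltas = { {0, 1}, {2, 3}, {0, 1}, {4, 5} };
      for (const auto& ms : FindMatchSets(deltas))
        std::cout << "match set of size " << ms.size() << "\n";
      return 0;
    }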

To understand the general case I don't think you need to think about the insanely large mapspace. Just think about these delta-sets at each spatial ID. Say there are M match sets, and the number of identical delta-sets in bucket m is K_m, such that the sum of K_m over all M buckets equals N, the total number of spatial IDs. What is the time complexity in this general case?
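
One way to count the comparisons in this general case, assuming (as in the sketch above) that each match set's representative is compared against every spatial ID not yet assigned to an earlier set, is

$$
T \approx \sum_{m=1}^{M} \left( N - \sum_{j<m} K_j \right) \le M \cdot N,
$$

which recovers both extremes: M = 1 with K_1 = N (broadcast) gives O(N), and M = N with every K_m = 1 (scatter) gives N + (N-1) + ... + 1 = O(N^2).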

I think the algorithm is pretty wasteful and there should be a variety of ways to do this in O(N log N). E.g., a dumb way could be to just sort the deltas (we'll need to define a cheap less-than operator) and scan them linearly. In addition to the algorithmic inefficiency, the implementation also uses potentially expensive STL map calls, so there may be some low-hanging fruit here in terms of optimization. The fact that it's just 51 lines of self-contained code is promising.
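
As a rough sketch of that sort-and-scan idea (again hypothetical code, not the Timeloop implementation; the Delta type and names carry over from the sketch above), the same match sets can be recovered with O(N log N) comparisons by sorting (delta, spatial-ID) pairs and grouping equal neighbours in a single linear scan:

    // Sort-and-scan alternative: O(N log N) comparisons instead of O(N^2).
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <set>
    #include <utility>
    #include <vector>

    using Delta = std::set<int>;

    std::vector<std::vector<std::size_t>> FindMatchSetsSorted(const std::vector<Delta>& deltas)
    {
      // Pair each delta with its spatial ID so the IDs survive the sort.
      std::vector<std::pair<Delta, std::size_t>> keyed;
      for (std::size_t i = 0; i < deltas.size(); i++)
        keyed.emplace_back(deltas[i], i);

      // std::set already provides a lexicographic operator<, which serves as the
      // "cheap less-than operator"; a real implementation might hash instead.
      std::sort(keyed.begin(), keyed.end());

      // Linear scan: consecutive equal deltas belong to the same match set.
      std::vector<std::vector<std::size_t>> match_sets;
      for (std::size_t i = 0; i < keyed.size(); i++)
      {
        if (i == 0 || keyed[i].first != keyed[i - 1].first)
          match_sets.push_back({});
        match_sets.back().push_back(keyed[i].second);
      }
      return match_sets;
    }

    int main()
    {
      std::vector<Delta> deltas = { {0, 1}, {2, 3}, {0, 1}, {4, 5} };
      std::cout << FindMatchSetsSorted(deltas).size() << " match sets\n";  // prints "3 match sets"
      return 0;
    }

Note that with coordinate-set deltas the less-than and equality checks are not constant-time, so the practical win also depends on how cheaply the deltas can be compared or hashed.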

All that said, the master branch now has the polyhedral codebase which doesn't use any of this. So while it's worth spending a couple of days' worth of time for a v3.0.3 checkpoint, I don't think it's worth spending any more than that.

Thanks @angshuman-parashar. A few more follow-ups, although I understand that the migration to the newer analysis model makes these questions less relevant.

So first each spatial instance's delta would be computed, and then each spatial instance's deltas are compared to every other instance's deltas, and identical deltas across two or more instances are referred to as "match sets"? In that case, am I correctly understanding that one property that could define the computational expense is the multicast factor for each operand? A large multicast factor would mean fewer match-set comparisons, and vice versa? But there might be other properties which could have an additional impact, such as which loops are assigned at temporal levels and how they relate to the spatial loops, correct?

Also, are there any estimates (or even better, any measurements) that show the possible computational expense reduction due to the newer polyhedral model? If so, how much? Orders of magnitude?

So first each spatial instance's delta would be computed, and then each spatial instance's deltas are compared to every other instance's deltas, and identical deltas across two or more instances are referred to as "match sets"? In that case, am I correctly understanding that one property that could define the computational expense is the multicast factor for each operand? A large multicast factor would mean fewer match-set comparisons, and vice versa?

Correct.

But there might be other properties which could have an additional impact, such as which loops are assigned at temporal levels and how they relate to the spatial loops, correct?

Yes, temporal loops can also have an impact because the deltas that we're comparing across spatial instances are actually temporal deltas at each instance.

Also, are there any estimates (or even better, any measurements) that show the possible computational expense reduction due to the newer polyhedral model? If so, how much? Orders of magnitude?

This is difficult. A lot of the set-computation logic now uses algorithms implemented in the ISL codebase. ISL is open source and the theory is well documented, but you would need a deep expert in polyhedral compilation to determine the complexity. We are merely end users. I should note that our motivation for switching to the polyhedral analysis was more about robustness and future-proofing than speed (which I believe is slightly worse in some situations with the new analysis code).

Thanks @angshuman-parashar. What I gather is that multicast factors can only explain part of the computational expense; we need to look at the rest of the mapping too.

Also, thanks for clarifying; my assumption was that the polyhedral model was adopted in part to improve performance.