mlcommons/power-dev

Allow fixed range setting for high power devices, optionally remove ranging mode

Closed this issue · 16 comments

Edit:
Since there has been a lot of confusion, I'm rewriting the proposal.

Current power workflow methodology

  1. Do a ranging-mode run in which the analyzer's current range is adjusted automatically and the peak current drawn during the benchmark is determined.
  2. Do a testing-mode run using the peak current range determined by the ranging-mode run.
  3. Because of the ranging-mode run, the total run time for a power run is double that of a non-power run (a minimal sketch of this flow is given below).
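
For illustration, here is a minimal Python sketch of this two-phase flow. The run_benchmark helper is a hypothetical placeholder, not the actual power-dev API.

```python
# A minimal sketch of the current ranging + testing flow described above.
# run_benchmark() is a hypothetical placeholder, not part of the actual
# power-dev tooling.

def run_benchmark(name: str, analyzer_range) -> float:
    """Run the workload once with the analyzer at the given range
    ('auto' or a fixed value in amps) and return the peak current seen."""
    raise NotImplementedError  # stand-in for the real measurement harness


def power_run_with_ranging(name: str) -> float:
    # Phase 1 (ranging): the analyzer auto-ranges; record the peak current.
    peak_amps = run_benchmark(name, analyzer_range="auto")
    # Phase 2 (testing): rerun with the range fixed to the observed peak.
    # The workload executes twice, hence the ~2x total runtime.
    return run_benchmark(name, analyzer_range=peak_amps)
```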

New proposal

  1. Allow a fixed current-range setting (Section 2.6 in this SPEC document) for the testing-mode run, without a ranging-mode run. This is optional; submitters who prefer the ranging-mode run can still do it.
  2. If the user sets a wrong range, PTDaemon flags it in the uncertainty samples and the run is automatically invalidated (a sketch of this check follows after this list).
  3. This methodology is recommended by SPEC (Section 2.6 in this SPEC document); Section 2.5.2 of SPEC Power also says to check the uncertainty levels.
  4. Due to a bug in PTDaemon, this flagging mechanism does not work correctly for low-power devices (<75 W), so it can only be used for devices above 75 W.
  5. This makes power runs take the same time as non-power runs for systems with a known current range (99% of submitted systems).
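
A minimal sketch of the validity check implied by point 2, assuming a 1% uncertainty threshold purely for illustration (the function name and threshold are hypothetical, not the official checker logic):

```python
# Sketch of the proposed check: a fixed-range testing run is accepted only if
# no uncertainty sample reported by PTDaemon exceeds the limit. The 1%
# threshold below is illustrative, not the official value.

def is_valid_power_run(uncertainty_samples: list[float],
                       max_uncertainty: float = 0.01) -> bool:
    """Return True if every sample is within the allowed uncertainty."""
    return all(u <= max_uncertainty for u in uncertainty_samples)


# A correctly chosen range keeps uncertainty low; a wrong range produces
# high-uncertainty samples and the run is invalidated.
assert is_valid_power_run([0.002, 0.004, 0.003])
assert not is_valid_power_run([0.002, 0.050, 0.003])
```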

We had a previous discussion and, after testing, found a limitation in PTDaemon that prevented us from using manual range setting as originally envisaged. Since the new PTDaemon might not be ready by the 3.1 code freeze, I'm proposing the following:

  1. Allow manual range setting for devices above 75 W, where there is no issue with the crest factor. Currently, manual range setting is allowed only as an experimental option in the master branch.
  2. Allowing this option should not affect the accuracy of the measured power in any way as long as the power usage is above 75 W. Any wrong manual setting will produce high-uncertainty samples, and the result will be invalidated by the submission checker.
  3. Manual range setting will remain optional; submitters who are unsure of the current range can continue to use the ranging-plus-testing runs as they do now.
  4. Once we have a fix for devices below 75 W, we can extend this option to those devices too.
  5. This change means some power submissions will have a ranging folder and some won't, but as long as the measured power is accurate, I don't think this should be an issue.
  6. With this change we can at least halve the power run times, thereby freeing the system and power analyzer for more submissions and, if nothing else, reducing global warming.

This will be particularly useful for long-running benchmarks like 3d-unet, which can take hours even on a good GPU.

We already had a detailed discussion on manual range setting here, and this method is validated by SPEC. Unfortunately it cannot be applied to low-power devices (<75 W) until SPEC PTDaemon is updated, but for other devices there should be no difference in the power being measured: if the user sets an invalid range, the uncertainty will be high and the measurement will be invalid.

  • Link to the SPEC Power Methodology: ranging mode is not mandatory in the SPEC methodology.
  • As per communication from the SPEC team, we can safely use "manual range setting" and rely on the "uncertainty" output to invalidate results.
  • SPEC wants the uncertainty measurement to be taken via the Uncertainty command and not via logs; this change is already done.

Percentage of power results

Power submissions are time-consuming chiefly because ranging mode doubles the run duration. This is one of the main reasons the percentage of power results is going down. It should drop further in 3.1, because 3 new models are coming (and only 1 is being removed), leaving submitters less time for power. So, I request that the power WG seriously consider this proposal, or at least give a strong reason why submitters should spend twice as much time on power submissions.

Inference 2.1: Results=5395, Power Results=2475, Percentage of power results = 45.8
Inference 3.0: Results=7283, Power Results=2449, Percentage of power results = 33.6
Inference 3.1: ?
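
For reference, the percentages above can be reproduced directly from the submission-checker totals; this is just arithmetic, not part of any official tooling:

```python
# Reproducing the power-result percentages quoted above from the
# submission-checker totals.

def power_share(total_results: int, power_results: int) -> float:
    """Percentage of results that include a power measurement."""
    return 100.0 * power_results / total_results

print(round(power_share(5395, 2475), 1))  # Inference 2.1 -> 45.9
print(round(power_share(7283, 2449), 1))  # Inference 3.0 -> 33.6
```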

@psyhtest Yesterday you asked for manual range setting to be an experimental option; this is already in the master branch and should be working as expected.

Documenting some of my thoughts on this.

  • We have spent significant cycles discussing these issues in the past (#270, and also earlier in 2021) and arrived at the conclusion that there is a reason for what we are doing currently and that we should continue this approach. Please refer to the MLPerf PowerWG discussion notes.

  • This short-term proposal calls for a practice that deviates from the industry-standard methodology. While many power measurement standardization approaches are much more rigorous (e.g., Olympic scoring), at MLC we adopted a somewhat lenient yet realistic approach that requires two steps: ranging and testing. These were well-thought-out and documented approaches, and I do not think the problem statement calls for this solution.

  • The problem statement seems to be the time taken for power measurements. While this is true, we have heard complaints about this approach from only 1-2 organizations consistently; it does not seem to represent the vast majority of submitting organizations. As noted above, power measurements across the industry adopt a more time-consuming approach that is deemed the industry standard for power measurement methodologies.

  • One rationale given for this proposal is that we do this for systems below 75 W and hence there is precedent. Note that the methodology for systems below 75 W is likely to change completely for other technical reasons that do not apply to the broader category, so we should be careful about changing a well-working methodology in favor of something that is bound to change.

  • The other rationale, that the number of power submissions is going down version over version, is inconsistent with the PowerWG messaging. Please see the PowerWG notes for what is being reported.

Summarizing, I see this proposal as a drastic change from the current methodology (and, importantly, one not backed by data) for what is labelled a short-term fix (as indicated in the PR), and as a matter of good practice we should avoid doing this as much as possible.

What I would like to propose is closing out this PR with these comments. I will wait to discuss in the PowerWG on 5/23.

"and arrived at a conclusion that there is a reason for what we are doing currently and we should look to continue this approach"

When did we arrive at that conclusion? I believe that was only for low-power devices, which indeed had an issue due to PTDaemon. Now that we are moving to DC power measurement for low-power devices, I don't see that argument holding here.

"This short term proposal calls for a practice that is deviating from the industry standard methodology."

Can you please say how? We already confirmed with SPEC that using the uncertainty command without a ranging-mode run is indeed an industry-standard practice accepted by SPEC.

"The problem statement seems to be time taken for doing power measurements. While this is a true statement, we have heard complaints about this approach from only 1-2 organization consistently. It does not seem to represent the vast majority of submitting organizations. As seen in the above point, power measurements across the industry adopt a more time consuming approach which is deemed industry standard for power measurement methodologies."

As I said in an earlier comment, there is no deviation from industry-standard methodology here. If there is, please show it and I'm happy to close this issue. Moreover, don't just 2 organizations contribute more than 95% of all the power results?

"The other rationale that the number of power submissions are going down version over version is inconsistent with the PowerWG messaging. Please see PowerWG notes for what is being reported."

I took the data directly from the submission results.

Section 3.16.2 in this SPEC document clearly describes how the power results are validated there. This is exactly what my proposal does. @dmiskovic-NV, please correct me if I'm wrong here.

Regarding data: we already shared many data points here, which is actually what unearthed the problem with low-power devices. For high-power devices there are no issues, and we can share 10 more data points on different benchmarks if needed, or any submitter can try them on their own system; if there is even 1 wrong measurement, we can close this issue.

Krai

[2023-05-23 20:51:58,326 submission_checker.py:2856 INFO] Results=3994, NoResults=0, Power Results=1735
[2023-05-23 20:51:58,326 submission_checker.py:2859 INFO] ---
[2023-05-23 20:51:58,326 submission_checker.py:2860 INFO] Closed Results=67, Closed Power Results=40

[2023-05-23 20:51:58,326 submission_checker.py:2861 INFO] Open Results=3927, Open Power Results=1695

[2023-05-23 20:51:58,326 submission_checker.py:2862 INFO] Network Results=0, Network Power Results=0

[2023-05-23 20:51:58,326 submission_checker.py:2863 INFO] ---
[2023-05-23 20:51:58,326 submission_checker.py:2865 INFO] Systems=20, Power Systems=10
[2023-05-23 20:51:58,326 submission_checker.py:2866 INFO] Closed Systems=18, Closed Power Systems=10
[2023-05-23 20:51:58,326 submission_checker.py:2867 INFO] Open Systems=16, Open Power Systems=7
[2023-05-23 20:51:58,326 submission_checker.py:2868 INFO] Network Systems=0, Network Power Systems=0
[2023-05-23 20:51:58,326 submission_checker.py:2869 INFO] ---
[2023-05-23 20:51:58,326 submission_checker.py:2874 INFO] SUMMARY: submission looks OK

cTuning

[2023-05-23 20:41:05,744 submission_checker.py:2856 INFO] Results=1949, NoResults=0, Power Results=529
[2023-05-23 20:41:05,745 submission_checker.py:2859 INFO] ---
[2023-05-23 20:41:05,745 submission_checker.py:2860 INFO] Closed Results=26, Closed Power Results=19

[2023-05-23 20:41:05,745 submission_checker.py:2861 INFO] Open Results=1923, Open Power Results=510

[2023-05-23 20:41:05,745 submission_checker.py:2862 INFO] Network Results=0, Network Power Results=0

[2023-05-23 20:41:05,745 submission_checker.py:2863 INFO] ---
[2023-05-23 20:41:05,745 submission_checker.py:2865 INFO] Systems=47, Power Systems=10
[2023-05-23 20:41:05,745 submission_checker.py:2866 INFO] Closed Systems=5, Closed Power Systems=3
[2023-05-23 20:41:05,745 submission_checker.py:2867 INFO] Open Systems=47, Open Power Systems=9
[2023-05-23 20:41:05,745 submission_checker.py:2868 INFO] Network Systems=0, Network Power Systems=0
[2023-05-23 20:41:05,745 submission_checker.py:2869 INFO] ---
[2023-05-23 20:41:05,745 submission_checker.py:2874 INFO] SUMMARY: submission looks OK
arjun@hp-envy:~/inference/tools/submission$ python3 submission_checker.py --input  ~/inference_results_v3.0 --submitter cTuning --skip-meaningful-fields-empty-check --skip-empty-files-check

Qualcomm

[2023-05-23 21:00:03,729 submission_checker.py:2856 INFO] Results=107, NoResults=0, Power Results=65
[2023-05-23 21:00:03,729 submission_checker.py:2859 INFO] ---
[2023-05-23 21:00:03,729 submission_checker.py:2860 INFO] Closed Results=88, Closed Power Results=56

[2023-05-23 21:00:03,729 submission_checker.py:2861 INFO] Open Results=15, Open Power Results=9

[2023-05-23 21:00:03,729 submission_checker.py:2862 INFO] Network Results=4, Network Power Results=0

[2023-05-23 21:00:03,729 submission_checker.py:2863 INFO] ---
[2023-05-23 21:00:03,729 submission_checker.py:2865 INFO] Systems=11, Power Systems=7
[2023-05-23 21:00:03,729 submission_checker.py:2866 INFO] Closed Systems=11, Closed Power Systems=7
[2023-05-23 21:00:03,729 submission_checker.py:2867 INFO] Open Systems=3, Open Power Systems=2
[2023-05-23 21:00:03,729 submission_checker.py:2868 INFO] Network Systems=1, Network Power Systems=0
[2023-05-23 21:00:03,729 submission_checker.py:2869 INFO] ---
[2023-05-23 21:00:03,729 submission_checker.py:2874 INFO] SUMMARY: submission looks OK

NVIDIA

[2023-05-23 20:58:59,886 submission_checker.py:2856 INFO] Results=268, NoResults=0, Power Results=46
[2023-05-23 20:58:59,886 submission_checker.py:2859 INFO] ---
[2023-05-23 20:58:59,886 submission_checker.py:2860 INFO] Closed Results=262, Closed Power Results=46

[2023-05-23 20:58:59,886 submission_checker.py:2861 INFO] Open Results=0, Open Power Results=0

[2023-05-23 20:58:59,886 submission_checker.py:2862 INFO] Network Results=6, Network Power Results=0

[2023-05-23 20:58:59,886 submission_checker.py:2863 INFO] ---
[2023-05-23 20:58:59,886 submission_checker.py:2865 INFO] Systems=19, Power Systems=3
[2023-05-23 20:58:59,886 submission_checker.py:2866 INFO] Closed Systems=18, Closed Power Systems=3
[2023-05-23 20:58:59,886 submission_checker.py:2867 INFO] Open Systems=0, Open Power Systems=0
[2023-05-23 20:58:59,886 submission_checker.py:2868 INFO] Network Systems=1, Network Power Systems=0
[2023-05-23 20:58:59,886 submission_checker.py:2869 INFO] ---
[2023-05-23 20:58:59,886 submission_checker.py:2874 INFO] SUMMARY: submission looks OK

Dell

[2023-05-23 21:09:36,333 submission_checker.py:2856 INFO] Results=211, NoResults=0, Power Results=40
[2023-05-23 21:09:36,333 submission_checker.py:2859 INFO] ---
[2023-05-23 21:09:36,333 submission_checker.py:2860 INFO] Closed Results=211, Closed Power Results=40

[2023-05-23 21:09:36,333 submission_checker.py:2861 INFO] Open Results=0, Open Power Results=0

[2023-05-23 21:09:36,333 submission_checker.py:2862 INFO] Network Results=0, Network Power Results=0

[2023-05-23 21:09:36,333 submission_checker.py:2863 INFO] ---
[2023-05-23 21:09:36,334 submission_checker.py:2865 INFO] Systems=21, Power Systems=4
[2023-05-23 21:09:36,334 submission_checker.py:2866 INFO] Closed Systems=21, Closed Power Systems=4
[2023-05-23 21:09:36,334 submission_checker.py:2867 INFO] Open Systems=0, Open Power Systems=0
[2023-05-23 21:09:36,334 submission_checker.py:2868 INFO] Network Systems=0, Network Power Systems=0
[2023-05-23 21:09:36,334 submission_checker.py:2869 INFO] ---
[2023-05-23 21:09:36,334 submission_checker.py:2874 INFO] SUMMARY: submission looks OK

Total

[2023-05-23 21:05:51,774 submission_checker.py:2856 INFO] Results=7283, NoResults=0, Power Results=2449
[2023-05-23 21:05:51,774 submission_checker.py:2859 INFO] ---
[2023-05-23 21:05:51,774 submission_checker.py:2860 INFO] Closed Results=1333, Closed Power Results=232

[2023-05-23 21:05:51,774 submission_checker.py:2861 INFO] Open Results=5936, Open Power Results=2217

[2023-05-23 21:05:51,774 submission_checker.py:2862 INFO] Network Results=14, Network Power Results=0

[2023-05-23 21:05:51,774 submission_checker.py:2863 INFO] ---
[2023-05-23 21:05:51,774 submission_checker.py:2865 INFO] Systems=200, Power Systems=40
[2023-05-23 21:05:51,774 submission_checker.py:2866 INFO] Closed Systems=134, Closed Power Systems=32
[2023-05-23 21:05:51,774 submission_checker.py:2867 INFO] Open Systems=88, Open Power Systems=19
[2023-05-23 21:05:51,774 submission_checker.py:2868 INFO] Network Systems=3, Network Power Systems=0
[2023-05-23 21:05:51,774 submission_checker.py:2869 INFO] ---
[2023-05-23 21:05:51,774 submission_checker.py:2874 INFO] SUMMARY: submission looks OK

Contribution by Krai + cTuning = 1735 + 529 = 2264 power results out of 2449 in total, which is more than 92%. So we have 2 submitters contributing more than 92% of all power results and 5 submitters contributing more than 98.6% of them. Of these, I believe Krai is doing mostly low-power devices, so this may not be important for them. Unless there is a valid justification to reject this proposal, on our part it makes more sense to do 2x non-power submissions instead of unnecessarily doing a ranging run to showcase power. For submitters doing power on just 1-2 systems, waiting an extra hour is not a big deal.

@rakshithvasudev I hope you are also interested in this proposal, as anyone running 3d-unet will be 😄

mgoin commented

I would like to second the importance of removing unnecessary runs and time required for power submissions. This steep increase (at least 2x time for each benchmark) has deterred Neural Magic from contributing power results on several hardware platforms. This change would absolutely help lower the barrier to entry while simultaneously encouraging more holistic and thorough submissions to MLPerf.

@dmiskovic-NV, can you please confirm whether this proposal is fine from the SPEC Power side? I believe you have seen the replies from Greg.

From my side, I'm going to do at least 1000 power submissions in the 3.1 round with manual range setting, whether or not they are officially approved.

Hi @arjunsuresh, all communications to/from SPEC regarding PTD need to go through the official channels. That means me or the WG chairs, and the official SPEC email address.

It's not fair or reasonable to ask dmiskovic-NV to provide an official answer on behalf of SPEC, as that is outside his job and role.

Please note: this means @arjunsuresh should not be making inquiries to SPEC on behalf of MLCommons. That is the role of the WG chair or the executive director. SPEC has specifically asked that all inquiries be handled in a particular manner to avoid confusion or problems.

I would like to second the importance of removing unnecessary runs and time required for power submissions. This steep increase (at least 2x time for each benchmark) has deterred Neural Magic from contributing power results on several hardware platforms. This change would absolutely help lower the barrier to entry while simultaneously encouraging more holistic and thorough submissions to MLPerf.

@mgoin - Thanks for speaking up, appreciate the perspective! What is the total run time for the benchmarks with and without power (don't need exact, ballpark is good enough)?

Thank you for your reply, @TheKanter. I communicated with SPEC only once, as directed by the power WG, with the WG chairs in CC. Their reply to this issue is captured in this comment: #270 (comment)

I only asked @dmiskovic-NV for his interpretation, as some people read this reply as a "no" whereas for me it's clearly a "yes".

" What is the total run time for the benchmarks with and without power (don't need exact, ballpark is good enough)?"

@TheKanter Actually, for the last inference round we helped get the power results for Neural Magic, as their DeepSparse implementation is integrated in CK/CM. Power runs take 2x the time of non-power runs. For optimized runs like the one for Neural Magic, that is a change from 10 minutes to 20 minutes. But for the baseline power comparison we also had to run the native run (onnxruntime on CPU), which took close to 2 hours just for the offline scenario of a single benchmark.

Also, the problem with ranging mode is not just the doubling of the runtime. Say we have 3 submission systems and just 1 power analyzer. If we have, say, 6 hours of non-power runs on each system, the submission times are as follows (a worked calculation follows after the list):

  • Non-power: 6 hours, since all runs can happen at the same time
  • Power with ranging: 2 * 6 * 3 = 36 hours (runs have to be sequential because there is only 1 power analyzer, and each run takes twice as long due to ranging mode)
  • Power with manual range setting: 6 * 3 = 18 hours
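
A quick back-of-the-envelope calculation for the example above (this is just illustrative arithmetic):

```python
# Back-of-the-envelope wall-clock times for the example above:
# 3 submission systems sharing 1 power analyzer, 6 hours of non-power runs each.

systems = 3
hours_per_system = 6

non_power = hours_per_system                         # all systems run in parallel
power_with_ranging = 2 * hours_per_system * systems  # serialized and doubled
power_fixed_range = hours_per_system * systems       # serialized only

print(non_power, power_with_ranging, power_fixed_range)  # 6 36 18
```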

@araghun - I'm leaving my comments here, as per my discussions with you and David Kanter.

  • Manual ranging is not an option we'd like, since it is not consistent with approaches where all submitters use the same flow without any "manual" settings.
  • Performance submissions also include an accuracy phase, so we get two sets of runs: the performance during accuracy and the performance during the measured run, which are compared and must be within some percentage of each other. This is consistent for edge and data center.
  • We want consistent flows. A flow that checks ranging vs. testing (measured/submitted) runs for edge devices that are mostly under 75 W, but does not do this for data center, is not a consistent approach.
  • A shorter ranging run, which could make submissions more productive, has not been investigated.
  • Saying that data center is impacted but edge is not is inconsistent.
  • SPEC originally recommended ranging to eliminate manual intervention (Klaus, during v1.0, as captured in the MLPerf Power notes), and they also use a limited ranging mode in their server power benchmark, SERT.

We need to make sure all these aspects are addressed

"SPEC originally recommended the ranging for eliminating manual intervention (Klaus during v1.0 - as captured in the MLPerf Power notes) and they also use a limited ranging mode for their server power benchmarks called SERT"

@s-idgunji Can you please point to the exact recommendation from Klaus? 'Ranging mode' is always good to have for first-time users; there is no doubt about that. I would like to know whether Klaus or anyone from SPEC has disallowed "manual range" setting, since it is something SPEC Power allows in their documentation.

Here is the shorter ranging run proposal.

#315

Those who are opposing and have never done any power submissions can bring in new arguments 🙂

Since this mechanism is now in place, this issue is no longer relevant. Hence, closing.