mlcommons/power-dev

Unable to pass the checker

rakshithvasudev opened this issue · 31 comments

Hello All,

I'm not able to pass the checker:

(mlperf) rakshith@mlperf-inference-rakshith:/work$ python3.7 build/power-dev/compliance/check.py cout/2021-07-20_16-04-39_resnetserver/
[x] Check client sources checksum
[x] Check server sources checksum
[x] Check PTD commands and replies
[x] Check UUID
[x] Check session name
[x] Check time difference
[x] Check client server messages
[x] Check results checksum
[ ] Check errors and warnings from PTD logs
        '07-20-2021 21:04:40.224: ERROR: USB.' in ptd_log.txt

[x] Check PTD configuration
[x] Check debug is disabled on server-side
[ ] Check release version
        using of not-yet released version of checker


ERROR: Not all checks passed

Not sure where the ERROR: USB is coming from.

I looked into the temp file at C:\Users\pa\AppData\Local\Temp\tmpdf741fyg\ptd_logfile.txt created on the server side. Here are its logs:

Time,07-20-2021 20:59:26.386,Watts,1002.600000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_15-55-18_resnetserver_testing,Ch1,Watts,538.500000,Volts,118.600000,Amps,4.918000,PF,0.923200,Ch2,Watts,464.100000,Volts,118.840000,Amps,4.919000,PF,0.794000,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 20:59:27.385,Watts,1001.700000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_15-55-18_resnetserver_testing,Ch1,Watts,538.100000,Volts,118.590000,Amps,4.914000,PF,0.923400,Ch2,Watts,463.600000,Volts,118.840000,Amps,4.914000,PF,0.793900,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 20:59:28.385,Watts,1001.700000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_15-55-18_resnetserver_testing,Ch1,Watts,538.100000,Volts,118.590000,Amps,4.914000,PF,0.923300,Ch2,Watts,463.600000,Volts,118.840000,Amps,4.914000,PF,0.793800,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:04:23.695,Watts,987.300000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-03-44_resnetserver_ranging,Ch1,Watts,531.100000,Volts,118.580000,Amps,4.846000,PF,0.924200,Ch2,Watts,456.200000,Volts,118.850000,Amps,4.846000,PF,0.792100,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:05:19.247,Watts,986.600000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-04-39_resnetserver_ranging,Ch1,Watts,530.800000,Volts,118.600000,Amps,4.842000,PF,0.924400,Ch2,Watts,455.800000,Volts,118.850000,Amps,4.842000,PF,0.792000,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:05:20.262,Watts,991.700000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-04-39_resnetserver_ranging,Ch1,Watts,533.300000,Volts,118.600000,Amps,4.866000,PF,0.924000,Ch2,Watts,458.400000,Volts,118.830000,Amps,4.866000,PF,0.792700,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:05:21.262,Watts,1010.400000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-04-39_resnetserver_ranging,Ch1,Watts,542.200000,Volts,118.590000,Amps,4.955000,PF,0.922700,Ch2,Watts,468.200000,Volts,118.830000,Amps,4.955000,PF,0.795100,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:05:22.262,Watts,1020.400000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-04-39_resnetserver_ranging,Ch1,Watts,547.000000,Volts,118.590000,Amps,5.003000,PF,0.922000,Ch2,Watts,473.400000,Volts,118.820000,Amps,5.003000,PF,0.796300,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:05:23.261,Watts,1023.100000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-04-39_resnetserver_ranging,Ch1,Watts,548.300000,Volts,118.580000,Amps,5.016000,PF,0.921900,Ch2,Watts,474.800000,Volts,118.830000,Amps,5.016000,PF,0.796700,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:05:24.261,Watts,1017.100000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-04-39_resnetserver_ranging,Ch1,Watts,545.400000,Volts,118.590000,Amps,4.987000,PF,0.922300,Ch2,Watts,471.700000,Volts,118.830000,Amps,4.987000,PF,0.796000,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:05:25.261,Watts,1023.200000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-04-39_resnetserver_ranging,Ch1,Watts,548.300000,Volts,118.590000,Amps,5.017000,PF,0.921600,Ch2,Watts,474.900000,Volts,118.850000,Amps,5.017000,PF,0.796400,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
Time,07-20-2021 21:05:26.261,Watts,1036.900000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_16-04-39_resnetserver_ranging,Ch1,Watts,554.900000,Volts,118.610000,Amps,5.080000,PF,0.921000,Ch2,Watts,482.000000,Volts,118.870000,Amps,5.080000,PF,0.798100,Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000
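
For anyone who wants to inspect these lines programmatically, here is a minimal sketch of a parser for the key/value layout visible above. It is not taken from power-dev or PTDaemon: the function name `parse_ptd_line` and the dotted `Ch1.Watts` key style are my own, and I am assuming that fields never contain commas. In the sample, the per-channel Watts add up to the top-level Watts (538.5 + 464.1 + 0.0 = 1002.6), while the top-level Volts/Amps/PF are reported as -1.000000.

```python
from typing import Dict, Union

# First log line from the post above, copied verbatim.
SAMPLE = (
    "Time,07-20-2021 20:59:26.386,Watts,1002.600000,Volts,-1.000000,"
    "Amps,-1.000000,PF,-1.000000,Mark,2021-07-20_15-55-18_resnetserver_testing,"
    "Ch1,Watts,538.500000,Volts,118.600000,Amps,4.918000,PF,0.923200,"
    "Ch2,Watts,464.100000,Volts,118.840000,Amps,4.919000,PF,0.794000,"
    "Ch3,Watts,0.000000,Volts,0.000000,Amps,0.000000,PF,0.000000"
)


def parse_ptd_line(line: str) -> Dict[str, Union[str, float]]:
    """Parse one PTD log line into a flat dict (hypothetical helper).

    Top-level keys keep their name ("Watts"); per-channel keys are
    prefixed with the channel ("Ch1.Watts").
    """
    fields = line.strip().split(",")
    record: Dict[str, Union[str, float]] = {}
    prefix = ""
    i = 0
    while i < len(fields):
        key = fields[i]
        # "Ch1", "Ch2", ... start a new channel section and carry no value.
        if key.startswith("Ch") and key[2:].isdigit():
            prefix = key + "."
            i += 1
            continue
        value = fields[i + 1]
        # Time and Mark are strings; everything else is numeric.
        record[prefix + key] = value if key in ("Time", "Mark") else float(value)
        i += 2
    return record


record = parse_ptd_line(SAMPLE)
channel_sum = sum(record[f"Ch{n}.Watts"] for n in (1, 2, 3))
print(record["Mark"], record["Watts"], round(channel_sum, 1))
```

Grouping the parsed records by `Mark` would separate the testing and ranging sessions that are interleaved in the file.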

Do I just have to wait until the official release branch is ready, or does this need a bugfix somewhere?

Adding some more info:

@s-idgunji, In an email thread back in March with Dejan (would tag if I knew gh handle) I had mentioned seeing the ERROR: USB when PTD initializes connection to a WT333 via USB. He indicated that it was not an actual error and did not have any influence on the actual execution of the code.
I can revive the thread if needed! 😄

It's high time to create a release branch r1.1!

Agreed. Let's get it done next Wed. Do you think you can help us with getting branch r1.1, @psyhtest?

@s-idgunji I think I would need to have write permissions to this repository for that?

The PTDaemon executables are in the closed repo, not here.

Thanks @s-idgunji @psyhtest @trevor-cockrell . So if I'm understanding this correctly, ERROR: USB. will be resolved in the v1.1 release?

If not, I wanted to let you know that this is an issue, in addition to the "using of not-yet released version of checker" failure, which will be resolved in the next release :)

> It's high time to create a release branch r1.1!

@psyhtest - Please refer to the other discussion with @nvpohanh, where the proposal is to use power-dev.

Action items:

  1. @s-idgunji will check with Dejan to confirm whether this is a PTD issue or an MLPerf power issue. We think it is a PTD issue.
  2. If it is a PTD issue, we will need to work with the SPEC rep to resolve this officially.

More context on this:

  1. Windows director machine.
  2. PTD version: Version 1.9.2-3976349f-20201208

I use:

  1. Linux (Ubuntu 16.04, 20.04) director machine.
  2. PTDaemon: Version 1.9.2-e8c7a49a-20201208 (straight from the closed repository).

Apologies for being terse above when I said that a new branch is required!

A couple of months ago I hit the same error: using of not-yet released version of checker. It comes right at the end of a list of checks:

    check_with_description = {
        "Check client sources checksum": lambda: sources_check(client),
        "Check server sources checksum": lambda: sources_check(server),
        "Check PTD commands and replies": lambda: ptd_messages_check(server),
        "Check UUID": lambda: uuid_check(client, server),
        "Check session name": lambda: session_name_check(client, server),
        "Check time difference": lambda: phases_check(client, server, path),
        "Check client server messages": lambda: messages_check(client, server),
        "Check results checksum": lambda: results_check(server, client, path),
        "Check errors and warnings from PTD logs": lambda: check_ptd_logs(
            server, client, path
        ),
        "Check PTD configuration": lambda: check_ptd_config(server),
        "Check debug is disabled on server-side": lambda: debug_check(server),
        "Check release version": lambda: version_check(),
    }

The version_check() function simply throws an error in our face:

    def version_check() -> None:
        """Only for master branch"""
        assert False, "using of not-yet released version of checker"

Then I looked at the r1.0 branch and found that the similar list has no version_check():

    check_with_description = {
        "Check client sources checksum": lambda: sources_check(client),
        "Check server sources checksum": lambda: sources_check(server),
        "Check PTD commands and replies": lambda: ptd_messages_check(server),
        "Check UUID": lambda: uuid_check(client, server),
        "Check session name": lambda: session_name_check(client, server),
        "Check time difference": lambda: phases_check(client, server, path),
        "Check client server messages": lambda: messages_check(client, server),
        "Check results checksum": lambda: results_check(server, client, path),
        "Check errors and warnings from PTD logs": lambda: check_ptd_logs(
            server, client, path
        ),
        "Check PTD configuration": lambda: check_ptd_config(server),
        "Check debug is disabled on server-side": lambda: debug_check(server),
    }

(In fact, the function itself was not in the file.)

This led me to believe that the assumed process is to create a new release branch of the power-dev repository when the time is right, and remove the function from the list of checks there.
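
To make the mechanism concrete, here is a minimal sketch of how a dictionary of named checks like the one above can be run, and why dropping one entry on a release branch is enough to make the report pass. Only `version_check()` is copied from the snippet above; `run_check`, `run_all`, and the two small dictionaries are hypothetical stand-ins, not the actual code in check.py.

```python
from typing import Callable, Dict


def version_check() -> None:
    """Only for master branch"""
    assert False, "using of not-yet released version of checker"


def run_check(description: str, check: Callable[[], None]) -> bool:
    """Run one check and print a [x]/[ ] line (hypothetical runner)."""
    try:
        check()
    except AssertionError as error:
        print(f"[ ] {description}")
        print(f"\t{error}\n")
        return False
    print(f"[x] {description}")
    return True


def run_all(checks: Dict[str, Callable[[], None]]) -> bool:
    # Build a list first so that every check runs even after a failure.
    results = [run_check(name, check) for name, check in checks.items()]
    return all(results)


# Stand-in for the master branch: one trivially passing check plus the version check.
master_checks: Dict[str, Callable[[], None]] = {
    "Check UUID": lambda: None,
    "Check release version": lambda: version_check(),
}

# Stand-in for a release branch: the same checks without the version check.
release_checks = {
    name: check
    for name, check in master_checks.items()
    if name != "Check release version"
}

print("master passes: ", run_all(master_checks))
print("release passes:", run_all(release_checks))
```

With this layout the master branch can never produce a fully passing report, which matches the behaviour seen in the original post.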

However, this is probably not the last assumption. Looking at sources_checksums.json, it appears the checksum will need to be calculated for all critical files on the branch. Is this right?

If I'm given write permission to the power-dev repository, I can try and fix this tomorrow.

> More context on this:
>
>   1. Windows director machine.
>   2. PTD version: Version 1.9.2-3976349f-20201208

More info: Version 1.9.2-3976349f-20201208 is from the closed repository as well.

We need to create r1.1 branch and integrate this change: #251

For the USB issue, we should add it to the COMMON_ERROR list here: https://github.com/mlcommons/power-dev/blob/master/compliance/check.py#L65
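
A rough sketch of what such an allow-list amounts to, for illustration only. The actual structure of the list in check.py is not shown in this thread, so `KNOWN_BENIGN_ERRORS` and `unexpected_errors` below are hypothetical names, and matching on the end of the line is my assumption. The first sample line is the one reported above; the second is a made-up placeholder, not real PTD output.

```python
from typing import Iterable, List

# Hypothetical allow-list of messages considered harmless.
# "ERROR: USB." is the message discussed in this thread.
KNOWN_BENIGN_ERRORS = ["ERROR: USB."]


def unexpected_errors(log_lines: Iterable[str]) -> List[str]:
    """Return the ERROR lines that are not on the allow-list."""
    problems = []
    for raw in log_lines:
        line = raw.strip()
        if "ERROR:" not in line:
            continue  # not an error line
        if any(line.endswith(known) for known in KNOWN_BENIGN_ERRORS):
            continue  # known and considered harmless
        problems.append(line)
    return problems


sample_log = [
    "07-20-2021 21:04:40.224: ERROR: USB.",                # from this thread
    "07-20-2021 21:04:41.000: ERROR: placeholder failure",  # made up for the demo
]
print(unexpected_errors(sample_log))
```

Only the placeholder line is reported, so a log containing nothing worse than the USB message would pass this kind of check.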

Requested @sub-mod to get @psyhtest write access to power-dev

The branch has been created and the version check removed. The USB issue has not been dealt with yet.

The hashes are SHA1, but do not need to be updated at the moment.

anton@krai:~/projects/mlperf/power-dev-r1.1/ptd_client_server$ find . -name "*.py" -exec sha1sum {} \; | sort -k2
33ca4f26368777ac06e01f9567b714a4b8063886  ./client.py
da39a3ee5e6b4b0d3255bfef95601890afd80709  ./__init__.py
4c2b78fb4849a7e5b584ef792d82aaed20b17f57  ./lib/client.py
624d0c0acc7c39aaff3674f0b99d6a09da53d1dc  ./lib/common.py
da39a3ee5e6b4b0d3255bfef95601890afd80709  ./lib/external/__init__.py
4da8f970656505a40483206ef2b5d3dd5e81711d  ./lib/external/ntplib.py
da39a3ee5e6b4b0d3255bfef95601890afd80709  ./lib/__init__.py
24ae49fb193809cf47f2c18b1c9c7c866244be4d  ./lib/server.py
60a2e02193209e8d392803326208d5466342da18  ./lib/source_hashes.py
aa92f0a3f975eecd44d3c0cd0236342ccc9f941d  ./lib/summary.py
3210db56eb0ff0df57bf4293dc4d4b03fffd46f1  ./lib/time_sync.py
c3f90f2f7eeb4db30727556d0c815ebc89b3d28b  ./server.py
da39a3ee5e6b4b0d3255bfef95601890afd80709  ./tests/unit/__init__.py
99ae15aef722f2000ee6ed1ae1523637bf1ae42b  ./tests/unit/test_server.py
00468a2907583c593e6574a1f6b404e4651c221a  ./tests/unit/test_source_hashes.py
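
For directors without `find` and `sha1sum` (e.g. Windows), the same listing can be approximated in Python. This is only a sketch: `sha1_of_file` and `hash_python_sources` are hypothetical helpers, and I am not claiming this is how lib/source_hashes.py computes or stores its values. One sanity check is that every empty `__init__.py` above has the hash `da39a3ee5e6b4b0d3255bfef95601890afd80709`, which is the SHA-1 of zero bytes.

```python
import hashlib
from pathlib import Path
from typing import Dict

# SHA-1 of empty input; matches the empty __init__.py entries in the listing.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()


def sha1_of_file(path: Path) -> str:
    """Hash a file's raw bytes in chunks, like sha1sum does."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_python_sources(root: Path) -> Dict[str, str]:
    """Map each *.py file under root (relative POSIX path) to its SHA-1."""
    return {
        path.relative_to(root).as_posix(): sha1_of_file(path)
        for path in sorted(root.rglob("*.py"))
    }


print(EMPTY_SHA1)
```

Calling `hash_python_sources(Path("ptd_client_server"))` from a checkout would give a dictionary that can be compared against the listing above.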

@psyhtest @rakshithvasudev Could you share the logs with that error with me? I need them to verify that my fix works.

Hello @nvpohanh, thanks for the fix. The branch issue is fixed, but the USB issue still persists. My understanding is that we have not looked into the USB error yet. I'll email you the log file; sorry, I'm not sure whether I can upload logs to a public location just yet. Here are the results before and after the fix:

Before:

[x] Check PTD commands and replies
[x] Check UUID
[x] Check session name
[x] Check time difference
[x] Check client server messages
[x] Check results checksum
[ ] Check errors and warnings from PTD logs
        '07-28-2021 15:18:10.465: ERROR: USB.' in ptd_log.txt

[x] Check PTD configuration
[x] Check debug is disabled on server-side
[ ] Check release version
        using of not-yet released version of checker

ERROR: Not all checks passed

After:

[x] Check server sources checksum
[x] Check PTD commands and replies
[x] Check UUID
[x] Check session name
[x] Check time difference
[x] Check client server messages
[x] Check results checksum
[ ] Check errors and warnings from PTD logs
        '07-28-2021 15:18:10.465: ERROR: USB.' in ptd_log.txt

[x] Check PTD configuration
[x] Check debug is disabled on server-side

ERROR: Not all checks passed

@rakshithvasudev - Is this issue resolved, and can it be closed?

The PR went into the master branch instead of r1.1 so @rakshithvasudev cannot benefit from it just yet. I will bring it in together with a solution for auto-ranging which I'm currently verifying.

Thanks @psyhtest, I assume this is expected now?

(mlperf) rakshith@mlperf-inference-rakshith-x86_64:/work/build$ python power-dev/compliance/check.py power_logs/2021.08.05-21.58.23/
[x] Check client sources checksum
[x] Check server sources checksum
[x] Check PTD commands and replies
[x] Check UUID
[x] Check session name
[ ] Check time difference
Unhandled exeception:
Traceback (most recent call last):
  File "power-dev/compliance/check.py", line 651, in check_with_logging
    check()
  File "power-dev/compliance/check.py", line 680, in <lambda>
    "Check time difference": lambda: phases_check(client, server, path),
  File "power-dev/compliance/check.py", line 352, in phases_check
    system_begin_r, system_end_r = _get_begin_end_time_from_mlperf_log_detail(
  File "power-dev/compliance/check.py", line 277, in _get_begin_end_time_from_mlperf_log_detail
    with open(file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'power_logs/2021.08.05-21.58.23/ranging/mlperf_log_detail.txt'
[x] Check client server messages
[ ] Check results checksum
        power_logs/2021.08.05-21.58.23/power/server.json + power_logs/2021.08.05-21.58.23/power/client.json results checksum values and calculated power_logs/2021.08.05-21.58.23/ content checksum comparison:
 Missing 'ranging/mlperf_log_detail.txt, ranging/mlperf_log_summary.txt, run_1/mlperf_log_detail.txt, run_1/mlperf_log_summary.txt'

[ ] Check errors and warnings from PTD logs
Unhandled exeception:
Traceback (most recent call last):
  File "power-dev/compliance/check.py", line 651, in check_with_logging
    check()
  File "power-dev/compliance/check.py", line 683, in <lambda>
    "Check errors and warnings from PTD logs": lambda: check_ptd_logs(
  File "power-dev/compliance/check.py", line 503, in check_ptd_logs
    start_load_time, stop_load_time = _get_begin_end_time_from_mlperf_log_detail(
  File "power-dev/compliance/check.py", line 277, in _get_begin_end_time_from_mlperf_log_detail
    with open(file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'power_logs/2021.08.05-21.58.23/run_1/mlperf_log_detail.txt'
[x] Check PTD configuration
[x] Check debug is disabled on server-side
[ ] Check release version
        using of not-yet released version of checker


ERROR: Not all checks passed

@rakshithvasudev - How do you plan to make forward progress until this is resolved? Leaving aside the checker, is your flow working fine?

Yes @s-idgunji, I'm able to get perf and power numbers. I'm just not able to pass the compliance checker. Thanks!

@rakshithvasudev Could you check why power_logs/2021.08.05-21.58.23/ranging/mlperf_log_detail.txt does not exist? After you run a power run, the mlperf_log_detail.txt file should exist under the ranging/ directory.
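
A quick way to confirm that a given output directory is complete before running the checker is sketched below. The four relative paths are the ones named in the "Missing ..." message above; the helper `missing_log_files` is hypothetical and not part of power-dev, and the real checker may require more files than these.

```python
from pathlib import Path
from typing import List, Union

# Files reported as missing in the checker output above.
EXPECTED_FILES = [
    "ranging/mlperf_log_detail.txt",
    "ranging/mlperf_log_summary.txt",
    "run_1/mlperf_log_detail.txt",
    "run_1/mlperf_log_summary.txt",
]


def missing_log_files(session_dir: Union[str, Path]) -> List[str]:
    """Return the expected MLPerf log files that are absent from session_dir."""
    root = Path(session_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Example with the directory from the output above; on a machine without
# that directory, all four paths are reported as missing.
print(missing_log_files("power_logs/2021.08.05-21.58.23"))
```

An empty list means all four files are present; anything else points at the wrong folder or an incomplete run.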

@rakshithvasudev As @nvpohanh has noticed, you seem to be missing log files.

Also, you appear to be using master rather than the r1.1 branch:

[ ] Check release version
        using of not-yet released version of checker

The only commit that is on that branch for now removes this check.

Thanks @psyhtest @nvpohanh, I'm looking into what is going wrong and will let you know how it goes.

I was looking at a different folder. That was my bad. Was able to pass.

Thanks everybody!

Closing this :)

That's great @rakshithvasudev!

I've merged your fix into the r1.1 branch.

Thanks a lot @psyhtest !