StanfordAHA/aha

Resnet dense layers FAIL unless preceded by e.g. gaussian test


Executive summary: I'm not sure we are properly testing dense resnet layers.

Details:

I'm not sure what to do with this information (yet). And I have not done the work to thoroughly, under controlled conditions, verify what's really happening (yet). But here's what it looks like:

I tried using the "aha regress" command to run 'conv1' on its own, as a dense test, and it failed.

But! If I include a non-layer test such as gaussian in the test suite, so that "gaussian" runs first and is followed by "conv1", then both tests pass.

I run the tests with the garnet daemon turned ON, which means that the "conv1" test (re)uses the verilog that was built for the "gaussian" test, in the case where both tests pass. I think this is relevant. But I'm not sure what it means, since both are dense tests (I think?) and both should be using the same verilog anyway (right?)

The next thing I would probably try: What happens if we run "gaussian" + "conv1" with daemon OFF? I'm guessing that conv1 fails.
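
The combinations in play can be laid out as a small matrix. The sketch below only prints the plan together with the outcomes reported so far. It runs nothing, and it leaves open how the daemon is switched on or off, since that is not shown in this thread. The function name `print_matrix` is made up for illustration.

```shell
#!/usr/bin/env bash
# Enumerate the experiment matrix: test suite x daemon setting.
# Nothing is executed here; "untested" marks combinations not yet tried.
print_matrix() {
    local suite daemon result
    for suite in "conv1" "gaussian conv1"; do
        for daemon in on off; do
            result="untested"
            if [ "$daemon" = on ]; then
                # Outcomes observed so far, all with the daemon turned ON
                if [ "$suite" = "conv1" ]; then
                    result="conv1 FAIL"
                else
                    result="both PASS"
                fi
            fi
            printf '%-16s daemon=%-3s %s\n' "$suite" "$daemon" "$result"
        done
    done
}

print_matrix
```

The two "untested" rows are the daemon-OFF runs, including the "gaussian" + "conv1" case proposed above.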

Until I have better information, I guess I will file this as both a "garnet" issue and an "aha" issue...

I will include @kalhankoul96 as an assignee, because I think he'll be interested, and because he can remove himself and/or add more assignees if there's anyone else that might obviously want to know about this...

Garnet issue is here: StanfordAHA/garnet#1070

Yuchen is the Resnet guy, so I added him to take a look.

This is a bit weird. Gaussian and conv1 should both use the same RTL for dense tests. How exactly did you run 'conv1' on its own? Did you comment out all the other layers and run "aha regress resnet"?

> How exactly did you run 'conv1' on its own? Did you comment out all the other layers and run "aha regress resnet"?

Yes, that is essentially what I did. It's possible that I did it wrong, I did not spend a lot of time investigating. I also tried conv2x on its own, with the same result...

Have you ever used "aha regress" to run a single layer and, if so, how did you do it?

I have run a single layer in exactly the same way without errors. I will try running conv2x on its own on my side to see whether it passes locally.

Okay, maybe I will try again, more carefully this time, and see what happens. I will let you know if I cannot get it to work. Thanks!

I tried again, and it failed four times in a row, see https://buildkite.com/stanford-aha/aha-flow/builds/10315

So then I tried it separately in my own docker container. The complete transaction is shown below, and should be fully repeatable by anyone who wants to try it out.

Note that, in the test, conv2_x runs twice: once with a sparse machine and once with a dense machine. This is what regress.py does when you give it the --include-dense-only-tests flag. The sparse version works, and then the dense version fails unless gaussian is included in the tests. (As you can see, I removed gaussian for the purposes of this experiment.)

    image=stanfordaha/garnet:latest
    docker pull $image
    container=steveri-fix-gaussian
    docker run -id --name $container --rm -v /cad:/cad $image bash

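    # Now inside docker (the container above was started detached, so attach first,
    # e.g. with: docker exec -it $container bash)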

    # Delete gaussian from pr_aha[23] test suites
    mv aha/util/regress.py aha/util/regress.py.orig
    grep -v gaussian aha/util/regress.py.orig > aha/util/regress.py
    diff aha/util/regress.py.orig aha/util/regress.py
        <         imported_tests.glb_tests = ["apps/gaussian"]
        <         imported_tests.glb_tests = ["apps/gaussian"]

    # Setup
    source /aha/bin/activate
    source /cad/modules/tcl/init/sh
    module load base incisive xcelium/19.03.003 vcs/T-2022.06-SP2

    # make /bin/sh symlink to bash instead of dash:
    echo "dash dash/sh boolean false" | debconf-set-selections
    DEBIAN_FRONTEND=noninteractive dpkg-reconfigure dash

    # Install 'time' package (what? why?)
    apt update
    apt install time

    aha regress pr_aha2 --include-dense-only-tests >& aha2.log &
    tail -f aha2.log
    
    egrep '^[-]-- ' aha2.log
        --- Running regression: pr_aha2

        --- Generating Garnet
        --- conv2_x
        --- conv2_x - compiling and mapping
        --- conv2_x - pnr and pipelining
        --- DONE PNR
        --- conv2_x - glb testing
        [ PASSED SPARSE VERSION ]

        [ BEGIN DENSE VERSION ]
        --- Generating Garnet
        --- conv2_x
        --- conv2_x - compiling and mapping
        --- conv2_x - pnr and pipelining
        --- GARNET-BUILD (apps/resnet_output_stationary)
        --- GARNET-PNR   (apps/resnet_output_stationary)
        --- DONE PNR
        --- conv2_x - glb testing

    tail aha2.log

        [APP0-resnet_output_stationary] read output_0_block_3 from glb end
        [APP0-resnet_output_stationary] read output_0_block_4 from glb start
        Error: "tb/environment.sv", 300: $unit::\Environment::wait_interrupt : at time 6179473500 ps

        make: *** [Makefile:147: run] Error 2
        subprocess.CalledProcessError: Command '['make', 'sim']' returned non-zero exit status 2.
        subprocess.CalledProcessError: Command '['aha', 'test', 'apps/resnet_output_stationary']' returned non-zero exit status 1.
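
For anyone repeating this, a small helper can pull the failing stage out of a long log instead of reading it by eye. This is only a sketch: the function name `last_stage_before_error` is made up, and it assumes that stage markers start with `--- ` at the beginning of a line and that the first line containing `Error` is the real failure, as in the excerpts above. A log with earlier, benign "Error" text would need a stricter pattern.

```shell
#!/usr/bin/env bash
# Print the most recent "--- " stage marker seen before the first line
# containing "Error". Assumes the log format shown in the excerpts above.
last_stage_before_error() {
    local log="$1"
    awk '
        /^--- / { stage = $0 }                       # remember latest stage marker
        /Error/ { print stage; found = 1; exit }     # first error: report and stop
        END     { if (!found) print "no error found" }
    ' "$log"
}

# Usage (file name taken from the transcript above):
#   last_stage_before_error aha2.log
```

Because only the most recent marker is kept, the helper reports the stage that was running when the error appeared, not every stage that ran before it.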