Resnet dense layers FAIL unless preceded by e.g. gaussian test
Executive summary: I'm not sure we are properly testing dense resnet layers.
Details:
I'm not sure what to do with this information (yet). And I have not done the work to thoroughly, under controlled conditions, verify what's really happening (yet). But here's what it looks like:
I tried using the "aha regress" command to run 'conv1' on its own, as a dense test, and it failed.
But! If I include a non-layer test such as gaussian in the test suite, e.g. "gaussian" followed by "conv1", then both tests pass.
I run the tests with the garnet daemon turned ON, which means that the "conv1" test (re)uses the verilog that was built for the "gaussian" test, in the case where both tests pass. I think this is relevant. But I'm not sure what it means, since both are dense tests (I think?) and both should be using the same verilog anyway (right?)
The next thing I would probably try: What happens if we run "gaussian" + "conv1" with daemon OFF? I'm guessing that conv1 fails.
Until I have better information, I guess I will file this as both a "garnet" issue and an "aha" issue...
I will include @kalhankoul96 as an assignee, because I think he'll be interested, and because he can remove himself and/or add more assignees if there's anyone else that might obviously want to know about this...
Garnet issue is here: StanfordAHA/garnet#1070
Yuchen is the Resnet guy, so I added him to take a look.
This is a bit weird. Gaussian and conv1 should both use the same RTL for dense tests. How exactly did you run 'conv1' on its own? Did you comment out all the other layers and run "aha regress resnet"?
How exactly did you run 'conv1' on its own? Did you comment out all the other layers and run "aha regress resnet"?
Yes, that is essentially what I did. It's possible that I did it wrong; I did not spend a lot of time investigating. I also tried conv2x on its own, with the same result...
Have you ever used "aha regress" to run a single layer and, if so, how did you do it?
I have run a single layer in exactly the same way without errors. I will try to run conv2x on its own on my side to see whether it passes locally.
Okay, maybe I will try again, more carefully this time, and see what happens. I will let you know if I cannot get it to work. Thanks!
I tried again, and it failed four times in a row, see https://buildkite.com/stanford-aha/aha-flow/builds/10315
So then I tried it separately in my own docker container. The complete transaction is shown below, and should be fully repeatable by anyone who wants to try it out.
Note that, in the test, conv2_x runs twice, once with a sparse machine and once with a dense machine. This is what regress.py does when you give it the --include-dense-only-tests flag. The sparse version works, and then the dense version fails (unless gaussian is included in the tests; as you can see, I removed gaussian for the purposes of this experiment).
image=stanfordaha/garnet:latest
docker pull $image
container=steveri-fix-gaussian
docker run -id --name $container --rm -v /cad:/cad $image bash
docker exec -ti $container /bin/bash
# Now inside docker
# Delete gaussian from pr_aha[23] test suites
mv aha/util/regress.py aha/util/regress.py.orig
grep -v gaussian aha/util/regress.py.orig > aha/util/regress.py
diff aha/util/regress.py.orig aha/util/regress.py
< imported_tests.glb_tests = ["apps/gaussian"]
< imported_tests.glb_tests = ["apps/gaussian"]
# Setup
source /aha/bin/activate
source /cad/modules/tcl/init/sh
module load base incisive xcelium/19.03.003 vcs/T-2022.06-SP2
# make /bin/sh symlink to bash instead of dash:
echo "dash dash/sh boolean false" | debconf-set-selections
DEBIAN_FRONTEND=noninteractive dpkg-reconfigure dash
# Install 'time' package (what? why?)
apt update
apt install time
aha regress pr_aha2 --include-dense-only-tests >& aha2.log &
tail -f aha2.log
egrep '^[-]-- ' aha2.log
--- Running regression: pr_aha2
--- Generating Garnet
--- conv2_x
--- conv2_x - compiling and mapping
--- conv2_x - pnr and pipelining
--- DONE PNR
--- conv2_x - glb testing
[ PASSED SPARSE VERSION ]
[ BEGIN DENSE VERSION ]
--- Generating Garnet
--- conv2_x
--- conv2_x - compiling and mapping
--- conv2_x - pnr and pipelining
--- GARNET-BUILD (apps/resnet_output_stationary)
--- GARNET-PNR (apps/resnet_output_stationary)
--- DONE PNR
--- conv2_x - glb testing
tail aha2.log
[APP0-resnet_output_stationary] read output_0_block_3 from glb end
[APP0-resnet_output_stationary] read output_0_block_4 from glb start
Error: "tb/environment.sv", 300: $unit::\Environment::wait_interrupt : at time 6179473500 ps
make: *** [Makefile:147: run] Error 2
subprocess.CalledProcessError: Command '['make', 'sim']' returned non-zero exit status 2.
subprocess.CalledProcessError: Command '['aha', 'test', 'apps/resnet_output_stationary']' returned non-zero exit status 1.
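
For anyone triaging a log like the one above, here is a minimal sketch of a helper that reports the last stage reached and the first simulator error. The function name summarize_log is hypothetical and not part of aha. It assumes only the log format visible in this thread: stage markers begin with "--- " and simulator failures begin with "Error:".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of aha): summarize where a regression log stopped.
# Assumption: stage markers start with "--- " and simulator errors start with
# "Error:", as in the aha2.log excerpts shown in this thread.
summarize_log() {
    local log="$1"
    local last_stage first_error

    # Last stage marker that was printed before the run ended.
    last_stage=$(grep -E '^--- ' "$log" | tail -n 1)

    # First simulator error line, if there is one.
    first_error=$(grep -m 1 -E '^Error:' "$log")

    echo "last stage:  ${last_stage:-<none>}"
    echo "first error: ${first_error:-<none>}"
}
```

Running "summarize_log aha2.log" after a failed run should make it easier to compare the failing stage between the gaussian and no-gaussian experiments without scrolling through the full log.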