StanfordAHA/garnet

PD: Nondeterminism in PE tile

While debugging changes to the GLB tile, I found intermittent, nondeterministic failures in the PE tile(!).

Example: Eight different builds used essentially the same RTL, except that the GLB tile used 64K SRAM in some builds and 256K in others. Meanwhile, the somewhat-unrelated PE tile would sometimes pass and sometimes fail, regardless of the GLB SRAM size setting. See the table at the end of this issue for a summary of the eight builds.

One of the failures seemed to be related to uniquification problems, so, following the advice in the Innovus message, I set the init_design_uniquify variable to 1 to fix that. FYI, the specific message from Innovus was

   **WARN: (IMPECO-560): The netlist is not unique, because the module
   'Tile_PE_mux_logic_1_20' is instantiated multiple times. Make the
   netlist unique by running 'set init_design_uniquify 1' before
   loading the design to avoid the problem.
   Type 'man IMPECO-560' for more detail.
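
For reference, a minimal sketch of where that setting would go in an Innovus setup script. Only the `set init_design_uniquify 1` line comes from the warning above; calling `init_design` right after it is an assumption about how the flow loads the design, and the real PE tile scripts may be organized differently.

```tcl
# Sketch only, not the actual garnet flow script.
# Per IMPECO-560, the variable must be set before the design is loaded.
set init_design_uniquify 1

# Assumption: the flow loads the netlist with init_design at this point.
init_design
```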

Another failure that occurred more than once was a short in M6 after postroute. I thought fixing the uniquification problem might also solve this, but it did not. By trial and error, I found that the short could be fixed by changing my existing PE fix-shorts script to run ten eco-route iterations instead of two. The added runtime was negligible: only a minute or two more for ten iterations than for two.

  +  # setNanoRouteMode -drouteEndIteration 2
  +  setNanoRouteMode -drouteEndIteration 10

It appears that sometimes we need one or both of these fixes and sometimes neither, depending on randomness in the environment. I'm hoping that leaving both fixes in place will improve robustness going forward.

Here is a summary of the eight runs that produced intermittent PE tile failures:

  GARNET   RUN      SRAM
   HASH    NAME     SIZE  RESULT
  --------------------------------------------------------------------------
  212fc7c glb4129   256K  PASSED glb_top only
  212fc7c gold.280  256K  FAILED full_chip context: uniq error + metal short
  --------------------------------------------------------------------------
  aa69f42 gold.4140 256K  PASSED was supposed to be same as gold.280
  75d1da4 gold.4141 64K   FAILED uniquification error + metal short
  9718ebb gold.4142 64K   PASSED uniquified + orig size
  4bc99c3 gold.4143 256K  PASSED uniquified + 4M run
  --------------------------------------------------------------------------
  6ef7828 gold.285  256K  FAILED M6 shorts
  f370fd8 gold.286  256K  PASSED using new fix-shorts script
  --------------------------------------------------------------------------

I am in the process of filing a couple of pull requests to fix these problems; they should appear soon.