async 'sf' cmd can trip lock bug in dasd caching causing CRASH

Question

async 'sf' cmd can trip lock bug in dasd caching causing CRASH

Fish-Git opened this issue 9 months ago · 0 comments

Currently, if you run a script containing e.g:

sf-333 nomerge
sf-338 nomerge
sf-339 nomerge

to individually remove multiple shadow files one after the other in quick succession, depending on the timing, Hercules sometimes crashes somewhere in the cache.c dasd caching logic when it tries to acquire a lock that was previously destroyed by one of the other sf threads that happen to be running at the same time.

Sometimes we get something like:

01:03:04.619 hthreads.c(105)   HHC90013E 'destroy_lock(&cacheblk[ix].lock)' failed: rc=16: busy; tid=00004208, loc=cache.c:469
01:03:04.619 hthreads.c(118)   HHC90014I lock &cacheblk[ix].lock was obtained by thread 00003990 at cache.c:175
01:03:04.619 hthreads.c(105)   HHC90013E 'initialize lock(&cacheblk[ix].lock)' failed: rc=17: already init'ed; tid=00004208, loc=cache.c:496
01:03:04.619 hthreads.c(110)   HHC90028I lock &cacheblk[ix].lock was already initialized at cache.c:496
01:03:04.619 hthreads.c(118)   HHC90014I lock &cacheblk[ix].lock was still held by thread 00003990 at cache.c:175

and then Hercules eventually just unexpectedly exits! That is to say, it doesn't crash, it just unexpectedly EXITS! (i.e. powers off!)

Other times it actually crashes though.

The actual test script to reproduce the bug is much longer than the simple one illustrated above, but the approach was the same: create shadow files, remove shadow files, repeat, over and over again, many, many times. This causes many simultaneous sf command threads to get created and run at the same time, eventually causing the bug to trip.

I haven't tried to debug the caching logic yet to find the bug, but I have managed to identify a simple workaround: simply don't run the sf commands asynchronously!

--- hyperion-1/hsccmd.c	2024-01-24 13:18:12.392121600 -0800
+++ hyperion-0/hsccmd.c	2024-02-06 00:06:02.564250300 -0800
@@ -7040,7 +7040,6 @@
 U16      lcss;                          /* Logical CSS               */
 int      flag = 1;                      /* sf- flag (default merge)  */
 int      level = 2;                     /* sfk level (default 2)     */
-TID      tid;                           /* sf command thread id      */
 char     c;                             /* work for sscan            */
 int      rc;
 
@@ -7183,11 +7182,20 @@
     /* Process the command */
     switch (action)
     {
+#if defined( OPTION_NOASYNC_SF_CMDS )
+        case '+': cckd_sf_add   ( dev ); break;
+        case '-': cckd_sf_remove( dev ); break;
+        case 'c': cckd_sf_comp  ( dev ); break;
+        case 'd': cckd_sf_stats ( dev ); break;
+        case 'k': cckd_sf_chk   ( dev ); break;
+#else
+        TID tid;
         case '+': if (create_thread( &tid, DETACHED, cckd_sf_add,    dev, "sf+ command" )) cckd_sf_add   ( dev ); break;
         case '-': if (create_thread( &tid, DETACHED, cckd_sf_remove, dev, "sf- command" )) cckd_sf_remove( dev ); break;
         case 'c': if (create_thread( &tid, DETACHED, cckd_sf_comp,   dev, "sfc command" )) cckd_sf_comp  ( dev ); break;
         case 'd': if (create_thread( &tid, DETACHED, cckd_sf_stats,  dev, "sfd command" )) cckd_sf_stats ( dev ); break;
         case 'k': if (create_thread( &tid, DETACHED, cckd_sf_chk,    dev, "sfk command" )) cckd_sf_chk   ( dev ); break;
+#endif
     }
 
     return 0;

The original author (Greg Smith) changed the sf_cmd function to process the various sf commands in their own thread back on 1/3/2008 in commit cde906b (although back then the function was called ShadowFile_cmd). Why he did this I don't know. Perhaps he was thinking that the 'remove', 'comp' or 'chk' functions might take a while, and so in order to prevent a long-running command from preventing the user from being able to enter another command, he decided to process each of them in their own thread.

Personally I don't see any of the sf commands as likely taking "a long time". Sure, they might not complete immediately, but none of them should take so long as to be a problem IMO. (Others may feel differently of course.)

IN ANY CASE, I am creating this GitHub Issue to record the fact that there is an as-yet an unidentified lock handling bug somewhere in Hercules's dasd caching logic in cache.c that needs to be found and fixed.

OR...., we could of course implement the workaround permanently instead of temporarily.

FOR THE TIME BEING HOWEVER, I am implementing the proposed temporary workaround, by #defining OPTION_NOASYNC_SF_CMDS in featall.h so as to prevent Hercules from potentially crashing until the bug can eventually be found and fixed.

That is all.