async 'sf' cmd can trip lock bug in dasd caching causing CRASH
Fish-Git opened this issue · 0 comments
Currently, if you run a script containing e.g:
sf-333 nomerge
sf-338 nomerge
sf-339 nomerge
to individually remove multiple shadow files one after the other in quick succession, depending on the timing, Hercules sometimes crashes somewhere in the cache.c
dasd caching logic when it tries to acquire a lock that was previously destroyed by one of the other sf
threads that happen to be running at the same time.
Sometimes we get something like:
01:03:04.619 hthreads.c(105) HHC90013E 'destroy_lock(&cacheblk[ix].lock)' failed: rc=16: busy; tid=00004208, loc=cache.c:469 01:03:04.619 hthreads.c(118) HHC90014I lock &cacheblk[ix].lock was obtained by thread 00003990 at cache.c:175 01:03:04.619 hthreads.c(105) HHC90013E 'initialize lock(&cacheblk[ix].lock)' failed: rc=17: already init'ed; tid=00004208, loc=cache.c:496 01:03:04.619 hthreads.c(110) HHC90028I lock &cacheblk[ix].lock was already initialized at cache.c:496 01:03:04.619 hthreads.c(118) HHC90014I lock &cacheblk[ix].lock was still held by thread 00003990 at cache.c:175
and then Hercules eventually just unexpectedly exits! That is to say, it doesn't crash, it just unexpectedly EXITS! (i.e. powers off!)
Other times it actually crashes though.
The actual test script to reproduce the bug is much longer than the simple one illustrated above, but the approach was the same: create shadow files, remove shadow files, repeat, over and over again, many, many times. This causes many simultaneous sf
command threads to get created and run at the same time, eventually causing the bug to trip.
I haven't tried to debug the caching logic yet to find the bug, but I have managed to identify a simple workaround: simply don't run the sf
commands asynchronously!
--- hyperion-1/hsccmd.c 2024-01-24 13:18:12.392121600 -0800
+++ hyperion-0/hsccmd.c 2024-02-06 00:06:02.564250300 -0800
@@ -7040,7 +7040,6 @@
U16 lcss; /* Logical CSS */
int flag = 1; /* sf- flag (default merge) */
int level = 2; /* sfk level (default 2) */
-TID tid; /* sf command thread id */
char c; /* work for sscan */
int rc;
@@ -7183,11 +7182,20 @@
/* Process the command */
switch (action)
{
+#if defined( OPTION_NOASYNC_SF_CMDS )
+ case '+': cckd_sf_add ( dev ); break;
+ case '-': cckd_sf_remove( dev ); break;
+ case 'c': cckd_sf_comp ( dev ); break;
+ case 'd': cckd_sf_stats ( dev ); break;
+ case 'k': cckd_sf_chk ( dev ); break;
+#else
+ TID tid;
case '+': if (create_thread( &tid, DETACHED, cckd_sf_add, dev, "sf+ command" )) cckd_sf_add ( dev ); break;
case '-': if (create_thread( &tid, DETACHED, cckd_sf_remove, dev, "sf- command" )) cckd_sf_remove( dev ); break;
case 'c': if (create_thread( &tid, DETACHED, cckd_sf_comp, dev, "sfc command" )) cckd_sf_comp ( dev ); break;
case 'd': if (create_thread( &tid, DETACHED, cckd_sf_stats, dev, "sfd command" )) cckd_sf_stats ( dev ); break;
case 'k': if (create_thread( &tid, DETACHED, cckd_sf_chk, dev, "sfk command" )) cckd_sf_chk ( dev ); break;
+#endif
}
return 0;
The original author (Greg Smith) changed the sf_cmd
function to process the various sf
commands in their own thread back on 1/3/2008 in commit cde906b (although back then the function was called ShadowFile_cmd
). Why he did this I don't know. Perhaps he was thinking that the 'remove', 'comp' or 'chk' functions might take a while, and so in order to prevent a long-running command from preventing the user from being able to enter another command, he decided to process each of them in their own thread.
Personally I don't see any of the sf
commands as likely taking "a long time". Sure, they might not complete immediately, but none of them should take so long as to be a problem IMO. (Others may feel differently of course.)
IN ANY CASE, I am creating this GitHub Issue to record the fact that there is an as-yet an unidentified lock handling bug somewhere in Hercules's dasd caching logic in cache.c
that needs to be found and fixed.
OR...., we could of course implement the workaround permanently instead of temporarily.
FOR THE TIME BEING HOWEVER, I am implementing the proposed temporary workaround, by #defining OPTION_NOASYNC_SF_CMDS
in featall.h
so as to prevent Hercules from potentially crashing until the bug can eventually be found and fixed.
That is all.