LLNL/scr

SCR release testing for v3.1


@hariharan-devarajan has volunteered to do some testing on Corona.

He is also planning to create some documentation in: https://github.com/LLNL/scr/tree/develop/doc-dev/rst/developers

Here are my findings.

For the serial tests:

The following tests passed:
        serial_test_api_restart
        serial_test_api_shared_file_restart
        serial_test_config_restart
        serial_test_api_multiple_restart
        serial_test_ckpt_restart

71% tests passed, 2 tests failed out of 7

Total Test time (real) =  51.81 sec

The following tests FAILED:
         12 - serial_test_ckpt_F_restart (Failed)
         14 - serial_test_ckpt_F90_restart (Failed)
Errors while running CTest

Output of the failed tests:

test 12
    Start 12: serial_test_ckpt_F_restart
                                         
12: Test command: /usr/workspace/haridev/scr/build/examples/run_test.sh "srun" "-t 5 -N 1 -n 1" "./test_ckpt_F" "restart"
12: Test timeout computed to be: 1500
12: Run: srun -t 5 -N 1 -n 1 ./test_ckpt_F
12: At line 54 of file /usr/workspace/haridev/scr/examples/test_ckpt.F
12: Fortran runtime error: Cannot open file '/usr/WS2/haridev/scr/build/examples/timestep.8/rank_00000.ckpt': No such file or directory
12:
12: Error termination. Backtrace:
12: #0  0x15555183b171 in ???
12: #1  0x15555183bd19 in ???
12: #2  0x15555183c521 in ???
12: #3  0x155551a40288 in ???
12: #4  0x155551a4058c in ???
12: #5  0x4028f5 in test_ckpt_f
12:     at /usr/workspace/haridev/scr/examples/test_ckpt.F:54
12: #6  0x40407d in main
12:     at /usr/workspace/haridev/scr/examples/test_ckpt.F:162
12: flux-job: task(s) exited with exit code 2
12: mv: cannot stat '.scr': No such file or directory
6/7 Test #12: serial_test_ckpt_F_restart ............***Failed    5.16 sec


test 14
    Start 14: serial_test_ckpt_F90_restart

14: Test command: /usr/workspace/haridev/scr/build/examples/run_test.sh "srun" "-t 5 -N 1 -n 1" "./test_ckpt_F90" "restart"
14: Test timeout computed to be: 1500
14: Run: srun -t 5 -N 1 -n 1 ./test_ckpt_F90
14: At line 55 of file /usr/workspace/haridev/scr/examples/test_ckpt.F90
14: Fortran runtime error: Cannot open file '/usr/WS2/haridev/scr/build/examples/timestep.8/rank_00000.ckpt': No such file or directory
14:
14: Error termination. Backtrace:
14: #0  0x15555183b171 in ???
14: #1  0x15555183bd19 in ???
14: #2  0x15555183c521 in ???
14: #3  0x155551a40288 in ???
14: #4  0x155551a4058c in ???
14: #5  0x4028f5 in test_ckpt_f90
14:     at /usr/workspace/haridev/scr/examples/test_ckpt.F90:55
14: #6  0x40407d in main
14:     at /usr/workspace/haridev/scr/examples/test_ckpt.F90:158
14: flux-job: task(s) exited with exit code 2
14: mv: cannot stat '.scr': No such file or directory
7/7 Test #14: serial_test_ckpt_F90_restart ..........***Failed    5.10 sec

For the parallel tests:

The following tests passed:
        parallel_test_api_restart
        parallel_test_api_shared_file_restart
        parallel_test_config_restart
        parallel_test_api_multiple_restart
        parallel_test_ckpt_restart
        parallel_test_ckpt_F90_restart

86% tests passed, 1 tests failed out of 7

Total Test time (real) =  59.50 sec

The following tests FAILED:
         13 - parallel_test_ckpt_F_restart (Failed)
Errors while running CTest

Output of the failed test:


13/15 Testing: parallel_test_ckpt_F_restart
13/15 Test: parallel_test_ckpt_F_restart
Command: "/usr/workspace/haridev/scr/build/examples/run_test.sh" "srun" "-t 5 -N 4 -n 4" "./test_ckpt_F" "restart"
Directory: /usr/workspace/haridev/scr/build/examples
"parallel_test_ckpt_F_restart" start time: Jun 05 12:04 PDT
Output:
----------------------------------------------------------
Run: srun -t 5 -N 4 -n 4 ./test_ckpt_F 
At line 54 of file /usr/workspace/haridev/scr/examples/test_ckpt.F
Fortran runtime error: Cannot open file '/usr/WS2/haridev/scr/build/examples/timestep.8/rank_00000.ckpt': No such file or directory

Error termination. Backtrace:
At line 54 of file /usr/workspace/haridev/scr/examples/test_ckpt.F (unit = 1)
Fortran runtime error: Cannot open file '/usr/WS2/haridev/scr/build/examples/timestep.8/rank_00001.ckpt': No such file or directory

At line 54 of file /usr/workspace/haridev/scr/examples/test_ckpt.F (unit = 2)
At line 54 of file /usr/workspace/haridev/scr/examples/test_ckpt.F (unit = 3)
Error termination. Backtrace:
Fortran runtime error: Cannot open file '/usr/WS2/haridev/scr/build/examples/timestep.8/rank_00002.ckpt': No such file or directory
Fortran runtime error: Cannot open file '/usr/WS2/haridev/scr/build/examples/timestep.8/rank_00003.ckpt': No such file or directory


Error termination. Backtrace:
Error termination. Backtrace:
#0  0x15555183b171 in ???
#1  0x15555183bd19 in ???
#0  0x15555183b171 in ???
#1  0x15555183bd19 in ???
#2  0x15555183c521 in ???
#2  0x15555183c521 in ???
#3  0x155551a40288 in ???
#4  0x155551a4058c in ???
#3  0x155551a40288 in ???
#5  0x4028f5 in test_ckpt_f
#0  0x15555183b171 in ???
	at /usr/workspace/haridev/scr/examples/test_ckpt.F:54
#4  0x155551a4058c in ???
#6  0x40407d in main
#1  0x15555183bd19 in ???
	at /usr/workspace/haridev/scr/examples/test_ckpt.F:162
#5  0x4028f5 in test_ckpt_f
#2  0x15555183c521 in ???
#3  0x155551a40288 in ???
#4  0x155551a4058c in ???
#0  0x15555183b171 in ???
#1  0x15555183bd19 in ???
#2  0x15555183c521 in ???
	at /usr/workspace/haridev/scr/examples/test_ckpt.F:54
#3  0x155551a40288 in ???
#6  0x40407d in main
#5  0x4028f5 in test_ckpt_f
#4  0x155551a4058c in ???
	at /usr/workspace/haridev/scr/examples/test_ckpt.F:162
	at /usr/workspace/haridev/scr/examples/test_ckpt.F:54
#5  0x4028f5 in test_ckpt_f
#6  0x40407d in main
	at /usr/workspace/haridev/scr/examples/test_ckpt.F:54
	at /usr/workspace/haridev/scr/examples/test_ckpt.F:162
#6  0x40407d in main
	at /usr/workspace/haridev/scr/examples/test_ckpt.F:162
flux-job: task(s) exited with exit code 2
<end of output>
Test time =   5.70 sec
----------------------------------------------------------
Test Failed.
"parallel_test_ckpt_F_restart" end time: Jun 05 12:04 PDT
"parallel_test_ckpt_F_restart" time elapsed: 00:00:05
----------------------------------------------------------

@gonsie The Fortran tests are failing. I don't know the language well enough to look into the issues. Do you want me to give it a try anyway?

Oh, wait, they do work. Not as part of the full suite, but if I clean up all the directories and re-run just the Fortran tests, they pass :)
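For anyone who wants to repeat this, here is a rough sketch of the "clean up and re-run only the Fortran tests" step. It is an illustration, not the project's own tooling: the helper names are hypothetical, and the directory names (`timestep.*` and `.scr`) are only inferred from the paths in the logs above, so the real set of leftovers may differ. The only `ctest` options used are `-R` (select tests by regular expression) and `--output-on-failure`.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def remove_stale_dirs(examples_dir):
    """Delete leftover checkpoint directories; return the removed names, sorted.

    Assumption: stale state lives in `timestep.*` directories and in `.scr`,
    which are the names that appear in the failing test output.
    """
    root = Path(examples_dir)
    candidates = list(root.glob("timestep.*")) + [root / ".scr"]
    removed = []
    for path in candidates:
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path.name)
    return sorted(removed)


def fortran_ctest_command():
    """Build a ctest command limited to tests whose names contain test_ckpt_F."""
    # The pattern matches the *_test_ckpt_F_restart and *_test_ckpt_F90_restart
    # names from the logs, but not *_test_ckpt_restart.
    return ["ctest", "-R", "test_ckpt_F", "--output-on-failure"]


if __name__ == "__main__":
    # Usage: python rerun_fortran.py <build-dir>/examples
    # Assumption: ctest is run from the same directory that holds the leftovers.
    examples = sys.argv[1] if len(sys.argv) > 1 else "."
    print("removed:", remove_stale_dirs(examples))
    subprocess.run(fortran_ctest_command(), cwd=examples, check=False)
```

The cleanup and the command construction are kept separate so the cleanup can be checked on a scratch directory before pointing it at a real build tree.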

I can confirm all these tests run on Corona.

@gonsie @mcfadden8 Do you want me to test anything else? Both the serial and parallel tests work on Corona.

@hariharan-devarajan Thanks for the testing. Now that #537 is closed, please run the suite one more time.

@gonsie I just tested it and it works.

Thanks!