system_tests: A Python repository from ChristinaLK

This directory is intended for the creation of a testing suite designed to gain insight about the compatibilities of various GPU parameters,such as compute capabilitie, CUDA libraries, ML frameworks, etc. The goal is to design an automatic test suite that outlines which parameters are valid on CHTC's system, to inform researchers of proper parameter groupings.

02/18/22 Notes:

The parameters that warrant the most immediate testing due to their frequency of use include:
- Compute Capability
- Cuda Library Version
- Various Frameworks (such as TF, PyTorch, etc)

To design a test suite, I should have a list of the valid parameter choices, such as what compute capabilites etc actually make sense.

Compute Capabilities:
X.Y where X is the major version and Y is the minor version. In general it looks like minor versions are not incremented by any specific value. Below is a list of the major version and which architecture they run:

1.Y -- Tesla Architecture (introduced 2007) -- Not supported after lib 7.0
2.Y -- Fermi Architecture (introduced 2010) -- Not supported after lib 9.0
3.Y -- Kepler Architecture (introduced 2012)
5.Y -- Maxwell Architecture (introduced 2014)
6.Y -- Pascal Architecture (introduced 2016)
7.Y -- Volta Architecture (introduced 2017)
7.5 -- Turing Architecture (introduced 2018)
8.Y -- Ampere Architecture (introduced 2020)

Cuda Library Versions:
A list of all valid library versions found below.

1.0-1.1, 2.0-2.3, 3.0-3.2, 4.0-4.2, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0-9.2, 10.0-10.2, 11.0.0-11.0.3, 11.1.0-11.1.1, 11.2.0-11.2.2, 11.3.0-11.3.1, 11.4.0-11.4.4, 11.5.0-11.5.2, 11.6.0

TF library versions:

1.0-1.15, 2.0-2.7

PyTorch library Versions:
https://github.com/pytorch/pytorch/wiki/PyTorch-Versions
0.3.0-0.3.1, 0.4.0-0.4.1, 1.0.0, 1.1.0, 1.2.0, 1.3.0-1.3.1, 1.4.0, 1.5.0-1.5.1, 1.6.0, 1.7.0-1.7.1, 1.8.0-1.8.1, 1.9.0-1.9.1, 1.10.0-1.10.1

2/22/22 Notes:
Currently working on python file for automatically generating submit files that can be used later for determining whether a job has run successfully

Final update for the day is that I'm struggling to get conda to work inside of each job. Under the current configuration, I'm installing conda inside each job and then using an environment.yml file to build the conda env. Running conda activate envname is erroring out because CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

$ conda init <SHELL_NAME>

However, running conda init bash before that requires a bash restart before changes can take effect. Will have to solve at a later point

2/23/22 Notes:
Conda environment is now working, and one of the final things to fix is the cuda compute capability range not working.
For some reason, when we do (7 <= Target.CUDAComputeCapability <= 8), it only grabs the first value

- Now working on automatic file checking to determine if success or not. The idea is to check the error file and the output file for each combo. If the output file ends with "success" and the error file is empty, we have success.
- Done for the day. For next time, should adjust wait time (group says ~25 hours might be most reasonable). Should also debug why my valid combos aren't getting picked up and written to my file by the post process stuff. Also need to implement true test of the various cuda library versions and figure out how to scale across framework versions

02/25/22 Notes:
Doing some debugging. Goal for the day is to get valid combos to print to file, then to fix cuda lib versioning and TF versioning.

-- One note, before I had tried to use whether or not the _err.err file had any length as a check for whether or not everything ran correctly. It appears that I can't do this, because files that do run correctly output some conda file handling info in this file. Reaching out on slack to see if anyone can explain when/how Condor decides to choose err.err or out.out

ChristinaLK/system_tests