Instructions last updated 2024-02-27. These have not been tested for anything other than comparing exactly two sets of data at a time.
The analysis starts with `onest_analysis.py`, which performs the ONEST and/or CONTEST analyses on the CSV data. Make sure caching is enabled so this run creates the `.npy` files used by the other programs.
These instructions assume you have the commands `git` and `python3.12` installed on your machine. If you do not have `python3.12`, you may download it directly or install it through most package managers. Furthermore, we assume you have a basic understanding of the terminal and navigation therein.
Open a terminal and navigate to the folder you want this repository to be under (for the sake of example, these instructions will call said folder `root`). From `root`, clone this repository with SSH:

```
git clone git@github.com:grepgrok/CONTEST.git
```

or with HTTPS:

```
git clone https://github.com/grepgrok/CONTEST.git
```

This should create the following file structure:

```
root/
└── CONTEST/
    ├── .gitignore
    ├── LICENSE
    ├── README
    └── ...
```
While not strictly necessary, it is highly recommended to run the code in this repository under a virtual environment. If you do not want to use a venv, skip straight to Install Dependencies. For more thorough instructions, see the official venv documentation. Here we give a basic overview of what is needed in a bash shell:

1. Navigate to `root/CONTEST`.
2. Create the virtual environment: run `python3.12 -m venv .venv`
3. Activate it: `source .venv/bin/activate`
4. Check that it was activated properly: `which pip3` should print `path/to/root/CONTEST/.venv/bin/pip3`.
The file structure should resemble the following:
```
root/
└── CONTEST/
    ├── .gitignore
    ├── .venv/
    │   ├── bin/
    │   │   ├── activate
    │   │   ├── pip3
    │   │   ├── python3.12
    │   │   ├── python
    │   │   └── ...
    │   ├── include/
    │   ├── lib/
    │   └── pyvenv.cfg
    ├── LICENSE
    ├── README
    └── ...
```
Note that subsequent file structure diagrams will be rooted in `CONTEST` and only include the files relevant at that point in the instructions (typically excluding `.venv`).

Once you are done executing the code in this repository, you may run the command `deactivate` to deactivate the environment. The environment may be reactivated at any time with the command in step (3) above.
From `root/CONTEST`, install the requirements with the following command:

```
pip3 install -r requirements.txt
```

From here on, subsequent terminal commands will assume the terminal is currently in the `CONTEST` folder.
1. Place the `.csv` data files in the same folder inside `./data/`. For the sake of example, these instructions will call the folder `my_data` and the files `treatment.csv` and `control.csv`.
2. There are a few assumptions made about the datasets that should be ensured ahead of time:
   - They have the same dimensions (e.g. both are 240 cases by 20 observers), with each row being a "case".
   - The files contain only numbers in CSV format. Remove any column and row labels.
   - Remove anything that is otherwise not data on how the observers graded the cases. Remove any data on ground truth, model prediction, treatment type, etc. from the `.csv` files.
3. Create a `./results/my_data` folder.
4. Here is an example of the file system structure so far:

   ```
   CONTEST/
   ├── data/
   │   └── my_data/
   │       ├── treatment.csv
   │       └── control.csv
   ├── results/
   │   └── my_data/
   ├── README
   ├── onest_analysis.py
   └── ...
   ```
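Before running the analysis, it can be worth verifying that the two files satisfy the assumptions above. Here is a minimal sketch using NumPy; the in-memory `StringIO` buffers stand in for the real `./data/my_data/treatment.csv` and `./data/my_data/control.csv` paths:

```python
import io
import numpy as np

# Stand-ins for the real files; in practice, pass the paths under ./data/my_data/.
treatment_csv = io.StringIO("1,2,1\n2,2,2\n1,1,1\n")
control_csv = io.StringIO("1,1,1\n2,1,2\n1,2,1\n")

# loadtxt raises an error if any non-numeric labels were left in the file.
treatment = np.loadtxt(treatment_csv, delimiter=",")
control = np.loadtxt(control_csv, delimiter=",")

# Both datasets must have the same (cases x observers) dimensions.
assert treatment.shape == control.shape
print(treatment.shape)  # (cases, observers)
```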
1. Open `onest_analysis.py` and scroll to the `get_args` function on line 15.
2. Change the `dataset_names` on lines 18 and 19 to the paths to the two files. Be mindful of the fact that Python requires a comma `,` at the end of the first file path on line 18:

   ```
   "dataset_names": [
       "./data/my_data/treatment.csv",
       "./data/my_data/control.csv"
   ]
   ```

3. Adjust the `colors` and `labels` on lines 22 through 29 as desired. These control the color and label associated with `treatment.csv` and `control.csv` on the plot of the analysis. The values under `colors` must be named matplotlib colors. The values under `labels` may be any strings.
4. Set `method` to `onest` or `contest` for the ONEST or CONTEST analysis accordingly. Note that the subsequent analyses below require this to be `contest`.
5. If you would like to plot all manifolds of the analysis, set `describe` on line 31 to `False`. Conversely, setting `describe` to `True` will show only the minimum, maximum, and mean of the envelope in the ONEST method and only the minimum and maximum in the CONTEST method.
6. Make sure `cache` is set to `True`.
7. Advanced: the number of unique manifolds may be adjusted. Change the default value of `unique_curves` on line 220 for ONEST, or `unique_surfaces` on line 265 for CONTEST. Larger numbers may be more accurate but will take longer to compute. See Thoughts and Notes below for commentary on this.
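Putting the settings above together, the configured portion of `get_args` might resemble the following sketch. The exact keys and surrounding code in `onest_analysis.py` may differ; the color and label values here are only examples:

```python
args = {
    "dataset_names": [
        "./data/my_data/treatment.csv",
        "./data/my_data/control.csv",   # note the trailing comma on line 18
    ],
    "colors": ["tab:blue", "tab:orange"],  # must be named matplotlib colors
    "labels": ["Treatment", "Control"],    # may be any strings
    "method": "contest",                   # "onest" or "contest"
    "describe": True,                      # True: plot only the envelope extremes
    "cache": True,                         # required to write the .npy files
}
print(args["method"])
```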
In the terminal, run the following command:

```
python onest_analysis.py
```

This may take some time; feel free to get some coffee.
The file system structure should now look something like this:
```
CONTEST/
├── data/
│   └── my_data/
│       ├── treatment.csv
│       ├── treatment.npy
│       ├── control.csv
│       └── control.npy
├── results/
│   ├── my_data/
│   └── onest.png
├── README
├── onest_analysis.py
└── ...
```
Subsequent analyses especially require the presence of the `.npy` files created by this analysis: `./data/my_data/treatment.npy` and `./data/my_data/control.npy`. They also assume the executed analysis was the CONTEST analysis (`"method": "contest"` above in Step 3.3).
Since running the analysis can take a lot of time, you can re-run the plotting after obtaining the cached `.npy` files by simply replacing `.csv` with `.npy` in the `dataset_names` on lines 18 and 19. The program will skip the analysis step, significantly speeding up graphical analysis.
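The speedup comes from `.npy` being NumPy's binary format, loaded directly without text parsing. A rough illustration of the round trip using in-memory buffers (a hypothetical small array; the real program caches its own computed data):

```python
import io
import numpy as np

# Hypothetical (cases x observers) grade matrix.
data = np.random.default_rng(0).integers(0, 3, size=(240, 20)).astype(float)

# CSV: text that must be parsed on every load.
csv_buf = io.StringIO()
np.savetxt(csv_buf, data, delimiter=",")
csv_buf.seek(0)
from_csv = np.loadtxt(csv_buf, delimiter=",")

# .npy: binary, loaded directly with no parsing.
npy_buf = io.BytesIO()
np.save(npy_buf, data)
npy_buf.seek(0)
from_npy = np.load(npy_buf)

# Both formats reproduce the same array.
assert np.array_equal(from_csv, from_npy)
```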
As described in the [original paper][onest-paper], the ONEST/CONTEST manifolds are (theoretically) random and unique permutations of the observers (and cases), so for 20 observers and 240 cases the number of possible manifolds is astronomically large (see `onest_analysis.py:102`). Also, if the number of unique manifolds is set greater than the factorial of one less than the number of observers, the code is liable to enter an infinite loop trying to find the next set of observers for a surface.
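To get a sense of the scale involved, the number of distinct orderings of n observers grows factorially, so exhausting them is infeasible for even modest panels:

```python
import math

# Orderings of a 20-observer panel: far more than will ever be sampled.
print(math.factorial(20))  # 2432902008176640000

# The factorial of one less than the number of observers; requesting more
# unique manifolds than this is what risks the infinite loop noted above.
print(math.factorial(19))
```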
The graphing of the CONTEST analysis can be very odd and glitchy. This is a documented artifact in matplotlib, and switching to Mayavi has been suggested (we have not made this switch).
- Walk through the data folder for the data
- Do better `get_args`
- Automatically create CONTEST surfaces in `alpha.py:get_data`
- Choose a consistent way to convert a list to an ndarray (note when another is necessary): `np.array`, `np.asarray`, `np.empty` -> fill
- Choose a consistent way to execute a function over an ndarray (`alpha.py:289`, `alpha.py:144`, `alpha.py:144`)
- Choose a consistent way to identify assisted/unassisted or treatment/control
- Choose a consistent style of docstring
- Figure out a consistent style of execution workflow (or decide to give up on it)
- Add detailed docstrings with parameter and return types
- Make sure everything sends to the same results directory
- Add notes about which sample data files to use in comparison
- Write up instructions on running some data from the beginning
- Get dad to run PDL1 from the instructions I write up
We calculate the OPA (overall percent agreement) as the proportion of observer agreements to the total number of cases. There may be multiple ways to calculate this; the FDA discusses overall percent agreement in a 2-class positive vs. negative context.
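As a sketch of that definition, assuming "agreement" on a case means every observer gave it the same grade (the exact rule used in the code may differ):

```python
import numpy as np

def overall_percent_agreement(grades: np.ndarray) -> float:
    """OPA = (cases where every observer agrees) / (total cases).

    `grades` is a (cases x observers) array, one row per case.
    """
    # Compare every observer's grade against the first observer's.
    all_agree = (grades == grades[:, :1]).all(axis=1)
    return float(all_agree.mean())

# Hypothetical 4-case, 3-observer example.
grades = np.array([
    [1, 1, 1],  # agreement
    [1, 2, 1],  # disagreement
    [0, 0, 0],  # agreement
    [2, 2, 2],  # agreement
])
print(overall_percent_agreement(grades))  # 0.75
```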