The goal of this project is to create an algorithm that identifies metastatic cancer in small image patches taken from larger digital pathology scans. The data used is a slightly modified version of the PatchCamelyon (PCam) benchmark data set.
The data set was downloaded from https://www.kaggle.com/c/histopathologic-cancer-detection/data
- The data frame description shows 220,025 observations, and every 'id' value is unique.
- There are 2 columns: 'id' holds the identifier that matches an image in the 'train' folder, and 'label' is a binary integer where 0 means false (non-cancerous) and 1 means true (cancerous).
- Currently, 'id' has the object data type, which we can convert to a string type. 'label' is currently an integer, which we can convert to a factor (categorical) type.
- Of the 2 folders in the downloaded data set, the 'train' folder contains 220,025 images, corresponding to the ids in the data frame, and the 'test' folder contains 57,458 images.
- The value counts of 'label' show more non-cancerous (0) labels than cancerous (1) labels: 130,908 versus 89,117, or roughly 59.5% versus 40.5% of the 220,025 observations.
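
The checks listed above can be reproduced with a short pandas sketch. This is an illustration rather than the exact code used for this project: the helper names `prepare_labels`, `summarize_labels`, and `count_images` are hypothetical, the `.tif` image extension and the `train_labels.csv` file name are assumptions about the download, and the small `toy` frame is made-up data used only so the example runs without the real files.

```python
from pathlib import Path

import pandas as pd


def prepare_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'id' as a string dtype and 'label' as a categorical."""
    out = df.copy()
    out["id"] = out["id"].astype("string")
    out["label"] = out["label"].astype("category")
    return out


def summarize_labels(df: pd.DataFrame) -> dict:
    """Collect the row count, id uniqueness, and class balance of 'label'."""
    counts = df["label"].value_counts()
    return {
        "n_rows": len(df),
        "ids_unique": bool(df["id"].is_unique),
        "counts": {int(k): int(v) for k, v in counts.items()},
        # Cast back to int so the comparison works for categorical labels too.
        "positive_fraction": float((df["label"].astype(int) == 1).mean()),
    }


def count_images(folder: str, pattern: str = "*.tif") -> int:
    """Count image files in a folder (the extension is an assumption)."""
    return sum(1 for _ in Path(folder).glob(pattern))


# Made-up stand-in for the real label table, which could be loaded with
# something like pd.read_csv("train_labels.csv") (file name assumed).
toy = pd.DataFrame(
    {"id": ["a1", "b2", "c3", "d4", "e5"], "label": [0, 0, 1, 0, 1]}
)

prepared = prepare_labels(toy)
summary = summarize_labels(prepared)
print(summary)
```

On the real data, `summary["n_rows"]` and `count_images("train")` would be expected to agree if every id in the data frame has a matching image in the 'train' folder.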