Best solutions

rank   solution             github        author
3rd    3rd Place Solution   Github code   pudae
4th    4th Place Solution   Github code   Dieter
7th    7th Place Solution   Github code   Guanshuo Xu
8th    8th Place Solution   Github code   Sergei Fironov
11th   11th Place Solution  Github code   Gary, shisu
12th   12th Place Solution  Github code   Arnau Raventós
15th   15th Place Solution  Github code   NguyenThanhNhan
25th   25th Place Solution  Github code   Soonhwan Kwon
29th   29th Place Solution  Github code   zhangboshen
30th   30th Place Solution  Github code   Bac Nguyen
33rd   33rd Place Solution  Github code   Ildoo Kim

Overview

Training
  • Framework: PyTorch
  • Model: DenseNet121
  • Data: Kaggle data, external data
  • Augmentation: horizontal flip, vertical flip, rotate, shear, lighter/darker
  • Normalization: Kaggle data and external data use different mean and std values
  • Optimizer: SGD
  • Loss: binary cross-entropy loss (no class weights)
  • Learning rate: starts at 0.03 and ends at 0.00003
  • Scheduler: multi-step learning rate decay (milestones chosen from experience; see the sketch below)
  • Data imbalance: oversampling
  • Image size: 512
  • Batch size: 8
  • Epochs: 24
  • CV: 5-fold
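A minimal sketch of this training setup. The milestone epochs and the momentum value are assumptions; only the 0.03 -> 0.00003 range, SGD, unweighted BCE, and 24 epochs come from the list above.
import torch
import torchvision

# note: the first conv of the pretrained model must be adapted for the
# 4-channel RGBY input (not shown here)
model = torchvision.models.densenet121(pretrained=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)  # momentum is an assumption
# three gamma=0.1 decays take the learning rate from 0.03 down to 0.00003;
# the milestone epochs are assumptions
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 16, 22], gamma=0.1)
criterion = torch.nn.BCEWithLogitsLoss()  # plain BCE, no class weights
for epoch in range(24):
    # ... train one epoch with criterion, validate on the held-out fold ...
    scheduler.step()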
Prediction
  • Threshold: search for the best threshold for each class on the validation set
  • TTA number: 4
  • TTA augmentation: 2 (horizontal flip) x 2 (vertical flip); see the sketch below
  • Average mean ensemble: 5-fold x 4-TTA x 3-threshold = 60 predictions
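A minimal sketch of the 4-view flip TTA described above. predict_tta is illustrative, model is assumed to map a 4-channel image batch to per-class logits, and torch.flip may need a newer PyTorch than the pinned 0.4.0.
import torch

def predict_tta(model, x):
    # average sigmoid outputs over the 2x2 grid of horizontal/vertical flips
    views = [x,
             torch.flip(x, dims=[-1]),       # horizontal flip
             torch.flip(x, dims=[-2]),       # vertical flip
             torch.flip(x, dims=[-2, -1])]   # both flips
    with torch.no_grad():
        probs = [torch.sigmoid(model(v)) for v in views]
    return torch.stack(probs).mean(dim=0)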
Result
  • Training takes ~35 hours per fold on a GTX 1070
  • Public LB: 0.566
  • Private LB: 0.546 (28th)

Observations

Worked
  • Oversampling is useful
  • External data helps a lot
  • 5 folds improve score by 0.02
  • TTA helps too
  • Putting all images on an SSD instead of an HDD makes training faster.
  • Thresholding is crucial: different thresholds have a great influence on the LB score. The per-class thresholds I searched on the validation set lowered my score at first, so I used a constant threshold (0.15) for a long time. The searched thresholds vary from 0.1 to 0.9 across classes, and I could not find any relationship between how rare a class is and its threshold. I was about to give up on searched thresholds when I found that using a smaller threshold would (though not always) increase the score. So I simply multiply each searched threshold by a factor (~0.4), which makes the model predict more targets and gives higher recall; see the sketch below. Although this may lower the F1 score, it does improve my public/private LB score. I think the reason is that wrong classifications are eliminated by TTA and the ensemble, leaving more true positives and a higher score.
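A self-contained sketch of the scaling trick above; the arrays here are random stand-ins, and only the ~0.4 factor comes from my experiments.
import numpy as np

searched_threshold = np.random.uniform(0.1, 0.9, size=28)  # stand-in for per-class searched values
final_threshold = searched_threshold * 0.4  # smaller thresholds predict more positives -> higher recall
pred_probs = np.random.rand(100, 28)        # stand-in for model output probabilities
pred_labels = pred_probs > final_threshold  # per-class binarization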
Didn't work
  • Weighted BCE loss worked worse for me, and I had no time to make it work better.
  • Ensembling 256x256 with 512x512 predictions lowered my score, so I discarded the results predicted from the smaller images.
  • Leaked data can only improve the public LB; it is not helpful for the private LB.
  • More complex models worked badly, for example ResNet152 and DenseNet161.
Not sure
  • Adam doesn't work well on my model; most likely I didn't find a suitable learning rate.
  • A weighted ensemble may work, but I think it overfits easily to the public LB.
  • Splitting the TIFF images (2048x2048) into patches may be helpful. I learned about this too late, otherwise I would have tried it; see the sketch below.
  • Maybe RGB works better than RGBY.
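I did not try the patch idea above; this is only a minimal sketch of what splitting a 2048x2048 TIFF channel into 512x512 patches could look like.
from PIL import Image

im = Image.new('L', (2048, 2048))  # stand-in for one full-size channel image
patches = [im.crop((x, y, x + 512, y + 512))
           for y in range(0, 2048, 512)
           for x in range(0, 2048, 512)]  # 16 non-overlapping 512x512 patches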

Usage

  • 1. Clone the repository
git clone https://github.com/feifei9099/kaggle_human_protein.git
cd kaggle_human_protein
  • 2. Install requirements
conda create --name kaggle python=3.6
source activate kaggle
pip install numpy==1.15.4 torch==0.4.0 torchvision==0.2.1 scikit-learn==0.20.0 pandas==0.23.4 imgaug==0.2.6 tqdm==4.29.1 pretrainedmodels==0.7.4
conda install -c menpo opencv3 
  • 3. Download the data
kaggle competitions download -c human-protein-atlas-image-classification
python my_utils/download.py
  • 4. Update the config.py file to match your preferences
train_data = "path_to_your_train_data/train/"
test_data = "path_to_your_train_data/test/"
external_data = "path_to_your_train_data/external_data_HPAv18/"
test_csv = "path_to_your_sub_csv/sample_submission.csv"
train_csv = "path_to_your_train_csv/train.csv"
external_csv = "path_to_your_external_csv/external_data_HPAv18.csv"
  • 5. Train your model 5 times (once per CV fold)
python main.py
  • 6. Update the config.py file and rerun main.py to predict
is_train = False
is_test = True
  • 7. Ensemble the submission files
python my_utils/kfold_cross_validation.py

Code Interpretation

  • Properly processing the external data is key to improving the score. The red, green, and blue images are extracted directly from the corresponding channels of the original JPG images and saved as 512x512 grayscale PNG images. The yellow image is obtained by blending the R and G channels of the original (yellow-colored) JPG.
import os
from PIL import Image

im = Image.open(DIR + img_name)
# resize to 512x512 and split the jpg into its R/G/B channels
r, g, b = im.resize(image_size, Image.LANCZOS).split()
if color == 'red':
    im = r
elif color == 'green':
    im = g
elif color == 'blue':
    im = b
else:
    # yellow: the yellow-colored jpg carries its signal in R and G, so blend them
    im = Image.blend(r, g, 0.5)
im.save(DIR + img_name2, 'PNG')
os.remove(DIR + img_name)  # delete the original jpg only after the png is saved
  • The Kaggle set and the external set use different mean and std values, which can be calculated with the following code.
import os
import numpy as np
from PIL import Image
from tqdm import tqdm
import torchvision.transforms as T

T.Normalize([0.0789, 0.0529, 0.0546, 0.0814], [0.147, 0.113, 0.157, 0.148])  # kaggle set
T.Normalize([0.1177, 0.0696, 0.0660, 0.1056], [0.179, 0.127, 0.176, 0.166])  # external set

colors = ['red', 'green', 'blue', 'yellow']
files = np.array(os.listdir(paths))
# estimate the statistics from a random sample of 5000 images
files = files[np.random.choice(len(files), 5000, replace=False)]
mean = []
std = []
for c in colors:
    allim = None
    for f in tqdm(files):
        if f.split('.')[0].split('_')[-1] == c:  # keep only this channel's files
            im = np.array(Image.open(paths + f))  # shape = (512, 512)
            im = np.expand_dims(im, axis=2)
            im = np.divide(im, 255)  # scale pixel values to [0, 1]
            if allim is None:
                allim = im
            else:
                allim = np.concatenate((allim, im), axis=-1)
    mean.append(np.mean(allim))
    std.append(np.std(allim, ddof=1))
  • The oversampling weight is calculated from #class_target/#total_target or math.log(#class_target/#total_target). I use the log weight and store it in the DataFrame's freq column; a hypothetical sketch of this computation follows the sampler code below.
from torch.utils.data import DataLoader, WeightedRandomSampler

# sample rows with replacement according to their per-row 'freq' weights
sampler = WeightedRandomSampler(train_data_list['freq'].values,
                                num_samples=int(len(train_data_list) * config.multiply),
                                replacement=True)
train_loader = DataLoader(train_gen, batch_size=config.batch_size, drop_last=True,
                          sampler=sampler, pin_memory=True, num_workers=6)
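A hypothetical sketch of how the freq column could be built; weighting each row by its rarest label and flipping the sign of the log are assumptions for illustration, not the exact code.
import math
from collections import Counter
import pandas as pd

train_data_list = pd.DataFrame({'Target': ['0 5', '25', '0', '0 2']})  # toy labels
counts = Counter(c for t in train_data_list['Target'] for c in t.split())
total = sum(counts.values())

def row_weight(target_str):
    # -log of the class frequency: rarer classes get larger sampling weights
    return max(-math.log(counts[c] / total) for c in target_str.split())

train_data_list['freq'] = train_data_list['Target'].map(row_weight)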
  • The average mean ensemble code is shown below; it boosted my score by 0.02. Each submission's label set is turned into a 0/1 vector, and a class is kept only if more than half of the submissions predict it.
import numpy as np
import pandas as pd

lg = len(sub_files)  # number of submission files to ensemble
sub = {}
labels = []
for i, file in enumerate(sub_files):
    df = pd.read_csv(sub_path + file)
    df = df.fillna('28')  # class 28 is used as a placeholder for empty predictions
    sub[i] = df
for p in range(len(sample_submission_df)):
    all_target = np.zeros((1, 28))
    for s in range(lg):
        # convert this submission's space-separated labels into a 0/1 vector
        target = list(map(int, sub[s].iloc[p].Predicted.strip().split()))
        target_array = np.zeros((1, 28))
        for n in target:
            if n == 28:  # skip the placeholder class
                continue
            target_array[:, n] = 1
        all_target += target_array
    # majority vote: keep a class only if more than half of the submissions predict it
    all_target = all_target / lg > 0.5
    labels.append(all_target)
  • Search for the best threshold for each class, greedily fixing one class at a time:
import numpy as np
from sklearn.metrics import f1_score

# all_target: (N, 28) ground-truth 0/1 matrix; all_pred: (N, 28) predicted probabilities
thresholds = np.linspace(0, 1, 100)
test_threshold = 0.5 * np.ones(28)
best_threshold = np.zeros(28)
best_val = np.zeros(28)
for i in range(28):
    for threshold in thresholds:
        test_threshold[i] = threshold
        score = f1_score(np.array(all_target), np.array(all_pred) > test_threshold, average='macro')
        if score > best_val[i]:
            best_threshold[i] = threshold
            best_val[i] = score
    print("Threshold[%d] %0.6f, F1: %0.6f" % (i, best_threshold[i], best_val[i]))
    test_threshold[i] = best_threshold[i]  # fix this class's threshold before moving on

For help

  • If you have any questions or suggestions, please tell me!
    Email: scsncfb@126.com