Kaggle Freesound Audio Tagging 2019 2nd place code

Usage

  • Download the datasets and place them in the input folder.

  • Unzip train_curated.zip and train_noisy.zip, then put all the audio clips into the audio_train folder.

  • sh run.sh

Requirements

tensorflow_gpu==1.11.0
numpy==1.14.2
tqdm==4.22.0
librosa==0.6.3
scipy==1.0.0
iterative_stratification==0.1.6
Keras==2.1.5
pandas==0.24.2
scikit_learn==0.21.2

Hardware

  • 64GB of RAM
  • 1 Tesla P100 GPU

Solution

single model CV: 0.89763

ensemble CV: 0.9108

feature engineering

  • log mel, shape (441, 64) (time, mels); a librosa sketch follows the code below
  • global features, shape (128, 12): split the clip evenly into 128 frames and compute 12 statistics per frame (local CV +0.005); see get_global_feat below
  • clip length
import numpy as np

def get_global_feat(x, num_steps=128):
    """Split the waveform into num_steps overlapping windows and compute
    12 summary statistics (mean, std, range, 9 percentiles) per window."""
    stride = len(x) / num_steps
    ts = []
    for s in range(num_steps):
        i = s * stride
        wl = max(0, int(i - stride / 2))  # window extends half a stride to the left
        wr = int(i + 1.5 * stride)        # ... and half a stride to the right
        local_x = x[wl:wr]
        percent_feat = np.percentile(local_x, [0, 1, 25, 30, 50, 60, 75, 99, 100]).tolist()
        range_feat = local_x.max() - local_x.min()
        ts.append([np.mean(local_x), np.std(local_x), range_feat] + percent_feat)
    ts = np.array(ts)
    assert ts.shape == (num_steps, 12), (len(x), ts.shape)
    return ts
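
For reference, the (441, 64) log mel shape corresponds to a 5 s clip at 44.1 kHz. A minimal offline sketch with librosa (the hop length of 500 is an assumption chosen to yield roughly 441 frames; in the actual models the Melspectrogram layer shown below computes the log mel inside the network):

import numpy as np
import librosa

def get_log_mel(y, sr=44100, n_fft=1024, hop_length=500, n_mels=64):
    # 5 s at 44.1 kHz with hop 500 -> ~441 frames, 64 mel bins
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length,
                                         n_mels=n_mels, power=2.0)
    return librosa.power_to_db(mel).T  # (time, mels)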

preprocess

  • audio clips are first trimmed of leading and trailing silence
  • randomly select a 5 s clip from each audio clip (see the sketch below)
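
A minimal sketch of this preprocessing (the top_db threshold and the zero-padding for short clips are assumptions; the repo's exact values may differ):

import numpy as np
import librosa

def preprocess_clip(y, sr=44100, duration=5.0, top_db=60):
    # trim leading/trailing silence (top_db is an assumed threshold)
    y, _ = librosa.effects.trim(y, top_db=top_db)
    target = int(sr * duration)
    if len(y) >= target:
        # randomly select a 5 s window
        start = np.random.randint(0, len(y) - target + 1)
        y = y[start:start + target]
    else:
        # randomly pad short clips with zeros up to 5 s
        pad = target - len(y)
        left = np.random.randint(0, pad + 1)
        y = np.pad(y, (left, pad - left), mode='constant')
    return y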

model

For details, please refer to code/models.py

  • Melspectrogram layer (code from kapre; we use it to search the log mel hyperparameters end to end)
  • Our main model is a 9-layer CNN. Since the two axes of the log mel feature have different physical meanings (time vs. frequency), we replace the usual max/average pooling with max pooling along one axis and average pooling along the other. (Our local CV gained a lot from this, but we have forgotten the exact number.)
  • global pooling: pixel shuffle + max pooling along the time axis + average pooling along the mel axis
  • SE block (used in several of our models; a sketch follows the code below)
  • highway + 1×1 conv (also used in several of our models)
  • label smoothing
# log mel layer (kapre); n_dft is the kernel size and n_hop the stride
x_mel = Melspectrogram(n_dft=1024, n_hop=cfg.stride, input_shape=(1, K.int_shape(x_in)[1]),
                       padding='same', sr=44100, n_mels=64,
                       power_melgram=2, return_decibel_melgram=True,
                       trainable_fb=False, trainable_kernel=False,
                       image_data_format='channels_last', trainable=False)(x_in)
# pooling mode: average pool along one axis, max pool along the other,
# since the time and mel axes have different physical meanings
x = AveragePooling2D(pool_size=(pool_size1, 1), padding='same', strides=(stride, 1))(x)
x = MaxPool2D(pool_size=(1, pool_size2), padding='same', strides=(1, stride))(x)
# model head: fold the mel axis into channels ("pixel shuffle"),
# then max pool over time and average pool over mel
def pixelShuffle(x):
    _, h, w, c = K.int_shape(x)
    bs = K.shape(x)[0]
    assert w % 2 == 0
    # (bs, time, mel, c) -> (bs, time, mel/2, c*2)
    x = K.reshape(x, (bs, h, w // 2, c * 2))
    return x

x = Lambda(pixelShuffle)(x)
x = Lambda(lambda x: K.max(x, axis=1))(x)   # max pooling over the time axis
x = Lambda(lambda x: K.mean(x, axis=1))(x)  # average pooling over the mel axis
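
A minimal SE block sketch, assuming a standard squeeze-and-excitation design with an assumed reduction ratio of 8 (see code/models.py for the actual implementation):

from keras.layers import GlobalAveragePooling2D, Dense, Reshape, Multiply
import keras.backend as K

def se_block(x, ratio=8):
    c = K.int_shape(x)[-1]
    # squeeze: global average over the time/mel axes
    s = GlobalAveragePooling2D()(x)
    # excitation: bottleneck MLP produces per-channel weights in (0, 1)
    s = Dense(c // ratio, activation='relu')(s)
    s = Dense(c, activation='sigmoid')(s)
    s = Reshape((1, 1, c))(s)
    # rescale the feature map channel-wise
    return Multiply()([x, s])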

data augmentation

  • mixup (local CV +0.002, LB +0.008; see the sketch below)
  • randomly select a 5 s clip + random padding
  • 3× TTA (test-time augmentation)
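
A minimal mixup sketch on a batch of features and multi-hot labels (alpha=1.0 is an assumption; the repo's value may differ):

import numpy as np

def mixup(x, y, alpha=1.0):
    # mix each sample with a randomly permuted partner in the batch
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix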

pretrain

  • train a model on train_noisy only and use it as the pretrained model (a sketch follows)
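
A minimal sketch of that flow (build_model, the data arrays, and the epoch counts are placeholders; see run.sh for the actual pipeline):

def pretrain_then_finetune(build_model, x_noisy, y_noisy, x_curated, y_curated):
    model = build_model()
    model.fit(x_noisy, y_noisy, epochs=50)       # pretrain on train_noisy only
    model.save_weights('pretrained_noisy.h5')

    model = build_model()
    model.load_weights('pretrained_noisy.h5')    # warm start from the noisy model
    model.fit(x_curated, y_curated, epochs=100)  # then train on the curated data
    return model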

ensemble

For details, please refer to code/ensemble.py

  • We use a neural network for stacking: a LocallyConnected1D layer learns per-class ensemble weights, then a fully connected layer learns label correlations, with a few initialization and weight-constraint tricks.
import numpy as np
import keras.backend as K
from keras.layers import Input, LocallyConnected1D, Flatten, Dense, Lambda, Activation
from keras.models import Model
from keras.initializers import Identity
from keras.optimizers import Nadam

def stacker(cfg, n):
    # n = number of stacked models; 80 = number of classes
    def kinit(shape, name=None):
        # initialize the per-class weights to select the last model's predictions
        value = np.zeros(shape)
        value[:, -1] = 1
        return K.variable(value, name=name)

    x_in = Input((80, n))
    # learn one ensemble weight per class per model
    # (normNorm is a custom weight constraint; see code/ensemble.py)
    x = LocallyConnected1D(1, 1, kernel_initializer=kinit,
                           kernel_constraint=normNorm(1), use_bias=False)(x_in)
    x = Flatten()(x)
    # identity-initialized dense layer learns label correlations
    x = Dense(80, use_bias=False, kernel_initializer=Identity(1))(x)
    # shifted tanh maps the output into (0, 1)
    x = Lambda(lambda x: x - 1.6)(x)
    x = Activation('tanh')(x)
    x = Lambda(lambda x: (x + 1) * 0.5)(x)

    model = Model(inputs=x_in, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer=Nadam(lr=cfg.lr))
    return model
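
In use, the stacker would be fit on out-of-fold predictions stacked along the last axis; a hedged usage sketch (oof_preds, y_true, and the epoch count are placeholders):

import numpy as np

# oof_preds: list of n arrays of shape (n_samples, 80), one per model
x_stack = np.stack(oof_preds, axis=-1)     # -> (n_samples, 80, n)
model = stacker(cfg, n=x_stack.shape[-1])
model.fit(x_stack, y_true, epochs=30)      # epoch count is an assumption
blend = model.predict(x_stack)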