2019 Google Landmark Retrieval Challenge Review


chullhwan-song opened this issue 5 years ago · 0 comments

chullhwan-song commented 5 years ago

2018 review

2019

1st/3rd place solution by Team smlyaka

training set
- They tried to remove noise from the provided dataset > cleaned dataset
- Related paper: Large-scale Landmark Retrieval/Recognition under a Noisy and Diverse Dataset
As in 2018, a variety of network models were used
- FishNet-150, ResNet-101, and SEResNeXt-101, trained with cosine-based softmax losses (ArcFace and CosFace).
  - FishNet-150 ? > link to a review post
    - Structurally it feels like a U-Net with an extra encoder appended...
- cosine annealing, GeM pooling, and finetuning at full resolution on the last epoch with freezing Batch Normalization.
  - cosine annealing?
- Question) With networks this deep and big, doesn't finetuning at the original image size run out of GPU memory? In my case it only barely worked after resizing and shrinking the batch size to the extreme (on a K80...). > Is there other know-how for training networks this large??
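On the "cosine annealing?" and GeM pooling items above, a minimal sketch of both in plain Python. The function names and the default p=3 are assumptions for illustration; the slides do not state the exponent or the schedule bounds.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along half a cosine wave."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def gem_pool(channel_activations, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over one channel's spatial activations.

    p = 1 reduces to average pooling (SPoC); a large p approaches max pooling (MAC).
    p = 3 is an assumed default, not a value taken from the write-up.
    """
    clamped = [max(a, eps) for a in channel_activations]
    mean_pow = sum(a ** p for a in clamped) / len(clamped)
    return mean_pow ** (1.0 / p)

# Example: the schedule at the start, middle and end of training.
schedule = [cosine_annealing_lr(s, 100, lr_max=0.1) for s in (0, 50, 100)]
pooled = gem_pool([1.0, 2.0, 3.0])
```

In a real network GeM is applied per channel of the last convolutional feature map, and p can be a learned parameter.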
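For the recognition task, accumulating top-k (k=3) similarity in descriptor space and inliers-count by spatial verification helps a lot. > It seems they pull top-k candidates first and then verify them. > This looks like the recognition (classification) track rather than retrieval; did they enter both?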
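A rough sketch of what "accumulating top-k similarity and inliers-count" could look like. How the two signals are combined is not stated in the slides, so the additive form and `inlier_weight` here are assumptions, and `predict_landmark` is a hypothetical name.

```python
from collections import defaultdict

def predict_landmark(neighbors, k=3, inlier_weight=0.01):
    """Pick a landmark label by summing scores over the top-k neighbours.

    neighbors: list of (label, cosine_similarity, inlier_count) for train
    images retrieved for one query, in any order.
    The score similarity + inlier_weight * inliers is an assumed combination.
    """
    top_k = sorted(neighbors, key=lambda n: n[1], reverse=True)[:k]
    scores = defaultdict(float)
    for label, similarity, inliers in top_k:
        scores[label] += similarity + inlier_weight * inliers
    best = max(scores, key=scores.get)
    return best, scores[best]

# Label "A" wins because two of the three nearest neighbours agree on it.
label, score = predict_landmark(
    [("A", 0.80, 10), ("B", 0.85, 0), ("A", 0.78, 20), ("C", 0.50, 100)]
)
```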
For the retrieval task, re-ranking (figure below, Section 4.1)
- Uses landmark classification information (the train labels)
- With this classifier, both the test and index sets are classified and the predictions are used
No use of PCA/whitening, DBA, QE, Diffusion or Graph Traversal.
Things that didn't work
AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations
Combination of Multiple Global Descriptors for Image Retrieval
Label smoothing
SPoC, MAC, R-MAC, Compact bilinear pooling
Data Augmentation from AutoAugment paper (imagenet setting)
Warmup learning rate scheduling
Under Sampling
Remove distractor by Places365 indoor/outdoor classification
Meta classifier
Gaussian SARE
… and many other methods!

https://www.dropbox.com/s/y3c3ovdiizz59j4/cvpr19_smlyaka_slides.pdf?dl=0

2nd Place and 2nd Place Solution to Kaggle Landmark Recognition and Retrieval Competition 2019

https://arxiv.org/pdf/1906.03990.pdf > published in paper form.

Retrieval method

global descriptor

backbone : ResNet152, ResNet200, SE_ResNeXt152, InceptionV4
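- Is SE_ResNeXt152 just SENet + ResNeXt152?
- arcmargin loss - borrowed from face recognition > losses from the face domain, such as ArcFace, seem to be widely applied.
- The last fc layer is removed, an average pooling layer is applied, and then two fc layers are added.
  - Of these two fc layers, the first is 512-dim and the second corresponds to the 203094 classes of the training set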
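To make the arcmargin (ArcFace-style) loss concrete, a small sketch on top of the 512-dim embedding and the class fc layer described above. The `scale` and `margin` values are illustrative assumptions, not the paper's settings, and the sketch skips the special handling real implementations use when the angle plus margin exceeds pi.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def arc_margin_logits(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Cosine logits with an additive angular margin on the target class."""
    e = l2_normalize(embedding)
    logits = []
    for idx, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if idx == target:
            # The margin makes the target class harder to satisfy.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits

def cross_entropy(logits, target):
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

# Toy case: the embedding is equally close to both classes, so the margin
# pushes the target logit below the other one and the loss exceeds ln(2).
logits = arc_margin_logits([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], target=0)
loss = cross_entropy(logits, 0)
```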
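Cleaning the training set > not done here. Interesting.
448 size, SD
- They consider larger images beneficial for small landmarks.
I naturally assumed metric learning was then applied once more to every model, but judging from Fig. 1, oddly, only ResNet152 seems to have gone through both (my original guess was that after classification training the same model gets another round of metric learning).
- ResNet152
- Npairs loss
- The training set here is the stage 2 training set plus the index set, which is clustered before use.
  - The index set seems to be cleaned through clustering, but is the training set mentioned at all ??
  - 20w+4w : I am not sure what "w" means (probably the Chinese shorthand for 万, i.e. 10,000, which would give about 200k + 40k), but judging from the count above.. about 40k extra classes ? > anyway, this one is trained separately with arcmargin loss
Six deep features in total = arcmargin(ResNet152, ResNet200, SE_ResNeXt152, InceptionV4)+(ResNet152_metric)+arcmargin(ResNet152_train_index)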
NN search to build retrieval system.

local descriptor

Notably, DELF is not used here.
SURF[2] and Hessian-Affine[11, 12] root sift[1] as our local feature method. > If all three are used.. what about the cost..??
- inverted file
  - k-means clustering with 512 centers > 512 feels very coarse to me. It seems feasible only because the deep features already find the candidates very well.
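A toy sketch of the inverted file idea above: each local descriptor is quantized to its nearest k-means center (a visual word), and the index maps each word to the images containing it. The centers are assumed to be given (the paper uses 512); the function names are hypothetical and this is not the authors' implementation.

```python
from collections import Counter, defaultdict

def nearest_center(descriptor, centers):
    """Index of the closest center (visual word) by squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(descriptor, center))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))

def build_inverted_file(image_descriptors, centers):
    """image_descriptors: {image_id: [descriptor, ...]} -> {word: {image_id, ...}}."""
    inverted = defaultdict(set)
    for image_id, descriptors in image_descriptors.items():
        for d in descriptors:
            inverted[nearest_center(d, centers)].add(image_id)
    return inverted

def query_inverted_file(query_descriptors, centers, inverted):
    """Vote for every image that shares a visual word with the query."""
    votes = Counter()
    for d in query_descriptors:
        for image_id in inverted.get(nearest_center(d, centers), ()):
            votes[image_id] += 1
    return votes.most_common()

centers = [[0.0, 0.0], [10.0, 10.0]]  # stand-in for the k-means centers
index = build_inverted_file(
    {"a": [[0.1, 0.2], [9.0, 9.0]], "b": [[0.5, 0.5]]}, centers
)
ranking = query_inverted_file([[9.5, 10.0], [10.5, 9.0]], centers, index)
```

With so few words each posting list is long, which fits the note that this stage only re-ranks candidates the global descriptors already found.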

Recognition model

to train the classification model > 4094044 images and 203094 classes
- ResNet152, InceptionV4
we also use the test and index data sets to retrieve the 4M train set with ResNet152 feature
- If the top 5 contains two categories (each with a max score of 0.85 or higher), the "max voting number" is taken into account

Rank strategy

query expansion (QE) & database augmentation (DBA)
- with classification re-rank & local feature re-rank. > both are applied
Weighted combination of the top N results - how are these weights decided ??
Applied to the top 300 results returned by NN search..
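A sketch of QE/DBA as a weighted combination of a vector with its top-N neighbours. The paper's actual weights are the open question noted above, so the linearly decaying weights below are purely an assumption, and `expand` / `database_augmentation` are hypothetical names.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def expand(vector, database, top_n=2, weights=None):
    """Weighted sum of `vector` and its top_n nearest database vectors, re-normalized."""
    ranked = sorted(database, key=lambda d: dot(vector, d), reverse=True)[:top_n]
    if weights is None:
        # Assumed linearly decaying weights; the write-up does not state them.
        weights = [(top_n - r) / (top_n + 1) for r in range(top_n)]
    combined = list(vector)  # the vector itself gets weight 1
    for w, neighbour in zip(weights, ranked):
        combined = [c + w * x for c, x in zip(combined, neighbour)]
    return l2_normalize(combined)

def database_augmentation(database, top_n=2):
    """DBA: expand every database vector with its own neighbours (excluding itself)."""
    return [
        expand(v, database[:i] + database[i + 1:], top_n)
        for i, v in enumerate(database)
    ]

db = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 0.2], [0.0, 1.0])]
query = [1.0, 0.0]
expanded_query = expand(query, db)      # QE: the query moves toward its neighbours
augmented_db = database_augmentation(db)
```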

Experiments

Retrieval method

Omitted

8th Place Solution: Clova Vision, NAVER/LINE Corp.

[step 0.] Dataset Clustering
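- (Cleaning is applied to train only, not to test/index.)
- This is also an attempt to build clean data. The method itself is not described; they added the stage 1 dataset (gain of 0.01 mAP).
- The stage 2 data is noisier than stage 1, and it also goes through cleaning...
- This does not mean they trained on train.csv only; as listed below, are the test/index sets all used as well??
  - The test/index sets have no labels at all.. were labels assigned through clustering?? I wonder.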

Finally we train our model using the datasets below:
TR1: train v1
TR2: train v1 + test/index cluster v1
TR3: train v1 + test/index cluster v2 (noise cluster)
TR4: train v1 + test/index cluster v1 + train v2 (garbage removed) + test/index cluster v2 (garbage removed),

[step 1.] Learning Representation
- training data: TR1, TR2, TR3, TR4
- backbones: resnet50, resnet101, seresnet50, seresnet101, seresnext50, seresnext101, efficientnet
- losses: xent+triplet, npair+angular
  - xent ??? > cross-entropy (Xent) loss > i.e. the classification loss
- aggregation methods: GeM, SPoC, MAC
  - They mention that GeM works well
  - They also mention that, for lack of time, efficientnet could not be fine-tuned properly(?)
[step 2.] Feature Ensemble
- seresnet50 / SPoC / TR4 / npair+angular
- seresnext50 / SPoC / TR4 / npair+angular
- resnet101 / GeM / TR1 / xent+triplet
- seresnext50 / GeM / TR3 / xent+triplet
- concat, then l2-norm
- There is the sentence "which results in 4 x 7 x 2 x 3 = 168 combinations (or more…) ", which I did not get at first
  - It is the number of experimental configurations
    - 4 (training data types) x 7 (backbones tried) x 2 (losses tried) x 3 (aggregation methods): see the lists above
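The 168 figure is just the size of the Cartesian product of the four option lists in step 1, which can be checked directly (option names copied from the lists above):

```python
from itertools import product

training_sets = ["TR1", "TR2", "TR3", "TR4"]
backbones = ["resnet50", "resnet101", "seresnet50", "seresnet101",
             "seresnext50", "seresnext101", "efficientnet"]
losses = ["xent+triplet", "npair+angular"]
aggregations = ["GeM", "SPoC", "MAC"]

# One configuration per (training set, backbone, loss, aggregation) tuple.
configs = list(product(training_sets, backbones, losses, aggregations))
print(len(configs))  # 4 * 7 * 2 * 3 = 168
```

The four models picked for the ensemble in step 2 are four points in this grid.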
[step 3.] DBA/QE + PCA/Whitening
- DBA/QE : We found 10-nearest neighbors for each data points for this. Based on our model, we got 9~10% mAP increase at stage1.
  - And at stage 2 ??
- PCA/Whitening > 1024 dim
[step 4.] DELF / Diffusion
- They mention that re-ranking was not very effective.
What I personally find odd is that the landmark ROI query introduced in "Detect-to-Retrieve: Efficient Regional Aggregation for Image Search" is reported as not working; I wonder why it does not work.

9th Approach by Visual Recognition Group (VRG) Prague

Similar method to last year (6th place last year): GeM-based features ? https://www.kaggle.com/c/landmark-retrieval-challenge/discussion/58482
backbone : ResNet50, ResNet101, and ResNet152 pre-trained on ImageNet
generalized-mean pooling (GeM), l2 normalization, a fully-connected (FC) layer, and a final l2 normalization. & contrastive or triplet loss > seems the same as last year..
- We randomly sample at most 50 positive pairs per landmark, while the hard negatives are re-mined for each epoch of the training.
- The FC layer is initialized by the parameters of supervised whitening learned on such pairs.
- (Figure from last year's write-up.)
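A minimal sketch of the contrastive loss and the hard-negative selection described above. The margin value is an assumption for illustration, and `mine_hard_negatives` is a hypothetical helper; in the real pipeline the pool descriptors are re-extracted with the current network before each epoch.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(query, other, is_positive, margin=0.7):
    """Pull positives together; push negatives apart until they clear the margin."""
    d = euclidean(query, other)
    if is_positive:
        return 0.5 * d * d
    return 0.5 * max(0.0, margin - d) ** 2

def mine_hard_negatives(query, query_label, pool, num_negatives):
    """Closest descriptors in `pool` whose label differs from the query's.

    pool: list of (descriptor, label).
    """
    negatives = [d for d, label in pool if label != query_label]
    return sorted(negatives, key=lambda d: euclidean(query, d))[:num_negatives]

pool = [([0.1, 0.0], "A"), ([0.2, 0.0], "B"), ([5.0, 5.0], "C"), ([1.0, 0.0], "B")]
hard = mine_hard_negatives([0.0, 0.0], "A", pool, num_negatives=2)
```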
Resized data: with images resized to have the largest dimensions at most equal to 1024 (900 for ResNet152 due to memory restrictions).
They use this code: https://github.com/filipradenovic/cnnimageretrieval-pytorch

Thank you!

I did not try mini-batch mining, in fact, I did not try much more than what there already is in our code online, because I was heavily time-restricted in the past few months. That being said, mini-batch would hardly work, as we use high-resolution images for training and we accumulate gradients per single image per tuple, compute backward, repeat that multiple times (number of tuples in batch) and do one update of weights. So, mini-batch mining would have forward computations that are wasted, and then I am not sure where is the benefit compared to epoch mining (maybe "fresher" descriptors?). One more thing is the small batch size that doesn't promote the mini-batch mining strategy, i.e., we use 5 tuples in a batch, tuple being a query, positive and multiple negatives (depends on memory and network size).
To be very precise, we use the following parameters:

ResNet-50:  image size 1024, number of negatives 5, number of tuples in batch 5
ResNet-101: image size 1024, number of negatives 4, number of tuples in batch 5
ResNet-152: image size 900,  number of negatives 3, number of tuples in batch 5
Going below 3 negatives had bad effects, using image size down to 800 should be fine,
actually even preferable now that I see that new test and index have a maximum image size of 800.
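The "accumulate gradients per tuple, then do one update" scheme in the reply above is what keeps high-resolution training within memory. A toy sketch with a linear model standing in for the network; `tuple_gradient` and `accumulated_step` are hypothetical names, not functions from the linked repository.

```python
def tuple_gradient(weights, sample):
    """Gradient of 0.5 * (w.x - y)^2 for one sample.

    Stands in for the forward/backward pass of a single training tuple.
    """
    x, y = sample
    error = sum(w * xi for w, xi in zip(weights, x)) - y
    return [error * xi for xi in x]

def accumulated_step(weights, batch, lr):
    """Process tuples one at a time, sum their gradients, update weights once."""
    accumulated = [0.0] * len(weights)
    for sample in batch:  # only one tuple is "in memory" at a time
        grad = tuple_gradient(weights, sample)
        accumulated = [a + g for a, g in zip(accumulated, grad)]
    # Single weight update after all tuples in the batch.
    return [w - lr * a for w, a in zip(weights, accumulated)]

new_weights = accumulated_step(
    [0.0, 0.0], [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)], lr=0.5
)
```

The update equals the one a full batch would produce, but peak memory depends on a single tuple rather than on the batch size.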

Descriptor extraction
- multi-scale query : 3 scales (scaling factors of 1, 1/sqrt(2), sqrt(2))
- Build the aggregated feature: l2 normalize, PCA whitening > 2048 dim
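A sketch of the multi-scale extraction step with the three scaling factors listed above. Summing the per-scale descriptors is an assumed aggregation (the write-up only says they are aggregated), `extract` is a hypothetical callable standing in for the network, and the PCA whitening step is left out.

```python
import math

SCALES = (1.0, 1.0 / math.sqrt(2.0), math.sqrt(2.0))

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def multi_scale_descriptor(image, extract, scales=SCALES):
    """Aggregate descriptors of one image extracted at several scales.

    extract(image, scale) -> list of floats; hypothetical stand-in for
    resizing the image by `scale` and running the network with GeM pooling.
    """
    total = None
    for s in scales:
        d = l2_normalize(extract(image, s))
        total = d if total is None else [t + x for t, x in zip(total, d)]
    return l2_normalize(total)

def toy_extract(image, scale):
    # Toy extractor for illustration only: the first component varies with scale.
    return [image[0] * scale, image[1]]

descriptor = multi_scale_descriptor([3.0, 4.0], toy_extract)
```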
Search
- When using our trained networks with cosine similarity and nearest neighbor search the performance is:
  - adding R-MAC trained
  - Funnily enough, ResNet101-GeM is better than ResNet152-GeM, and the off-the-shelf R-MAC is also good

ResNet101-GeM: 0.200 / 0.178
ResNet152-GeM: 0.196 / 0.167
ResNet50-GeM: 0.184 / 0.163

ResNet101-RMAC: 0.181 / 0.150

concat[4096]: 0.213 / 0.186

Graph-based QE and Diffusion
- Explore-Exploit Graph Traversal for Image Retrieval
- Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations

11th : Share your approaches?

The post did not state the rank, so I looked it up on the leaderboard: 11th
Ensemble of 3 models > each has 512 in dim
- inception v3 + Atten&Gem pooling
- David's Xception with VLAD pooling
- SEResNet101 with VLAD pooling
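DBA applied > improves the quality of the image vectors
QE > applied on the concatenated vector of the three models
They ran a variety of experiments; among them GeM and VLAD worked best.
Further improvement with metric learning (Siamese networks).
Their own mistake, they say, was using only the stage 1 data; they could not use stage 2, which made a large difference in performance.
They found a way to remove most distractor images from test + index, but say it did not help at all with DBA + QE. ??
They say QE was very effective, and give an open-source link: https://github.com/fyang93/diffusion
The ANN issue - ? "only that they shouldn’t use ANN at all even for large dataset, which unnecessarily increases computations." Isn't ANN exactly what you use on large datasets?? What is the problem with ANN??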


16th