FOTS: Fast Oriented Text Spotting with a Unified Network

Question

Opened this issue 5 years ago · 1 comments

Answer 1 · 2019-03-05T09:21:28.000Z

Abstract

Incidental scene text spotting (detecting and reading text that happens to appear in a scene, e.g. in casually taken camera shots) is considered one of the most difficult and valuable challenges.
Most prior work treats detection and recognition as separate tasks.
This work proposes an end-to-end network > Fast Oriented Text Spotting (FOTS).
- simultaneous detection and recognition, sharing computation and visual information among the two complementary tasks.
In particular, RoIRotate is introduced to share convolutional features between detection and recognition.
Advantages of the computation-sharing strategy:
- little computation overhead compared with the baseline text detection network
- joint training performs better than handling detection and recognition separately.
Outperforms prior methods on the ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 datasets
fast speed - 22.6 fps

end-to-end trainable framework
- fast oriented text spotting > detects rotated text, and does so very quickly
- sharing convolutional features > between detection and recognition > real-time speed & little computation overhead
Introduces the RoIRotate operator
- a new differentiable operator for extracting oriented text regions from convolutional feature maps
- unifies detection and recognition in an end-to-end pipeline.
outperforms prior methods on text detection

FOTS - end-to-end trainable framework
- four parts : shared convolutions, text detection, RoIRotate, text recognition

Overall structure
shared convolutions
- ResNet-50 backbone
- U-Net-like structure
- output feature maps are 1/4 of the input resolution > not 1/2.
The output of text detection is fed to RoIRotate.
- converts corresponding shared features into fixed-height representations while keeping the original
  region aspect ratio. - presumably so they can be used as input to the CRNN.
Finally, the recognition stage is CNN-LSTM-CTC - almost the same as CRNN
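
The CTC step at the end of the recognition branch can be illustrated with greedy decoding: take the best class per time step, collapse consecutive repeats, then drop blanks. This is only a minimal sketch. The blank index `0` and the alphabet `CHARSET` are assumptions for illustration, not values from the paper.

```python
import numpy as np

BLANK = 0          # assumption: class 0 is the CTC blank
CHARSET = "-abc"   # hypothetical alphabet; position 0 is a placeholder for blank


def ctc_greedy_decode(logits: np.ndarray) -> str:
    """Greedy CTC decoding of (T, C) per-time-step class scores."""
    best = logits.argmax(axis=1)      # best class at each time step
    chars = []
    prev = BLANK
    for k in best:
        # Emit a character only when it is not blank and differs
        # from the class predicted at the previous time step.
        if k != BLANK and k != prev:
            chars.append(CHARSET[k])
        prev = k
    return "".join(chars)


# One-hot scores for the path a a - a b b - c
path = [1, 1, 0, 1, 2, 2, 0, 3]
print(ctc_greedy_decode(np.eye(4)[path]))
```

The blank between the two runs of `a` is what lets CTC output a doubled character; without it the repeats would collapse into one.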

Inspired by EAST & Deep Direct Regression for Multi-Oriented Scene Text Detection
- the two works differ in network architecture, but both are FCN-based
Natural scenes contain many small text instances, so when upscaling, feature maps go from 1/32 back up to 1/4 of the input size (mirroring the downscaling path) > Fig.3
- in shared convolutions.
After that,
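- first channel > dense per-pixel predictions > text or not?
- next 4 channels > as in EAST, pixels inside the shrunk text region are the positive samples > to predict the bounding box, each positive pixel predicts its distances to the (top, bottom, left, right) sides
- last channel > predicts the orientation of the bounding box > how far the word-level text is tilted
- then the predictions are merged and NMS is applied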
loss : text classification & bounding box regression
Apart from the network (which is also very similar), the prediction/loss for text detection seems to be taken from EAST as-is
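
To make the per-pixel geometry output concrete, here is a rough sketch of how one positive pixel could be decoded into a rotated box. The channel order (score, top, bottom, left, right, angle), the rotation sign convention, the threshold, and the stride of 4 are assumptions for illustration and may not match the authors' code.

```python
import numpy as np


def decode_rbox(x, y, top, bottom, left, right, theta):
    """Return the 4 corners, shape (4, 2), of the box predicted at pixel (x, y)."""
    # Corners in the box-aligned frame, relative to the pixel
    # (image convention: y grows downwards).
    local = np.array(
        [[-left, -top], [right, -top], [right, bottom], [-left, bottom]],
        dtype=float,
    )
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])  # assumed sign convention for theta
    return local @ rot.T + np.array([x, y], dtype=float)


def decode_map(pred, score_thresh=0.5, stride=4):
    """pred: (6, H, W) map = score, 4 distances, angle (assumed channel order)."""
    score, dists, angle = pred[0], pred[1:5], pred[5]
    ys, xs = np.nonzero(score > score_thresh)
    # Feature-map coordinates are scaled back to input-image pixels.
    return [
        decode_rbox(x * stride, y * stride, *dists[:, y, x], angle[y, x])
        for y, x in zip(ys, xs)
    ]
```

In a full pipeline the boxes returned by `decode_map` would still overlap heavily, which is why NMS follows.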

The regions obtained from text detection, i.e. rotated text regions, are aligned (made horizontal) to obtain axis-aligned feature maps > Fig.4
- with a fixed output height and the aspect ratio kept unchanged
Extracts features for regions of interest > bilinear interpolation > avoids misalignments between the RoI and the extracted features ???
This process has two steps : affine transformation & interpolation > together they produce the final feature map
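
The two steps (affine transformation, then bilinear interpolation) can be sketched in NumPy as below. This is a simplified illustration, not the paper's implementation: the box is parametrized by an assumed center `(cx, cy)`, size `(w, h)` and angle `theta`, pixel centers are taken to sit at integer + 0.5, and `out_h` is a free parameter.

```python
import numpy as np


def bilinear_sample(feat, xs, ys):
    """Sample feat (C, H, W) at fractional pixel indices xs, ys (same shape)."""
    _, H, W = feat.shape
    xs = np.clip(xs, 0, W - 1)
    ys = np.clip(ys, 0, H - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = xs - x0, ys - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy


def roi_rotate(feat, cx, cy, w, h, theta, out_h=8):
    """Crop a rotated box into an axis-aligned (C, out_h, out_w) map."""
    # Fixed height, width chosen so the aspect ratio is preserved.
    out_w = max(1, int(round(out_h * w / h)))
    # Step 1 (affine): output pixel centers in the box frame, then
    # rotated and translated into feature-map coordinates.
    u = (np.arange(out_w) + 0.5) * (w / out_w) - w / 2.0
    v = (np.arange(out_h) + 0.5) * (h / out_h) - h / 2.0
    uu, vv = np.meshgrid(u, v)  # both have shape (out_h, out_w)
    c, s = np.cos(theta), np.sin(theta)
    src_x = cx + uu * c - vv * s
    src_y = cy + uu * s + vv * c
    # Step 2 (interpolation): shift by 0.5 to get pixel indices, then sample.
    return bilinear_sample(feat, src_x - 0.5, src_y - 0.5)
```

With `theta = 0` and a box whose size equals the output size, the function reduces to a plain crop, which is a handy sanity check. Because the output width varies with the aspect ratio, regions from one image cannot be stacked into a batch without padding.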

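why > is the detection performance so high?
- the network and algorithm do not seem to differ much from previous work
- training data - transfer learning & differences in the training sets ??
- the next section may give a hint about this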

Uses a model pre-trained on ImageNet
training process includes two steps
- train on the Synth800k dataset
- fine-tune on real data
  *From the above, transfer learning (fine-tuning) is evidently applied twice.
- ImageNet > Synth800k > real target data
Data augmentation > real data
- 1st) longer sides of images are resized from 640 pixels to 2560 pixels
- 2nd) images are rotated randomly in the range [−10°, 10°]
- 3rd) the heights of images are rescaled with ratio from 0.8 to 1.2 while their widths keep unchanged.
- 4th) 640×640 random samples are cropped from the transformed images.
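
The four augmentation steps can be sketched as parameter sampling, as below. This reflects one reading of the list: the longer side is resized to a random length between 640 and 2560 pixels, and the rotation keeps the canvas size. The function name and the returned dictionary are hypothetical, and no image library is used, so only sizes and the crop window are computed.

```python
import random

CROP = 640  # crop size from step 4


def sample_augmentation(width, height, rng=random):
    """Sample augmentation parameters for one image of the given size."""
    long_side = rng.uniform(640, 2560)           # step 1: target longer side
    scale = long_side / max(width, height)
    angle = rng.uniform(-10.0, 10.0)             # step 2: rotation in degrees
    h_ratio = rng.uniform(0.8, 1.2)              # step 3: height-only rescale
    new_w = round(width * scale)
    new_h = round(height * scale * h_ratio)
    # Step 4: top-left corner of a 640x640 crop. If a side is shorter
    # than 640 the offset is 0 and padding would be needed (assumption).
    x0 = rng.randint(0, max(0, new_w - CROP))
    y0 = rng.randint(0, max(0, new_h - CROP))
    return {"size": (new_w, new_h), "angle": angle, "crop_origin": (x0, y0)}


print(sample_augmentation(1280, 720, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the sampling reproducible for debugging.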
Training uses OHEM (online hard example mining)
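
As a rough illustration of the OHEM idea for a per-pixel text/non-text loss, the sketch below keeps all positive pixels and only the hardest (highest-loss) negatives. The function name and the negative-to-positive ratio `neg_ratio=3` are hypothetical choices, not values stated in these notes.

```python
import numpy as np


def ohem_bce(prob, label, neg_ratio=3):
    """Binary cross-entropy averaged over positives and the hardest negatives.

    prob:  predicted text probabilities, any shape
    label: 1 for text pixels, 0 for background, same shape as prob
    """
    eps = 1e-7
    prob = np.clip(prob, eps, 1 - eps)
    loss = -(label * np.log(prob) + (1 - label) * np.log(1 - prob))
    pos = label == 1
    # Sort negative losses from hardest to easiest.
    neg_losses = np.sort(loss[~pos])[::-1]
    n_keep = min(len(neg_losses), max(1, neg_ratio * int(pos.sum())))
    kept = np.concatenate([loss[pos], neg_losses[:n_keep]])
    return kept.mean()


prob = np.array([0.9, 0.1, 0.2, 0.8, 0.6])
label = np.array([1, 0, 0, 0, 0])
print(ohem_bce(prob, label, neg_ratio=2))
```

In the example, the two easy background pixels (predicted 0.1 and 0.2) are dropped, so the loss focuses on the background pixels the model wrongly scores as text.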