Be Your Own Prada: Fashion Synthesis with Structural Coherence

Question

Opened this issue 6 years ago · 1 comments

chullhwan-song commented 6 years ago

http://mmlab.ie.cuhk.edu.hk/projects/FashionGAN/

Answer 1 · 2019-02-27T03:00:26.000Z

Abstract

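Proposes a GAN-based method for dressing a person in new clothes.
input = { image, sentence } > multi-modal
- image : the person to be dressed.
- sentence : a description of the clothing to put on (describing a different outfit)
Given this input, the method re-dresses the wearer (who is already wearing different clothes) without changing the wearer or the pose.
The figure in the paper makes this easier to understand.
Generating an image of a person wearing an outfit that matches a descriptive sentence is a new task.
To solve it, the paper splits the "generative process" into two stages.
- semantic segmentation map - reserves the regions where the clothes will go on the wearer (latent spatial arrangement, effective spatial constraint)
- the proposed generator model then renders an image with texture applied precisely to each region (using the semantic segmentation map)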
DeepFashion dataset

Methodology

GAN based
- z is a random or encoded vector
- p_data is the empirical distribution of training images,
- p_z is the prior distribution of z
- minimax game
  - the distribution of G(z) tries to get close to p_data, while D tries to tell real samples from generated ones.
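
For reference, the standard GAN minimax objective that these symbols belong to can be written as follows. This is the textbook form; the paper's exact notation may differ.

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```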

Overview of FashionGAN

input (same as above)
- original image of a wearer
- sentence description of the new outfit.
  - ex) "a white blouse with long sleeves but without a collar"
goal : generate an image in which the person from the input image wears the outfit described by the sentence
scenario
- rather than dressing the original image (I_0) directly, first extract the person's segmentation map (S_0).
  - segmentation map> pixel-wise class labels > hair, face, upper-clothes, pants/shorts, etc
- extract information about the wearer
  - extract a vector of binary attributes (a) > person's face, body and other physical characteristics.
    - ex) gender, long/short hair, wearing/not wearing sunglasses and wearing/not wearing hat.
  - in addition, the mean RGB values of skin color and the aspect ratio of the person (coarse body size) are extracted
  - these attributes must be preserved in the final output.
- text encoder (v)
- to summarize,
  - design coding d = (a, v) & human segmentation map S_0 are the actual inputs
    - see Eq. (3) below
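
A minimal sketch of how such a design coding could be assembled, assuming it is simply the concatenation of the wearer attribute vector a and the text encoding v. The function name, the example attribute order, and the vector sizes are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def build_design_coding(binary_attrs, skin_rgb, aspect_ratio, text_vec):
    """Concatenate wearer attributes (a) and the text encoding (v) into d.

    Assumption: d is a plain concatenation of a and v.
    """
    a = np.concatenate([
        np.asarray(binary_attrs, dtype=np.float32),    # e.g. gender, long hair, sunglasses, hat
        np.asarray(skin_rgb, dtype=np.float32),        # mean skin RGB, assumed scaled to [0, 1]
        np.asarray([aspect_ratio], dtype=np.float32),  # coarse body size
    ])
    v = np.asarray(text_vec, dtype=np.float32)         # sentence encoding (hypothetical size)
    return np.concatenate([a, v])

# Hypothetical example values, for illustration only.
d = build_design_coding([1, 0, 0, 1], [0.8, 0.6, 0.5], 0.45, [0.1, -0.2, 0.3, 0.0])
print(d.shape)
```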
overall generative process
- the human segmentation (shape) generation (corresponding to the desired/target outfit) - Eq. (2)
- texture rendering - Eq. (3)
  - the shape-stage input is a low-resolution representation of S_0
    - spatial constraint to ensure structural coherence
- Eq. (2) can be confusing: S_0 is the labeled segmentation map produced by an initial human parsing (clothing parsing) step (prior work on this exists), and Eq. (2) seems to mean generating a new map from it. > see Sec. 3.2, Segmentation Map Generation
  - i.e., on top of the initial map, a second, more accurate(?) segmentation map is generated.
- the texture for each semantic part has to be made consistent with the segmentation map > compositional mapping - Eq. (3)
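
Based on the inputs listed in these notes (noise, a low-resolution S_0, and the design coding for the shape stage; noise, the generated map, and the design coding for the rendering stage), Eq. (2) and Eq. (3) can be sketched roughly as below. The symbols \tilde{S}, \tilde{I}, z_S, z_I and the down-sampling operator \downarrow m are notational assumptions and may not match the paper exactly.

```latex
\begin{align}
\tilde{S} &\leftarrow G_{\text{shape}}\left(z_S,\ \downarrow\! m(S_0),\ d\right) \tag{2} \\
\tilde{I} &\leftarrow G_{\text{image}}\left(z_I,\ \tilde{S},\ d\right) \tag{3}
\end{align}
```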

Segmentation Map Generation (G_shape)

The goal of the generator G_shape is to produce a new segmentation map.
In other words, it regenerates the segmentation information in S_0 as a segmentation map that matches the design coding d.

spatial constraint

As noted above, generating it takes three inputs, Fig. 2
- S_0 (as the spatial constraint), design coding d, and Gaussian noise z are the inputs.
- constructing S_0
  - the original image is parsed with a pixel-wise one-hot encoding > S_0 > L images of size m x n > one 0/1 binary image per label
  - in this paper, "L = 7 corresponding to background, hair, face, upper-clothes, pants/shorts, legs, and arms."
- what the person in the original image is wearing is not needed; only information about the person in the image is captured.
- the key role of the spatial constraint is preserving structural coherence.
  - down-sample & merge S_0
    - S_0 is down-sampled apparently because high-resolution information causes problems when generating the new map.. > honestly this did not click for me.

      * down-sampling weakens the correlation (contradiction) between S_0 and d.. which is good?
      * only the human body shape information is wanted from S_0; the design coding d does not actually have to agree with the S_0 segmentation.
      * It feels laboriously written, but the point is that S_0 and d carry conflicting information: d describes what we want to put on, whereas the S_0 segmentation encodes what is currently worn. > we may want long sleeves while S_0 shows otherwise...
The open question is still why exactly S_0 has to be down-sampled.
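
The S_0 preprocessing described in this section (one-hot encoding, merging labels, down-sampling to a coarse grid) can be sketched as follows. The 128x128 input size, the particular grouping of the 7 labels into 4, and block averaging as the down-sampling method are assumptions made for illustration; only the 7 labels and the 8×8×7 / 8×8×4 shapes come from these notes.

```python
import numpy as np

LABELS = ["background", "hair", "face", "upper-clothes", "pants/shorts", "legs", "arms"]
# Assumed merge: keep background, hair, face; fold everything else into one "rest" channel.
GROUPS = [[0], [1], [2], [3, 4, 5, 6]]

def one_hot(label_map, num_labels=len(LABELS)):
    """(m, n) integer labels -> (m, n, L) stack of binary maps."""
    return np.eye(num_labels, dtype=np.float32)[label_map]

def merge_labels(seg, groups=GROUPS):
    """Sum the channels of each group: (m, n, L) -> (m, n, len(groups))."""
    return np.stack([seg[..., g].sum(axis=-1) for g in groups], axis=-1)

def downsample(seg, out_size=8):
    """Block-average an (m, n, C) map down to (out_size, out_size, C)."""
    m, n, c = seg.shape
    bh, bw = m // out_size, n // out_size
    cropped = seg[:bh * out_size, :bw * out_size]
    return cropped.reshape(out_size, bh, out_size, bw, c).mean(axis=(1, 3))

# Random labels stand in for a real parsing result.
rng = np.random.default_rng(0)
label_map = rng.integers(0, len(LABELS), size=(128, 128))
s0 = one_hot(label_map)                    # (128, 128, 7)
prior_8_7 = downsample(s0)                 # down-sampled, not merged
prior_8_4 = downsample(merge_labels(s0))   # down-sampled and merged
print(s0.shape, prior_8_7.shape, prior_8_4.shape)
```

After averaging, each coarse cell holds the fraction of pixels per label, so fine clothing boundaries (for example sleeve length) are blurred while the rough body layout is kept.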

Shape Generation

The segmentation map must have attributes consistent with the design coding d.
The human shape to be generated must keep the pose found in S_0.
A GAN is used to train G_shape.
- generator and discriminator > convolution/de-convolution layers with batch normalization and non-linear operations
- Softmax activation function on each pixel, across the labels
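
A per-pixel softmax over the label channels can be sketched as below. It assumes the generator's last layer outputs raw scores of shape (m, n, L); that layout and the function name are illustrative choices.

```python
import numpy as np

def pixelwise_softmax(logits):
    """(m, n, L) raw scores -> (m, n, L) probabilities that sum to 1 at every pixel."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Random scores stand in for a generator output.
scores = np.random.default_rng(0).normal(size=(128, 128, 7))
probs = pixelwise_softmax(scores)
hard_labels = probs.argmax(axis=-1)  # most likely label per pixel
print(probs.shape, hard_labels.shape)
```

Each pixel then carries a distribution over the L labels, which has the same form as the one-hot S_0 maps and can be compared against them by the discriminator.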

Texture Rendering (G_image)

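The generated segmentation map is consistent with both the body shape in the input image and the attributes of the design coding d.
Based on it, texture rendering generates the desired final output image.
A GAN is used again here = FashionGAN
- its inputs are again the generated segmentation map, design coding d, and Gaussian noise z.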

Compositional Mapping

The image is generated under the guidance of the segmentation map > a plain GAN generates without regard to region information.
- this seems to be why the word "compositional" is used.
This gives better results
- the new mapping layer helps to generate textures more coherent to each region and maintain visibility of body parts.
"Guided by the segmentation map" also means that each segmentation part, i.e., each labeled category region, is used.
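
One way to read this guidance is as a mask-weighted composition: a rendering is produced per label, and the final image takes each pixel from the rendering of its own region. The sketch below illustrates that reading only; `region_images` is a hypothetical stand-in for per-label generator outputs, and the actual mapping layer in the paper may differ.

```python
import numpy as np

def compose(region_images, seg):
    """Blend per-label renderings using the segmentation map as masks.

    region_images: (L, m, n, 3), one candidate rendering per label (hypothetical)
    seg:           (m, n, L), one-hot or soft segmentation map
    returns:       (m, n, 3) composed image
    """
    masks = np.moveaxis(seg, -1, 0)[..., None]  # (L, m, n, 1)
    return (masks * region_images).sum(axis=0)

# Toy example: label i is rendered as a flat image with value i.
num_labels = 7
label_map = np.random.default_rng(0).integers(0, num_labels, size=(4, 4))
seg = np.eye(num_labels)[label_map]
region_images = np.stack([np.full((4, 4, 3), float(i)) for i in range(num_labels)])
image = compose(region_images, seg)
print(image.shape)
```

With a one-hot map every output pixel comes from exactly one region's rendering, which is what keeps textures tied to their regions.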

Image Generation

Similar to the GAN network used in Shape Generation.
- except that a Tanh activation function is used

Experiments

Shape prior S_0
- One-Step-8-7: We use the down-sampled but not merged segmentation map (8×8×7) as the prior;
- One-Step-8-4: We use the down-sampled merged segmentation map (8×8×4) as the prior (the same setting we used in our first stage GAN).

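Conclusion

Personally, I do not think the results of this work are very good. It also seems related to the approach of Everybody Dance Now, which I covered earlier.. Wouldn't combining it with something like ProGAN give better results than this??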