makegirlsmoe/makegirlsmoe_web

The Training dataset

lllyasviel opened this issue · 51 comments

Is it possible to share the 31,255-image dataset?
31,255 128px images would not be that big; it would be very easy to upload them to the Internet.
In fact, we have some unique GAN training tricks and would like to help improve the generated image quality.
It should not violate the illustration websites' licenses if we only share the cropped 128px face images. In fact, many other datasets have already been shared without any license problems, including nico-opendata.
The dataset in your paper is described as very carefully prepared, so many researchers are interested in it, and we do not want to have to collect, crop, and vectorize these images again and again.
Could you please do these researchers a favor :)
"We hope our work would stimulate more studies on generative modeling of anime-style images."
And the dataset could help a lot.

Sorry, I do not own the copyright to these images. You should know that copyright-related issues are very sensitive in Japan; I am in Japan and working for a company that may have future collaborations with those copyright owners. Publishing the training dataset online is almost impossible for us.

Could you release the URLs, labels, and bounding boxes used to crop the faces?

@danielwaterworth I will take it into consideration. It will take some time since all images need to be checked before release (e.g., NSFW images must be removed).

@Aixile, Thanks, I appreciate your consideration and thanks for publishing the project!

@Aixile Is it possible to simply release the SQL query results on ErogameScape so others can crawl them by themselves? It appears that ErogameScape has blocked IPs from other countries.

@Aixile Thanks a lot!!

oh thank you

@lllyasviel I've already crawled all the images, but since the face detector is not very good, it requires a huge manual effort to clean up the dataset. Anyway, you can find the scripts to crawl the images here: https://github.com/shaform/GirlsManifold.

@shaform Face detection is not a problem in 2017 lol. How many pictures have you downloaded? Can you give me a sample image from the dataset?

I think the problem is that lbpcascade_animeface has poor precision/recall; especially for male characters, the recall is lower than 40% with the default settings.
The lack of a labeled dataset is an obstacle to building a powerful anime-face detection model.
Weakly supervised methods might work, but I am not sure about their performance.
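For reference, the typical detection loop with lbpcascade_animeface looks roughly like the sketch below. The cascade file is assumed to come from nagadomi's lbpcascade_animeface repository, and the `scaleFactor`/`minNeighbors`/`minSize` values are only illustrative defaults, not the exact settings used for the paper's dataset.

```python
import cv2

def detect_anime_faces(image_path, cascade_path="lbpcascade_animeface.xml"):
    # Load the LBP cascade (assumed downloaded separately) and the image.
    cascade = cv2.CascadeClassifier(cascade_path)
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # histogram equalization helps the LBP cascade
    # detectMultiScale returns a list of (x, y, w, h) boxes.
    faces = cascade.detectMultiScale(
        gray,
        scaleFactor=1.1,   # image pyramid step
        minNeighbors=5,    # higher -> fewer false positives, lower recall
        minSize=(80, 80),  # drop faces smaller than 80x80, as discussed above
    )
    return [image[y:y + h, x:x + w] for (x, y, w, h) in faces]
```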

We can detect anime faces with or without machine learning. With machine learning, we can do it with or without training.
Without ML, traditional pattern recognition works well.
Without training, Illustration2Vec has tags related to eyes and mouths. Hack the predicted eye or mouth result, then trace it back to the input to get an activation map (see the sketch below).
With training, .......
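A rough sketch of the "trace back to the input" idea, written as a plain gradient saliency map in PyTorch. The `tagger` model and `eye_tag_index` are placeholders for a pretrained tag classifier such as Illustration2Vec; this is not its real API, only the general technique.

```python
import torch

def eye_saliency(tagger, image, eye_tag_index):
    # image: (1, 3, H, W) float tensor; tagger returns per-tag scores (1, num_tags).
    image = image.clone().requires_grad_(True)
    scores = tagger(image)
    scores[0, eye_tag_index].backward()        # "hack" a single tag's score
    saliency, _ = image.grad.abs().max(dim=1)  # collapse channels -> (1, H, W)
    saliency = saliency[0]
    return saliency / (saliency.max() + 1e-8)  # activation map, peaks near the eyes
```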

I see.
That belongs to what I mean by weakly supervised methods.
Anyway, we need experiments to prove the idea and the performance.

@lllyasviel

  • I've downloaded 48,144 images.
  • 37,556 of them are after year 2005.
  • 33,190 faces larger than 80x80 are detected from 37,556 images.

I am not sure about the exact parameters of the face detector, so the detected results might be a little bit different from the original settings. The parameters I used are in the repo I provided.

Sample detection results can be seen here: https://imgur.com/a/9Saf3. As you can see, some of them are false positives.

BTW, someone has tried to improve on anime face detection but failed: https://qiita.com/homulerdora/items/9a9af1481bf63470731a

@shaform Oh, nowadays people rely so much on CNNs and ML! It is a simple problem that can be tackled with traditional PR filters... Maybe I will write some sample code in C++ later...

I've tried to train the SRResNet-like architecture. While the discriminator is okay, once I use the SRResNet generator, the GAN stops learning and generates corrupted results. I am wondering whether anyone has successfully replicated the results.

@shaform

  1. It depends on your situation.

If your generator does not even appear to learn, you should try a lower learning rate.

If your generator looks fine but the loss suddenly grows to a very large value (about 1e6 in my case) in less than 10 iterations, it will cause the generator to crash.
In my experience, this phenomenon sometimes happens with DRAGAN.
I feel that the problem is due to the numerical precision of the gradient penalty calculation.
This can be solved by loading a saved snapshot and restarting training from 1000-2000 iterations before the crash point. Anything that changes the randomness can help you get past the crash point and continue training.
Using a lower learning rate can make the phenomenon happen less often.

  2. I apologize that I made a small mistake in the discriminator architecture in the published manuscript. The mistake is not critical, and I think it will not influence the performance.
    In my experience, the discriminator architecture is not very sensitive as long as you add a gradient penalty term to enforce the Lipschitz constraint (a sketch of the penalty term is below).
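For reference, a minimal PyTorch sketch of a DRAGAN-style gradient penalty as I understand it; the perturbation scale and penalty weight below are common defaults, not necessarily the exact values used here.

```python
import torch

def dragan_gradient_penalty(discriminator, real, lam=10.0):
    # Perturb real samples around the data manifold (DRAGAN-style), then
    # penalize the gradient norm of D at random interpolations.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    perturbed = real + 0.5 * real.std() * torch.rand_like(real)
    x_hat = (alpha * real + (1 - alpha) * perturbed).requires_grad_(True)
    grads = torch.autograd.grad(
        outputs=discriminator(x_hat).sum(), inputs=x_hat, create_graph=True
    )[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()  # add this term to the D loss
```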

@Aixile Thanks! I'll try your suggestions.

If you have ever played with alphaGAN, you will find it is extremely hard to train.

Not with Bayesian-calculated channels instead of casually stacked res blocks and violent GP.

@lllyasviel Thanks! Since I haven't read the paper yet, I am wondering whether the Bayesian calculated channels concept is already in that paper or if you are referring to other techniques?

It is a concept from the very beginning of the first GAN. The number of channels is critical for DCGAN and some GANs proposed by Google Brain. The channels in some cases should be limited to the dimensionality of the latent space vector, because it is Bayesian. For example, the state-of-the-art face generative model BEGAN has a G shaped as 1x1x128 -> 8x8x128 -> 16x16x128 -> 32x32x128 -> 64x64x128 -> 128x128x128 -> 128x128x3. It seems strange that the channel count stays at 128 through the whole procedure, but BEGAN produces very impressive, incredible results in face generation. Another example is hyperGAN, which can generate 256x256 images without any label/mask/pair or other conditional hints; the secret of its success is also Bayesian-calculated channels, and you can check it in their repo.
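To make the constant-width shape concrete, here is a compact PyTorch sketch of a generator with that 1x1x128 -> 8x8x128 -> ... -> 128x128x3 channel schedule. It only illustrates the channel budget; it is not the official BEGAN decoder, whose exact conv/upsample details differ.

```python
import torch.nn as nn

class ConstantWidthGenerator(nn.Module):
    # 1x1x128 -> 8x8x128 -> 16x16x128 -> 32x32x128 -> 64x64x128 -> 128x128x128 -> 128x128x3
    def __init__(self, z_dim=128, ch=128):
        super().__init__()
        self.ch = ch
        self.fc = nn.Linear(z_dim, 8 * 8 * ch)
        blocks = []
        for _ in range(4):  # 8 -> 16 -> 32 -> 64 -> 128 spatially, channels stay fixed
            blocks += [
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(ch, ch, 3, padding=1),
                nn.ELU(inplace=True),
            ]
        self.blocks = nn.Sequential(*blocks)
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, z):
        h = self.fc(z).view(z.size(0), self.ch, 8, 8)
        return self.to_rgb(self.blocks(h))
```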

@lllyasviel Thank you. It's very insightful. I'll try it in my experiments.

The channels in some cases should be limited to the dimensionality of the latent space vector, because it is Bayesian.

What do you mean by Bayesian?

Indeed, I am also curious. It appears that HyperGAN is not using a fixed number of channels: 128x128x3 -> 64x64x16 -> 32x32x32 -> 16x16x48 -> 8x8x64 -> 4x4x80. Is this still Bayesian?

Yes.
You can try this to understand: replace 128x128x3 -> 64x64x16 -> 32x32x32 -> 16x16x48 -> 8x8x64 -> 4x4x80 with 128x128x3 -> 64x64x64 -> 32x32x128 -> 16x16x256 -> 8x8x512 -> 4x4x512 (or 4x4x1024) -> 1x1x128 (or 256 or 512). Then you will only get noise maps as GAN results, whatever your training data is (see the sketch below).
Because the training of MakeGirlsMoe is supervised by conditional hints, channels and layers or resnets can be stacked casually without much consideration. But if all hints are removed, a proper channel count and depth are of critical importance. Some think alphaGAN is difficult to train because nowadays supervised GAN training makes people pay less attention to channels and depth, resorting to resnets and GP.
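To make the two schedules concrete, here is a small helper that builds a strided conv stack from a channel list; the narrow schedule is the HyperGAN-like one quoted above, and the wide one is the replacement I described. This is only an illustration, not either project's actual code.

```python
import torch.nn as nn

def conv_encoder(channels):
    # channels = [3, 16, 32, 48, 64, 80] builds
    # 128x128x3 -> 64x64x16 -> 32x32x32 -> 16x16x48 -> 8x8x64 -> 4x4x80.
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(*layers)

narrow = conv_encoder([3, 16, 32, 48, 64, 80])      # limited, "Bayesian" channel budget
wide   = conv_encoder([3, 64, 128, 256, 512, 512])  # casually stacked channels
```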

@lllyasviel Thanks, but DCGAN uses 64x64x64 -> 32x32x128 -> 16x16x256 -> 8x8x512 -> 4x4x1024. So these are not the optimal settings?
Are there any guidelines for choosing the number of channels? Is it enough to just make sure that the number of channels doesn't exceed the dimensionality of the latent space vector, as indicated by your previous comment?

The channels in some cases should be limited to the dimensionality of the latent space vector

I am talking about a resolution of 256; DCGAN works at 64x64 with that architecture.
Oh, that is my mistake. I mean replacing the 256x256x3 -> 128x128x16 -> 64x64x32 -> 32x32x48 -> 16x16x64 -> 8x8x80 in hyperGAN.

Not with Bayesian-calculated channels instead of casually stacked res blocks and violent GP.

@lllyasviel The same question: what exactly are you referring to by "Bayesian"?
It would be helpful to make explicit references to the relevant publications in the discussion.

I do not agree with you.
Vanilla DCGAN can work on large images.

Here is a result I trained with vanilla DCGAN:
192x192x3 -> 96x96x32 -> 48x48x64 -> 24x24x128 -> 12x12x256 -> 6x6x512

Because the training of MakeGirlsMoe is supervised by conditional hints, channels and layers or resnets can be stacked casually without much consideration.

No, the model without conditional hints works well.
Resblocks are not critical; the model can work without any resblock, but adding resblocks made the quality slightly better in my experiments. As the resblocks are only added to low-resolution feature maps, they are not the bottleneck of the computation (a sketch of such a block is below).
GP is critical because it can enforce the discriminator to be Lipschitz and make the optimization problem well behaved.
But the form of GP is not important in my experiments.
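As a reference for what "resblocks on the low-resolution feature maps" might look like, here is a plain residual block sketch in PyTorch; the exact block used in the model is not specified in this thread, so this is only an illustration.

```python
import torch.nn as nn

class LowResResBlock(nn.Module):
    # A plain pre-activation residual block, intended for small feature maps
    # (e.g. 8x8 or 16x16) where the extra convolutions are cheap.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # identity skip keeps the channel count unchanged
```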

I can only offer some personal experience, because many papers hold different ideas. But what is certain is that the channels must be taken into consideration.
Take human face generation as an example. How come the channel count of BEGAN is 128? Why 128 and not 256 or 512? This can be tested in the following way (a rough sketch follows after the steps):
step 1: make an encoder and a decoder, and link them with a 1x1xN layer. N should be large enough, for example 1024.
step 2: train the network to copy faces, using an L1 loss.
step 3: you will see a blurry face as output.
step 4: reduce N by an amount like 128 or 256 and train again.
step 5: repeat step 4 until the output becomes so blurry that we can no longer see the shape of a face.
In my BEGAN experiment, the final N is 128, the same value used by Google Brain.
Then we know one thing: a human face can be encoded into a 128-dimensional vector. (And if you think about it carefully, you will find this is why BEGAN works terribly on LSUN bedrooms.)
So we get an important number, 128: this number is the source of a face, and anything generated should be traceable back to that vector in a Bayesian way. I mean that no layer should exceed this limit; layers can decorate the features from the previous layer, but we should not give a layer the ability to scramble the main latent features and disturb the next layer's Bayesian backtracing. For example, 1x1x128 -> 8x8x4096 -> 8x8x4096 is a bad choice.
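Here is a rough PyTorch sketch of the bottleneck experiment from the steps above. The conv shapes and training details are my own assumptions; only the "shrink N until the reconstruction collapses" procedure comes from the description.

```python
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAE(nn.Module):
    # Encoder -> 1x1xN bottleneck -> decoder, trained with an L1 copy loss.
    def __init__(self, n, ch=64, image_size=128):
        super().__init__()
        s = image_size // 16  # four stride-2 convs
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch * 4, ch * 4, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Flatten(), nn.Linear(ch * 4 * s * s, n),        # the 1x1xN code
        )
        self.decoder = nn.Sequential(
            nn.Linear(n, ch * 4 * s * s), nn.ReLU(inplace=True),
            nn.Unflatten(1, (ch * 4, s, s)),
            nn.ConvTranspose2d(ch * 4, ch * 4, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def copy_loss(model, faces):
    return F.l1_loss(model(faces), faces)

# Retrain with smaller and smaller N (e.g. 1024 -> 896 -> ... -> 128) until the
# reconstructions become too blurry to show a face; the last usable N is the estimate.
```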

Here is an example of a vanilla GAN I trained on 256x256 images.

Where I use
8x8x1024 -> 16x16x512 -> 32x32x256 -> 64x64x128 -> 128x128x64 -> 256x256x32 -> 256x256x1

If you find the architecture does not appear to converge, using a lower learning rate can solve most problems.

@Aixile Wow, this is so realistic. It seems like we don't really need the complex StackGANs.

These are the results of a DCGAN with 8x8x1024 -> 16x16x512 -> 32x32x256 -> 64x64x128 -> 128x128x64 -> 256x256x32 -> 256x256x1?

OK, these flowers are impressive and have defeated me. I will delete my StackGAN and reimplement your architecture.

@shaform We have new results on end-to-end training of high-resolution images. I am working on the paper. Code will be released after that.

@Aixile Thanks~ I am looking forward to reading the paper.

And then have a war with StackGAN++?
https://github.com/hanzhanggit/StackGAN-v2

The above DCGAN is trained with a learning rate starting from 0.0001 and decaying by a factor of 0.8 every 3000 iterations after the first 30000 iterations. (Batch size 64.)
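One way to read that schedule as code, as a sketch; the optimizer and scheduler below are standard PyTorch, but the interpretation of "every 3000 iterations after 30000 iterations" is my own.

```python
import torch

def lr_at(iteration, base_lr=1e-4, start=30000, step=3000, decay=0.8):
    # 1e-4 for the first 30000 iterations, then multiplied by 0.8 every 3000 iterations.
    return base_lr * decay ** max(0, (iteration - start) // step)

# Example wiring with an Adam optimizer and a per-iteration scheduler:
# optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda it: lr_at(it) / 1e-4)
# ... call scheduler.step() once per training iteration.
```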

@lllyasviel
Since you mention HyperGAN, have you ever tried to reproduce their 256x256 results?
It seems that they did not mention how the 256x256 results were trained.
I ran their code with the default settings, which use SELU and LSGAN on 192x192 CelebA images, but the training failed.

(As CelebA images have a width of 178, I center-crop to 178x178 and upscale to 192x192. Since the full-scale images are used, the dataset contains more noise from the background, which makes generating high-quality images harder. It seems that they didn't use the full-scale CelebA.)

Welp, my friend forwarded this to me this morning, and I was totally blown away.

holy shiiit.
[image]
And their G outputs look so natural and excellent, without any flaring artifacts.
Many things can be reconsidered now, including cGAN I think.

Almost everything they used is different from the GAN literature...
Weight scaling, pixel normalization, smoothed generator weights, additional regularization for WGAN-GP, which makes it extremely difficult to catch up with their progress.
I quickly implemented a weight-scaling + pixel-normalization based 32x32 generator, but I failed to train it in an end-to-end manner.
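For anyone else trying the same, here is a quick PyTorch sketch of pixel normalization and the runtime weight scaling (equalized learning rate) as I understand them from the paper's descriptions; this is a re-implementation attempt, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelNorm(nn.Module):
    # Normalize each pixel's feature vector to roughly unit length across channels.
    def forward(self, x, eps=1e-8):
        return x * torch.rsqrt(torch.mean(x ** 2, dim=1, keepdim=True) + eps)

class EqualizedConv2d(nn.Module):
    # Weights are drawn from N(0, 1) and rescaled at runtime by the He constant,
    # so every layer sees a similar effective learning rate ("weight scaling").
    def __init__(self, c_in, c_out, k, padding=0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k))
        self.bias = nn.Parameter(torch.zeros(c_out))
        self.scale = (2.0 / (c_in * k * k)) ** 0.5
        self.padding = padding

    def forward(self, x):
        return F.conv2d(x, self.weight * self.scale, self.bias, padding=self.padding)
```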

I think the "additional regularization for WGAN-GP" is not so critical and the architecture can be reimplemented without it.
In my opinion, the "weight scaling, pixel normalization, smoothed generator weights" can be replaced by a very unreasonable method, but it should work: just lock the weights of different layers during different training stages.
I am busy with exams now, but I will try it once I have time.

The main objective of these weight regularizations, I think, is to keep the trained weights from being disturbed by newly initialized weights.
Then why not lock these weights directly? I have not tried it yet... (a sketch of the idea is below).
BTW, I read the paper again, and I think maybe the results can be achieved directly without the "progressive" part, because their methods make no improvement to the GAN system or the G's capacity. Maybe the "progressive" part does nothing other than act as an accelerator? Maybe without their methods I could also achieve these results with one year of training, while with their methods I could get them in 40 days?
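A minimal sketch of the "lock the weights" idea in PyTorch: freeze the previously trained blocks and only let the optimizer see the newly added block. The per-resolution `generator.blocks` structure is hypothetical; this is just my reading of the comment above, not anything from the paper.

```python
import torch

def freeze(module):
    # Lock previously trained layers so newly initialized layers cannot disturb them.
    for p in module.parameters():
        p.requires_grad_(False)

# Hypothetical generator with one block per resolution, e.g. blocks = [b8, b16, b32, b64].
# When growing to the next resolution:
# for block in generator.blocks[:-1]:
#     freeze(block)
# optimizer = torch.optim.Adam(
#     (p for p in generator.parameters() if p.requires_grad), lr=1e-4)
```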

The so-called weight scaling (equalized learning rate) looks very similar to weight normalization, and someone has shown that WN works well with GANs (1). I personally dislike BN very much, so I have been using WN with GANs for some time. I feel that when BN is removed, GANs appear more likely to run into gradient explosion. Perhaps their pixel normalization mitigates this issue very well.

(1): https://arxiv.org/abs/1704.03971
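Applying weight normalization to GAN layers is essentially a one-liner in PyTorch; the layer shapes below are placeholders, just to show the wrapping.

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

# Wrap conv layers with weight normalization instead of using batch norm;
# the channel sizes here are placeholders.
d_layer = weight_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1))
g_layer = weight_norm(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1))
```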

@Aixile did you release the new code for training high-resolution images? I'm looking forward to seeing the paper. Thanks