I experimented with generating faces of cats using generative adversarial networks (GANs). I wanted to try DCGAN, WGAN, and WGAN-GP at low and higher resolutions. I used the CAT dataset (yes, this is a real thing!) as my training sample. This dataset has 10k pictures of cats. I centered the images on the kitty faces and removed outliers (by visual inspection, which took a couple of hours…). I ended up with 9304 images bigger than 64 x 64 and 6445 images bigger than 128 x 128.
The DCGAN generator converges to very realistic pictures in about 2-3 hours (209 epochs), but some mild tweaking is necessary for proper convergence. You must choose separate learning rates for D and G so that neither becomes way better than the other; it's a careful balance, but once you get it, you're set for convergence! With 64 x 64 images, the sweet spot is a learning rate of .00005 for the Discriminator and .0002 for the Generator. There's no apparent mode collapse and we end up with really cute pictures!
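As a rough sketch of that setup (the networks here are toy stand-ins, not my actual architecture, and the Adam betas are an assumption, not something from the experiments above), this is how you would give D and G those separate learning rates in PyTorch:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real DCGAN discriminator and generator.
D = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(100, 3 * 64 * 64), nn.Tanh())

# Separate learning rates: D learns slower so it doesn't overpower G.
# beta1 = 0.5 is the usual DCGAN choice (an assumption here).
opt_D = torch.optim.Adam(D.parameters(), lr=0.00005, betas=(0.5, 0.999))
opt_G = torch.optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
```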
High-Resolution DCGAN and SELU
All my initial attempts at generating cats in 128 x 128 with DCGAN failed. However, simply by replacing the batch normalizations and ReLUs with SELUs, I was able to get slow (6+ hours) but steady convergence with the same learning rates as before. SELUs are self-normalizing and thus remove the need for batch normalization. SELUs are very new, so little research has been done on them with GANs, but from what I observed, they seem to greatly increase GAN stability. The cats are not as good-looking as the previous ones and there is a noticeable lack of variety (lots of black cats with similar faces). This is mostly explained by the smaller sample size: N=6445 rather than N=9304, since I only trained these models on images bigger than 128 x 128. Still, some cats are pretty good-looking and they are in higher resolution than before, so I still consider this a success!
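To illustrate the swap (layer sizes here are illustrative, not my exact architecture), here is what replacing BatchNorm + ReLU with SELU looks like in a PyTorch generator upsampling block:

```python
import torch
import torch.nn as nn

def up_block_bn(in_ch, out_ch):
    # Standard DCGAN-style upsampling block: conv-transpose + BatchNorm + ReLU.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(True),
    )

def up_block_selu(in_ch, out_ch):
    # SELU is self-normalizing, so the BatchNorm layer is dropped entirely.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1, bias=False),
        nn.SELU(True),
    )

x = torch.randn(2, 16, 8, 8)
y = up_block_selu(16, 8)(x)
print(y.shape)  # torch.Size([2, 8, 16, 16]) -- spatial size doubles
```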
WGAN (Wasserstein GAN)
The WGAN generator converges very slowly (4-5 hours, 600+ epochs) and only when using 64 hidden nodes; I could not make it converge with 128 hidden nodes. With DCGAN, you have to tweak the learning rates a lot, but you can tell quickly when it's not going to converge (the loss of D or G drops to 0 right at the start). With WGAN, you need to let it run for many epochs before you can tell.
Visually, there is some pretty striking mode collapse here: many cats have heterochromia, one eye closed and one eye open, or a weird nose. Overall the results are not as impressive as with DCGAN, but the networks here are less complex, so this might not be a fair comparison. It also seems to get stuck in a local optimum. So far, WGAN is disappointing.
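For reference, a single WGAN critic step follows the original paper's recipe: RMSProp and clipping the critic weights to [-0.01, 0.01]. A sketch with toy stand-in networks (not my actual architecture):

```python
import torch
import torch.nn as nn

# Toy critic (no sigmoid: it outputs an unbounded score) and generator.
C = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))
G = nn.Sequential(nn.Linear(100, 3 * 64 * 64), nn.Tanh())
opt_C = torch.optim.RMSprop(C.parameters(), lr=5e-5)

real = torch.randn(8, 3, 64, 64)
fake = G(torch.randn(8, 100)).view(8, 3, 64, 64)

# The critic maximizes E[C(real)] - E[C(fake)], so we minimize the negative.
loss_C = -(C(real).mean() - C(fake.detach()).mean())
opt_C.zero_grad()
loss_C.backward()
opt_C.step()

# Weight clipping crudely enforces the Lipschitz constraint.
for p in C.parameters():
    p.data.clamp_(-0.01, 0.01)
```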
WGAN-GP (an improved version of WGAN that uses a gradient penalty instead of weight clipping) might be able to deal with these issues. In their paper, Gulrajani et al. (2017) were able to train a 101-layer neural network to produce pictures! So I doubt that training a cat generator with 5 layers and 128 hidden nodes would be much of a problem. The Adam optimizer also has properties that lower the risk of mode collapse and of getting stuck in a bad local optimum (ref). This likely contributes to WGAN's problems, since WGAN doesn't use Adam while DCGAN and WGAN-GP both do.
WGAN-GP (Improved WGAN)
The WGAN-GP generator converges very slowly (6+ hours), but it does so with pretty much any settings. It works directly out of the box, with no tweaking necessary: you can increase or decrease the learning rate by a lot without causing many problems. For this, WGAN-GP really has my appreciation.
The cats are very diverse and there is no apparent mode collapse, so this is a major improvement over WGAN. On the other hand, the cats look very blurry, kind of as if you were looking at up-scaled versions of low-resolution pictures, and I'm not sure why. This might be a peculiarity of the Wasserstein loss. I assume that different learning rates and architectures would help. Further attempts need to be made; it certainly has a lot of potential.
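The gradient penalty that WGAN-GP uses in place of weight clipping can be sketched as follows (toy critic, penalty coefficient lambda = 10 as in Gulrajani et al., 2017):

```python
import torch
import torch.nn as nn

C = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))  # toy critic
real = torch.randn(4, 3, 64, 64)
fake = torch.randn(4, 3, 64, 64)

# Interpolate randomly between real and fake samples.
eps = torch.rand(4, 1, 1, 1)
x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

# Penalize the critic's gradient norm for deviating from 1,
# instead of clipping its weights.
out = C(x_hat)
grads = torch.autograd.grad(out.sum(), x_hat, create_graph=True)[0]
gp = 10.0 * ((grads.view(4, -1).norm(2, dim=1) - 1) ** 2).mean()
```

This `gp` term is simply added to the critic's loss before the backward pass.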
LSGAN (Least Squares GAN)
LSGAN is a slightly different approach where we minimize the squared distance between the Discriminator output and its assigned label. The authors recommend using 1 for real images and 0 for fake images in the Discriminator update, then 1 for fake images in the Generator update. A paper by Hjelm et al. (2017) suggests instead using 1 for real images and 0 for fake images in the Discriminator update, but .50 for fake images in the Generator update, so that the Generator seeks the decision boundary.
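Both label schemes boil down to simple squared-error losses on the Discriminator's output. A sketch with dummy outputs (the tensors here are made-up values, just to show the formulas):

```python
import torch

d_real = torch.tensor([0.9, 0.8])  # D's outputs on real images (dummy values)
d_fake = torch.tensor([0.2, 0.3])  # D's outputs on fake images (dummy values)

# Discriminator update (both schemes): push real -> 1, fake -> 0.
loss_D = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

# Generator update, original LSGAN: push fakes toward 1.
loss_G = ((d_fake - 1) ** 2).mean()

# Generator update, boundary-seeking variant: push fakes toward 0.5.
loss_G_bs = ((d_fake - 0.5) ** 2).mean()
```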
I didn’t have time to do full runs with it yet, but it seems quite stable overall and outputs nice-looking cats. Although it is generally stable, one time the loss and gradients exploded and things went from cats to nonsense. You can see epochs 31 and 32:
So it’s not completely stable; it can break down really badly. Choosing better hyper-parameters for the Adam optimizer would help prevent that. You don’t need to tweak the learning rate as with DCGAN, though, and when it doesn’t break down (breakdowns might be rare), it seems to lead to good-looking cats.
Edit: The first author of LSGAN, Xudong Mao, sent me an example of LSGAN generating cats in 128×128 which shows that this approach can create reasonably good samples. You can see their results here:
Source code available on my GitHub.