Connections between SVMs, Wasserstein distance and GANs

Check out my new paper entitled “Support Vector Machines, Wasserstein’s distance and gradient-penalty GANs are connected”! 😸 In this paper, we explain how one can derive SVMs and gradient-penalized GANs (or those with a Lipschitz-1 discriminator) from the same framework! We also show new gradient penalties that lead to better GANs.

This paper may completely change your perspective on the Wasserstein distance, the Wasserstein GAN (WGAN), the Hinge GAN (HingeGAN), and the use of gradient penalties in GANs. 🙀 At least, it did for me!

As I started preparing for my PhD comprehensive exam, I was learning about Support Vector Machines (SVMs) and I thought to myself: “What would happen if we generalized SVMs to neural networks?”. In doing so, I discovered a link between SVMs, GANs, and the Wasserstein distance.

Most people familiar with Machine Learning have heard of SVMs. As it turns out, the SVM is just a special case of what we call a Maximum-Margin Classifier (MMC). MMCs are classifiers f that maximize a margin (the distance between the decision boundary of the classifier and a data point). The decision boundary is the set of points where we cannot tell which class a sample belongs to (all x such that f(x) = 0). Soft-SVM is the special case that maximizes the expected soft L2-norm margin¹. See Soft-SVMs in action:
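To make this concrete, here is a minimal soft-SVM sketch: a linear classifier trained by subgradient descent on the hinge loss plus L2 regularization. This is purely illustrative (the dataset, hyperparameters, and function names are my own, not the paper's code):

```python
import numpy as np

def train_soft_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize mean hinge loss + lam * ||w||^2 for a linear f(x) = w @ x + b."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)              # y_i * f(x_i), the soft margins
        active = margins < 1                   # samples inside the soft margin
        # Subgradient of the hinge loss: only "active" samples contribute,
        # so points far from the boundary have no influence.
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny linearly separable toy dataset (classes +1 and -1)
X = np.array([[2., 2.], [3., 3.], [-2., -2.], [-3., -3.]])
y = np.array([1., 1., -1., -1.])
w, b = train_soft_svm(X, y)
preds = np.sign(X @ w + b)
```

Note how the hinge loss zeroes out the contribution of well-classified samples, which is exactly the "points far from the boundary have no influence" behavior discussed below.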

Before explaining the results, there is a key element to understand: there are multiple definitions of what is considered “a margin”. The margin is defined as either #1) the distance between a given sample and the boundary, or #2) the distance between the closest sample to the boundary and the boundary. Normally, people assume definition #2; however, under that definition, what the SVM literature calls the “functional margin” and the “geometric margin” would not be considered margins! This can be very confusing! 😵 A better way to understand the difference is to consider #1 to be the margin of a sample and #2 to be the margin of a dataset. To disambiguate the two cases, I refer to the former as the margin and the latter as the minimum margin (min-margin).
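For a linear classifier f(x) = w·x + b, both notions are easy to compute; here is a small sketch (the weights and data points are made up for illustration):

```python
# Margin of a sample (definition #1) vs. min-margin of a dataset (definition
# #2) for a linear classifier f(x) = w @ x + b.
import numpy as np

def sample_margins(w, b, X):
    """L2 distance of each sample to the boundary f(x) = 0 (definition #1)."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), 1.0          # ||w||_2 = 5
X = np.array([[1.0, 1.0], [2.0, 0.0]])    # f(x) = 8 and 7
margins = sample_margins(w, b, X)         # -> [1.6, 1.4], one margin per sample
min_margin = margins.min()                # definition #2 -> 1.4
```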

Hard-SVMs (the original formulation) solve the problem of maximizing the minimum margin. Soft-SVMs solve the much easier problem of maximizing the expected soft-margin (the negative of the expected hinge loss). The hinge loss ensures that samples far from the boundary have no influence, which pseudo-replicates the effect of Hard-SVMs. From this perspective, maximizing an expected margin rather than a minimum margin still leads to a maximum-margin classifier, but one that can be influenced by points far from the boundary (if we don’t use the hinge loss). Thus, maximizing the expected margin means maximizing the average distance between any sample (i.e., data point) and the decision boundary. These methods are all examples of Maximum-Margin Classifiers (MMCs).

To make things as general as possible, I devised a framework to derive loss functions for MMCs. I observed that margin-based objective functions (objective functions F of the form F(yf(x))) with gradient penalties can be derived from this framework. This means that Standard GAN, Least-Squares GAN, WGAN, and HingeGAN with gradient penalty (GP) are all MMCs. All of these methods (when using the L2 gradient norm penalty, as in WGAN-GP) maximize the expected L2-norm margin.
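As a sketch of what such a penalty looks like, here is the two-sided L2 gradient penalty (the WGAN-GP form) computed for a linear critic f(x) = w·x + b, whose gradient with respect to the input is simply w. Real implementations use automatic differentiation on a neural critic; all names and values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_gradient_penalty(w, real, fake):
    """E[(||grad_x f(x_hat)||_2 - 1)^2] on interpolates between real and fake."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1 - eps) * fake    # random interpolates, as in WGAN-GP
    grads = np.tile(w, (x_hat.shape[0], 1))  # grad of w @ x + b w.r.t. x is w
    norms = np.linalg.norm(grads, axis=1)    # per-sample L2 gradient norm
    return np.mean((norms - 1.0) ** 2)

real = rng.normal(size=(4, 2))
fake = rng.normal(size=(4, 2))
pen_unit = l2_gradient_penalty(np.array([0.6, 0.8]), real, fake)  # ||w||_2 = 1
pen_big = l2_gradient_penalty(np.array([3.0, 4.0]), real, fake)   # ||w||_2 = 5
```

With a unit-norm gradient the penalty vanishes; with norm 5 it equals (5 − 1)² = 16, pulling the critic back toward the Lipschitz-1 regime.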

I also show that most GANs which assume a Lipschitz-1 discriminator (spectral-normalized HingeGAN, WGAN, WGAN-GP, etc.) can be represented as MMCs, since assuming a 1-Lipschitz function is equivalent to assuming a bounded gradient (and thus a gradient-penalized formulation). Importantly, this means that the most successful GANs (BigGAN, StyleGAN) can be considered MMCs. Assuming a Lipschitz-1 discriminator has been thought to be a key element in obtaining a good GAN, but the benefit could instead come from having a discriminator that maximizes a margin and a Relativistic Discriminator. We can explain the benefits of an MMC discriminator by the fact that it brings more gradient signal to the fake (generated) samples (see the paper for details).

At this point, you might wonder: “Are certain margins better than others, and if so, can we make better GANs?” The answer to both questions is yes! We know that loss functions which minimize an L1-norm are more robust to outliers than those which minimize an L2-norm (e.g., least absolute deviations vs. least squares). Based on this fact, we suspect that L1-norm margins lead to more robust classifiers and potentially better GANs than the L2-norm margins we normally use. Importantly, an L1-norm margin leads to an L∞ gradient norm penalty, while an L2-norm margin leads to an L2 gradient norm penalty. My experiments show that, indeed, L∞ gradient norm penalties (which arise from using an L1-norm margin) generally lead to better GANs! 🙀
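The L∞ variant is the same sketch with a different norm: instead of pushing the L2 norm of the gradient toward 1, we push its largest absolute component toward 1. Again, a linear critic keeps the input gradient analytic, and all names here are illustrative:

```python
import numpy as np

def linf_gradient_penalty(w, x_hat):
    """E[(||grad_x f(x_hat)||_inf - 1)^2] for a linear critic f(x) = w @ x + b."""
    grads = np.tile(w, (x_hat.shape[0], 1))   # gradient w.r.t. the input is w
    norms = np.abs(grads).max(axis=1)         # L-infinity norm per sample
    return np.mean((norms - 1.0) ** 2)

x_hat = np.zeros((4, 2))                      # stand-in for the interpolates
pen_ok = linf_gradient_penalty(np.array([1.0, 0.5]), x_hat)   # max|w_i| = 1
pen_big = linf_gradient_penalty(np.array([3.0, 0.5]), x_hat)  # max|w_i| = 3
```

The first penalty is 0 (the largest component is already 1); the second is (3 − 1)² = 4.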

In addition, the experiments show that HingeGAN-GP is generally better than WGAN-GP (which makes sense, since the hinge loss is robust to outliers away from the boundary) and that penalizing only gradient norms above 1 (instead of forcing all gradient norms to be approximately 1, as done in WGAN-GP)² generally works better. Thus, although this is a theoretical paper, we discovered some really useful ideas on how to improve GANs.
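The difference between the two penalty styles is a one-line change. A sketch with made-up per-sample gradient norms:

```python
import numpy as np

def two_sided_penalty(norms):
    """WGAN-GP style: push every gradient norm toward 1."""
    return np.mean((norms - 1.0) ** 2)

def one_sided_penalty(norms):
    """Hinge style: only penalize gradient norms above 1."""
    return np.mean(np.maximum(norms - 1.0, 0.0) ** 2)

norms = np.array([0.5, 1.0, 2.0])   # illustrative per-sample gradient norms
# The two-sided penalty also punishes the norm 0.5 for being below 1;
# the one-sided penalty leaves it alone and only targets the norm 2.0.
```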

Using this framework, I was also able to define the decision boundary and margin for Relativistic paired (Rp) and Relativistic average (Ra) GANs. People have often wondered why RpGANs didn’t perform as well as RaGANs. Read the paper if you want to know why 😉, it has to do with margins.

The idea of using an L1-norm margin is just the tip of the iceberg. This framework could be used to devise even better GANs through more robust margins (and thus better gradient penalties or “spectral” normalization techniques). It also finally provides a clear theoretical argument as to why it makes sense to use gradient penalties or enforce a 1-Lipschitz constraint in GANs that don’t estimate the Wasserstein distance. I skipped over a lot of details, so make sure to read the paper to learn more. Let me know about any feedback you have and your results when using L∞-norm penalties.

Code is available on:

¹ For more info on the different Lp-norms, check out this great blog post.
² There are previous papers that also show this: [1], [2].