What is WDRO?#

Theory

Wasserstein Distributionally Robust Optimization (WDRO) is a mathematical framework that can provide robustness to data shifts, in machine learning models as well as many settings. Bellow is a contextualized explanation on its usecases, alternatives, foundations, and core advantages.

Machine Learning models#

Let us denote the cost (or “loss”) function \(L_\theta(\xi)\) associated with a prediction, parametrized by \(\theta\) for some uncertain data input \(\xi\in\Xi\). For instance, in linear regression, we have \(\xi=(x,y)\in\mathbb{R}^d\times\mathbb{R}\) with \(x\) the data and \(y\) the target. Then, one may pick as loss function the “Squared Error”, leading in average to the popular MSE estimator.

It is formally written as \(L_\theta(\xi) = \frac{1}{2}(y- \langle \theta , x \rangle)^2\). In general for regression, this will be framed as \(L_\theta(\xi):=\ell(y-\langle\theta, x\rangle)\) for some scalar function \(\ell: \mathbb{R}\to\mathbb{R}\) such that wolog \(\ell(0)=0\). On the other hand, for classification tasks, we have \(\xi=(x,y)\in\mathbb{R}^d\times\{0, 1\}\) with \(x\) the data and \(y\) the target/label. One may use the soft-margin loss function for logistic regression \(L_\theta(\xi):=\log\left(1+e^{-y\langle\theta, x\rangle}\right)\). In general for linear classification, as can be seen in the logistic regression case, the losses will instead be framed as \(L_\theta(\xi):=\ell(-y\langle\theta, x\rangle)\) for some scalar function \(\ell: \mathbb{R}\to\mathbb{R}\) such that wolog \(\ell(+\infty)=0\).

You can learn more about how you can use the library to build a PyTorch optimization model by reading through the user guide.

In machine learning, it is usual to train our model (or fit, i.e. optimize on \(\theta\)) using a finite amount of data samples \((\xi_i)_{i=1}^N\), by minimizing the Empirical Risk over this available dataset, which leads to the problem:

(1)#\[\min_{\theta} \frac{1}{N} \sum_{i=1}^N L_\theta(\xi_i)\]

Formally, one may view this dataset as an empirical distribution, that may be sampled at will through a query function (e.g. for PyTorch, which is favored by SkWDRO’s internals and interfaces, see their tutorial as well as the implementation to see how this query mechanism is used in practice). Equation (1) is usually called Empirical Risk Minimization (ERM) in the literature.

Robustness#

Robust optimization aims to provide decision-making with some level of resilience against uncertainty. Traditional approaches include:

  • defining an uncertainty set where the worst-case scenario is considered in a deterministic way on the whole set of possible outcomes \(\Xi\), or subsets thereof,

  • or assuming that the empirical distribution represents closely enough the true distribution of the uncertain variable (see (1)), hence trusting that resampling \(\hat{\mathbb{P}}^N\) multiple times will cover enough uncertainty.

However, these methods can be limiting due to difficulties in setting appropriate uncertainty sets, intractability of the associated optimisation problems, and issues with the representativeness of the available samples.

Distributionally Robust Optimization (DRO) introduces a solution by considering the worst expectation of the cost function when the probability distribution belongs to a neighborhood around the empirical distribution. This approach allows for a more flexible set of distributions, parameterized by a radius (denoted \(\rho\)) from the empirical measure.

The problem ends up being formulated as follows in the general case:

\[\min_\theta \sup_{\mathbb{Q}\in \mathcal{U}_\rho((\xi_i)_{i=1}^N)} \mathbb{E}_{\zeta\sim\mathbb{Q}}[L_\theta(\zeta)]\]

This formulation leaves to be chosen the size of the neighborhood \(\rho\) and the notion of “proximity” between the samples and the robust distribution \(\mathbb{Q}\).

Distributional robustness: divergences#

In order to define the ambiguity sets, various tools have been proposed [3] in the literature alongside WDRO, including but not limited to:

  • the \(KL\)-divergence,

  • the \(\chi^2\) distance,

  • the total variation distance,

  • the Wasserstein optimal transport cost, and variations thereof.

A popular technique is the use of \(\phi\)-divergences, i.e. distances of the form:

\[\mathcal{D}_\phi(\mathbb{P} \| \mathbb{Q}) := \mathbb{E}_\mathbb{Q}\left[\phi\left(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\right)\right].\]

They are appealing for their apparent simplicity, and because a lot of commonly used divergences in other fields, such as statistics, admit this representation. E.g., for the \(KL\)-divergence \(\phi(s)=\begin{cases}s\log(s) - s + 1 & \text{if } s\ge 0\\ +\infty & \text{otherwise}\end{cases}\), for total variation \(\phi(s)=\begin{cases}(s-1)^2 & \text{if } s\ge 0\\ +\infty & \text{otherwise}\end{cases}\). See [5] §2.2 for a detailed explanation of the ambiguity sets related to these divergences.

Fortunately, there exist a recent line of work aiming to engineer those ambiguity sets in practice. For a general setting, have a look to the python library called python-dro. For Wasserstein-DRO, you are in good place, see below.

Note

About \(\phi\)-divergences.

These divergences have a common noteworthy property: the “true” (and inaccessible) distribution that the observed samples originate from, denoted \(\mathbb{P}\), lies out of the neighborhood around the empirical distribution \(\hat{\mathbb{P}}^N\) with probability 1 as long as it is absolutely continuous with respect to the Lebesgue measure. Depending on the nature of the problem at hand, this may be a real issue, which can be adressed by picking a different notion of neighborhood, as explained bellow.

Distributional robustness: optimal transport#

A recent push has been made towards wider neighborhoods notions for DRO than what we explored in our last chapter, lead by the WDRO formulation. Its ambiguity set relies on the Wasserstein optimal transport cost. Dating back to Monge’s “Mémoire sur la théorie des déblais et des remblais” (1781), the optimal transport mathematical framework sparked interest for incorporating geometrical properties into the notion of distance/divergence between distributions.

The transport cost is defined through a so-called “ground cost”, denoted \(c\), defining how much needs to be paid to transport an input \(\xi\) to any other valid input \(\zeta\in\Xi\) (usually such that \(c(\xi, \xi)=0\), and \(c(\xi, \zeta)>0\), \(\forall\zeta\neq\xi\), e.g. a distance function). Thus, given this geometrical insight on the studied space of uncertainty, the average transport cost \(c(\xi, \zeta)\) from reference samples \(\xi\sim\hat{\mathbb{P}}^N\) to their robust counterparts \(\zeta\sim\mathbb{Q}\) appears as a natural way to prescribe the transport cost from \(\hat{\mathbb{P}}^N\) to \(\mathbb{Q}\).

Def.: Transport plan.

A transport plan between two measures \(\mathbb{X}\) and \(\mathbb{Y}\) is a measure on the product of their supports \(\Xi^2\) that has as first and second marginals respectively \(\mathbb{X}\) and \(\mathbb{Y}\).

We denote by \(\Pi(\mathbb{X}, \mathbb{Y})\) the set of all these possible transport plans:

\[\Pi(\mathbb{X}, \mathbb{Y}):=\left\lbrace \pi\in\mathcal{M}(\Xi^2) \mid \int_{\zeta\in\Xi} \mathrm{d}\pi(A, \zeta) = \mathrm{d}\mathbb{X}(A)\text{ and }\int_{\xi\in\Xi} \mathrm{d}\pi(\xi, B) = \mathrm{d}\mathbb{Y}(B)\right\rbrace.\]

Later, we will denote the marginals respectively by \([\pi]_1\) and \([\pi]_2\).

Now that we have this notion of transport between distribution, we can recall the definition of the Wasserstein distance at the core of modern distributionally robust optimisation:

Def.: Wasserstein transport cost.

Let \(c\) a ground cost, and two distributions on \(\Xi\), \(\mathbb{X}\) and \(\mathbb{Y}\).

\[W_c(\mathbb{X}, \mathbb{Y}) := \inf_{\pi\in\Pi(\mathbb{X}, \mathbb{Y})}\mathbb{E}_{(\xi, \zeta)\sim\pi}\left[c(\xi, \zeta)\right]\]

Note

An interesting property of the transport cost with respect to its ground cost is that if \(c\) is a distance risen to some power \(1\le p\le\infty\), then \(\sqrt[p]{W_c}\) becomes a distance on the space of measures \(\mathcal{M}(\Xi)\). This yields the acclaimed Wasserstein distance:

Def.: Wasserstein distances of order \(p\), or “\(p\)-Wasserstein-distances”.

If \(\Xi\) is endowed with a distance function \(d_\Xi: \Xi^2\to\mathbb{R}^+\), then we call \(p\)-Wasserstein-distances the transport cost associated with the \(d_\Xi^p\) ground cost:

\[ \begin{align}\begin{aligned}\DeclareMathOperator*{\esssup}{ess\,sup}\\\begin{split}\begin{cases} W_p(\mathbb{X}, \mathbb{Y}) := \inf_{\pi\in\Pi(\mathbb{X}, \mathbb{Y})}\sqrt[p]{\mathbb{E}_{(\xi, \zeta)\sim\pi}\left[d_\Xi(\xi, \zeta)^p\right]} & \text{if }p\in\mathbb{N}^*\\ W_\infty(\mathbb{X}, \mathbb{Y}) := \inf_{\pi\in\Pi(\mathbb{X}, \mathbb{Y})}\esssup_{(\xi, \zeta)\sim\pi} d(\xi, \zeta) & \text{otherwise.} \end{cases}\end{split}\end{aligned}\end{align} \]

WDRO in a nutshell#

Considering what we noted about \(\phi\)-divergences, in that they are limited to rebalancing histograms thus lacking representation power, we may turn to the Wasserstein type of ambiguity sets, which are advertized as a premium option to improve generalization. This leaves as main problem the following:

(2)#\[\min_\theta \sup_{W_c(\hat{\mathbb{P}}^N, \mathbb{Q})\le\rho} \mathbb{E}_{\zeta\sim\mathbb{Q}}[L_\theta(\zeta)].\]

Several parts of the literature focus on providing a dual formula for (2), which holds under mild assumptions:

(3)#\[\min_{\theta, \lambda\ge 0} \lambda\rho + \mathbb{E}_{\xi\sim\hat{\mathbb{P}}^N}\left[\sup_{\zeta\in\Xi}\left\lbrace L_\theta(\zeta)-\lambda c(\xi, \zeta)\right\rbrace\right]\]

Its main advantage is the fact that it switched from a variational infinite-dimentional problem of finding a worst measure to a (usually) finite-dimensional problem of finding a worst input.

Now one must take note that the inner supremum of (3) is still to be taken with utmost care: if \(\Xi\) is not bounded, and \(f_\theta(\cdot)-\lambda c(\xi, \cdot)\) grows large by any means, then the problem is ill-posed. Note is also to be taken that even if the supremum is attained, it could be computationally intractable depending on the nature of \(\hat{\mathbb{P}}^N\), \(\Xi\), \(c\), and \(f_\theta\). Hence, even though this problem is easier than its primal counterpart, it needs more structure to be amenable to high-dimensional problems. See the next tutorial for more insights on this.

Some instances of reformulated WDRO problems#

In some cases, the WDRO problem may be reformulated into a convex finite-dimensional program, that one can solve with disciplined programming (e.g. the cvxpy python library). Many of those can be found in the seminal work of [4].

Model

loss structure

Input set \(\Xi\)

WD-Robust formulation

Source

Logistic regression

\(f_\theta(\xi):=\log(1+e^{-y\left\langle\theta\mid x\right\rangle})\)

\(\mathbb{R}^d\)

\(\min_\theta \rho \text{Lip}(f)\|\theta\|_* + \mathbb{E}_{\xi\sim\hat{\mathbb{P}}^N}\left[f_\theta(\xi)\right]\)

[2], [1]

SVM

\(f_\theta\) lipschitz, norm constraint on \(\theta\)

\(\mathbb{R}^d\)

Idem

[1]

Convex functions

\(f_\theta\) input-convex with any parametrization \(\theta\)

\(\mathbb{R}^d\)

\(\min_\theta \rho\kappa_\theta + \mathbb{E}_{\xi\sim\hat{\mathbb{P}}^N}\left[f_\theta(\xi)\right]\), (see more about [kappa])

[4]

Piecewise-affine (convex)

\(f_{A, b}(\xi):=\max_i(A\xi+b)_i\)

\(\{\xi|C\xi\le d\}\)

\(\begin{align}\min_\theta\inf_{\lambda, s_i, \Gamma_i\ge 0} & \lambda\rho + \sum_{i=1}^Ns_i\\ \text{s.t.} & A\xi_i+b+\Gamma_i(d-C\xi_i)\le s_i\mathbb{1}\\ & \|C^T\Gamma_i-A\|_{*, \infty}\le\lambda\end{align}\)

[4]

Piecewise-affine (concave)

\(f_{A, b}(\xi):=\min_i(A\xi+b)_i\)

\(\{\xi|C\xi\le d\}\)

\(\begin{align}\min_\theta\inf_{\lambda, s_i, g_i\ge 0, t_i\ge 0} & \lambda\rho + \sum_{i=1}^Ns_i\\ \text{s.t.} &\langle t_i | A\xi_i+b\rangle+\langle g_i | d-C\xi_i\rangle\le s_i\\ &\|(C^T\Gamma_i-A^Tt_i)_{i, :}\|_{*, \infty}\le\lambda\\ &\langle t_i | \mathbb{1}\rangle = 1\end{align}\)

[4]

[kappa]

Here the authors define a notion of growth rate \(\kappa_\theta\) reminiscent of [6], defined here as \(\sup_{\theta | f_\theta^*(\upsilon)<\infty}\|\upsilon\|_*\)

Note that the convex case is the most general, but it requires a good knowledge of the loss function through the \(\kappa\) constant.

What should I turn to?#

If you just want to rebalance the histograms of the dataset to robustify your model against lack of balanceness, you should try directly divergence-based DRO, as it is almost always easier to formulate.Otherwise, if you want to see the support of your samples be perturbated as well, between all the techniques available WDRO is one of the most promising ones in the litterature.

  • If the model you study admits a disciplined-programming reformulation as described above, you should implement it as is because it will usually be efficient enough.

  • Otherwise, if it is too complex, classical WDRO will usually be untractable as is. SkWDRO may thus turn out to be quite handy to still robustify your model, without turning your back on the WDRO framework, by smoothing its formulation.

Note

Whatever model you pick, empirically, adding complexity will almost always result in higher computational cost in this case. So as a rule of thumb, you should first try simpler robustification techniques that catter to your needs. Then add complexity layers if the framework you are considering does not handle your specific model, with our library handling the most general cases to the best of our knowledge, at the cost of some computation time.

Conclusion#

We saw that a lot of models are already well-studied through the lense of WDRO when it comes to robustness, but we lack techniques to robustify efficiently losses on which we lack knowledge (e.g. big neural nets). While it remains very relevant to some of the problems mentionned above, this questions its applicability in real world scenarii: in this library we propose to turn to a regularization technique to make the bigger and tougher models amenable to robustness.

See the next tutorial on Sinkhorn-WDRO to understand how we make it happen.

Next#

References#