Recipe for a good sampling strategy for SkWDRO#

Practice

Tip

Read the tutorial on SkWDRO to better understand this part.

Recall the formula for SkWDRO:

\[L_\theta^\texttt{robust}(\xi) := \lambda\rho + \varepsilon\log\mathbb{E}_{\zeta\sim\nu_\xi}\left[e^{\frac{L_\theta(\zeta)-\lambda c(\xi, \zeta)}{\varepsilon}}\right].\]

The distribution \(\nu_\xi\) under its inner expectation is crucial to the estimation, as it encodes the a priori knowledge you incorporate into the estimation of the true risk measure. This page answers the question of how to pick this crucial hyperparameter.
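To make the role of \(\nu_\xi\) concrete, here is a minimal sketch of a Monte Carlo estimate of the formula above, written in plain PyTorch. The function name `robust_loss_estimate`, the toy loss, the squared-Euclidean cost, and the Gaussian choice for \(\nu_\xi\) are all assumptions made for illustration; this is not the library's internal implementation.

```python
import math
import torch


def robust_loss_estimate(loss, cost, xi, zeta, lam, rho, eps):
    """Monte Carlo estimate of lam*rho + eps*log E_{zeta~nu_xi}[exp((L - lam*c)/eps)].

    xi:   (batch, d) reference points
    zeta: (n_samples, batch, d) draws from nu_xi, one set per reference point
    """
    # Exponent of the inner expectation, one value per (sample, reference point).
    exponent = (loss(zeta) - lam * cost(xi.unsqueeze(0), zeta)) / eps
    # log of the empirical mean of exp(exponent), computed stably with logsumexp.
    log_mean_exp = torch.logsumexp(exponent, dim=0) - math.log(zeta.size(0))
    return lam * rho + eps * log_mean_exp  # shape (batch,)


torch.manual_seed(0)
xi = torch.randn(5, 2)
zeta = xi + 0.1 * torch.randn(64, 5, 2)      # assumed nu_xi = N(xi, 0.1^2 I)
loss = lambda z: z.pow(2).sum(-1)            # hypothetical loss L_theta
cost = lambda x, z: (x - z).pow(2).sum(-1)   # hypothetical squared-Euclidean cost
robust = robust_loss_estimate(loss, cost, xi, zeta, lam=1.0, rho=0.1, eps=0.01)
```

Every sample of \(\zeta\) enters the estimate through the exponent, so the spread and location of \(\nu_\xi\) directly control which perturbations of \(\xi\) the robust loss can see.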

Use pre-built options#

The library ships various pre-built combinations of distributions, covering models with and without targets (more details in the WDRO presentation).

Problem-specific examples#

Some samplers for specific robust models have been implemented, both as ready-to-use options and as guidelines whose source code can inspire your own samplers.

Almost all of the pre-implemented samplers rely on a normal distribution for the input data, with varying distributions for the targets (which, recall, are called labels in the code’s nomenclature in order to cater to machine learning applications).

- For a Bernoulli law on targets \(\{-1, 1\}\), see skwdro.base.samplers.torch.ClassificationNormalBernouilliSampler.
- To sample a Dirac distribution on \(\xi_\texttt{labels}\), see skwdro.base.samplers.torch.ClassificationNormalIdSampler.
- If the target’s space has less structure, use a normal distribution (e.g. for regression tasks, even though the name suggests otherwise), see skwdro.base.samplers.torch.ClassificationNormalNormalSampler.
- If you have problems that do not have targets, you can draw inspiration from either skwdro.base.samplers.torch.PortfolioNormalSampler or skwdro.base.samplers.torch.NewsVendorNormalSampler. For an example without a normal distribution on the inputs, see skwdro.base.samplers.torch.PortfolioLaplaceSampler.

Cost-based sampling#

If your cost functional has been crafted with care, it might include a default sampling option defined as a singleton or a pair of torch.distributions.Distribution object(s). In this case, defining a sampler is more natural and can be done from the cost’s methods through two helper classes.

class skwdro.base.samplers.torch.LabeledCostSampler(cost: TorchCost, xi: Tensor, xi_labels: Tensor, sigma, seed: int | None = None)[source]
reset_mean(xi: Tensor, xi_labels: Tensor | None)[source]

Reset the sampler instance parametrization. Must be overridden when a subclass is made to describe it.

Parameters:
xi: torch.Tensor

part of the parametrization of the sampler that concerns the input variables

xi_labels: torch.Tensor|None

part of the parametrization of the sampler that concerns the labels variables

sample(n_samples: int)

Override this method to make a custom sampling mechanism from scratch. It should output a pair of tensors for xi and xi_labels.

Parameters:
n_samples: int

number of samples to draw

Returns:
zeta: torch.Tensor

input samples drawn

zeta_labels: torch.Tensor|None

input targets drawn, if any

class skwdro.base.samplers.torch.NoLabelsCostSampler(cost: TorchCost, xi: Tensor, sigma, seed: int | None = None)[source]
reset_mean(xi: Tensor, xi_labels: Tensor | None)[source]

Reset the sampler instance parametrization. Must be overridden when a subclass is made to describe it.

Parameters:
xi: torch.Tensor

part of the parametrization of the sampler that concerns the input variables

xi_labels: torch.Tensor|None

part of the parametrization of the sampler that concerns the labels variables

sample(n_samples: int)

Override this method to make a custom sampling mechanism from scratch. It should output a pair of tensors for xi and xi_labels.

Parameters:
n_samples: int

number of samples to draw

Returns:
zeta: torch.Tensor

input samples drawn

zeta_labels: torch.Tensor|None

input targets drawn, if any

Build your own sampler#

The library exposes three main interfaces to make your own sampling strategy. What they will implement is a conditional measure of probability \(\nu_\xi\) that, given a reference \(\xi\) value (or batch thereof), will sample a batch of \(\zeta\) realisations.

Some background to make educated guesses#

This measure stems theoretically from the entropic regularization of the WDRO problem: adding the term \(\mathcal{D}_\text{KL}(\pi, \pi_0)\) to the objective function of the primal problem produces, after some manipulations induced by Lagrangian duality (see [1]), the logsumexp structure of the formula recalled at the top of this page. In this procedure, a disintegration lemma must be used to split the reference generator \(\pi_0(\xi, \zeta)\) into two parts. Under heavy abuse of notation, it can be written as follows.

\[P_{\pi_0}(\xi\in X\cap\zeta\in Y) = \underbrace{P_{\pi_0}(\xi\in X)}_{\hat{\mathbb{P}}^N(X)}\underbrace{P_{\pi_0}(\zeta\in Y|\xi\in X)}_{\nu_\xi(Y)}.\]
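Read as a sampling recipe, the factorization above says that a draw from \(\pi_0\) can be obtained in two stages. The following sketch illustrates this in plain PyTorch, assuming a Gaussian \(\nu_\xi\) centred at \(\xi\); the data, the scale `sigma`, and the batch sizes are hypothetical.

```python
import torch

torch.manual_seed(0)
data = torch.randn(100, 3)   # the N observed samples behind the empirical distribution
sigma = 0.1                  # assumed scale of nu_xi

# First factor: draw xi from the empirical distribution (uniform over the data).
idx = torch.randint(0, data.size(0), (8,))
xi = data[idx]                                   # shape (8, 3)

# Second factor: draw zeta from nu_xi, assumed here to be N(xi, sigma^2 I).
nu_xi = torch.distributions.Normal(loc=xi, scale=sigma)
zeta = nu_xi.sample(torch.Size((16,)))           # shape (16, 8, 3)
```

The leading axis of `zeta` indexes the draws and the second one indexes the reference points, which is the same layout as the one used by the sampler example later on this page.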

Note

A common rule of thumb is that, if \(\nu_\xi\) does depend on \(\xi\), it should average to it, i.e.

\[\mathbb{E}_{\zeta\sim\nu_\xi}[\zeta|\xi]\approx\xi.\]

This property is by no means necessary, but it helps in verifying one of the assumptions of [1], namely strict feasibility of the reference distribution (i.e. \(\mathbb{E}_{(\xi, \zeta)\sim\pi_0}[c]<\rho\)). Otherwise, carefully choosing \(\nu_\xi\) so that it is independent of \(\xi\) may be relevant for some applications.

In general, one must consider well-posed problems in which \(\nu_\xi\) has good properties: it is not too far from \(\hat{\mathbb{P}}^N\), yet it explores far enough from it to provide robustification, i.e. it gets closer to the distribution \(\mathbb{P}\) of the true problem.
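Both conditions discussed in this note can be checked numerically before training. The sketch below does so for a Gaussian \(\nu_\xi=\mathcal{N}(\xi, \sigma^2 I)\) and a squared-Euclidean cost, both of which are assumptions made for the example; in that setting the expected cost has the closed form \(d\sigma^2\), so strict feasibility amounts to \(d\sigma^2<\rho\).

```python
import torch

torch.manual_seed(0)
d, sigma, rho = 3, 0.1, 0.05                    # hypothetical dimension, scale, radius
xi = torch.randn(100, d)
zeta = xi + sigma * torch.randn(2000, 100, d)   # assumed nu_xi = N(xi, sigma^2 I)

# How far the conditional mean E[zeta | xi] is from xi (should be close to 0).
mean_gap = (zeta.mean(dim=0) - xi).abs().max()

# Monte Carlo estimate of E_{pi_0}[c] for the squared-Euclidean cost.
avg_cost = (zeta - xi).pow(2).sum(-1).mean()
expected_cost = d * sigma ** 2                  # closed form for this Gaussian choice

strictly_feasible = bool(avg_cost < rho)
```

If `strictly_feasible` comes out false, either a smaller `sigma` or a larger `rho` restores the assumption in this toy setting.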

The common interface for samplers#

To build your own sampler, you can write a class that inherits from skwdro.base.samplers.torch.BaseSampler or, to be more precise, from one of the following classes:

class skwdro.base.samplers.torch.LabeledSampler(data_sampler: Distribution, labels_sampler: Distribution, seed: int | None)[source]
abstractmethod reset_mean(xi: Tensor, xi_labels: Tensor | None)

Reset the sampler instance parametrization. Must be overridden when a subclass is made to describe it.

Parameters:
xi: torch.Tensor

part of the parametrization of the sampler that concerns the input variables

xi_labels: torch.Tensor|None

part of the parametrization of the sampler that concerns the labels variables

sample(n_samples: int)[source]

Override this method to make a custom sampling mechanism from scratch. It should output a pair of tensors for xi and xi_labels.

Parameters:
n_samples: int

number of samples to draw

Returns:
zeta: torch.Tensor

input samples drawn

zeta_labels: torch.Tensor|None

input targets drawn, if any

class skwdro.base.samplers.torch.NoLabelsSampler(data_sampler: Distribution, seed: int | None)[source]
abstractmethod reset_mean(xi: Tensor, xi_labels: Tensor | None)

Reset the sampler instance parametrization. Must be overridden when a subclass is made to describe it.

Parameters:
xi: torch.Tensor

part of the parametrization of the sampler that concerns the input variables

xi_labels: torch.Tensor|None

part of the parametrization of the sampler that concerns the labels variables

sample(n_samples: int)[source]

Override this method to make a custom sampling mechanism from scratch. It should output a pair of tensors for xi and xi_labels.

Parameters:
n_samples: int

number of samples to draw

Returns:
zeta: torch.Tensor

input samples drawn

zeta_labels: torch.Tensor|None

input targets drawn, if any

These two templates leave only two methods for you to define: the constructor and the special skwdro.base.samplers.torch.BaseSampler.reset_mean() method, which defines how to change the parameters of \(\nu_\xi\) when \(\xi\) changes (dynamically resetting the torch.distributions.Distribution objects if need be). Although it is not mandatory, you may also override other methods to your liking, including the central skwdro.base.samplers.torch.BaseSampler.sample() method used by the library.

Learning by examples: a case study on mixed-features WDRO#

Say now that you want to implement a logistic regression model based on a mixture of continuous and discrete features, as described in [2], which proposes, among other tools, a cutting-plane algorithm to solve WDRO formulations with such features. As a case study, we explain here how to use SkWDRO to approximate the solution of this problem.

Let’s consider that the problem is formulated such that the n_continuous_features + n_discrete_features features are concatenated in an input variable xi, and that the target labels are in xi_labels. Consider the discrete features to be first one-hot encoded and then recentred to \(\{-1, 1\}\), just like in the usual logistic regression from the documentation. Here is how we would build the sampler for such a problem in PyTorch.
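Before building the sampler, the input layout described above can be prepared as in the following sketch. The sizes and variable names are hypothetical, and it is assumed here that n_discrete_features counts the one-hot columns rather than the original categorical variables.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_rows, n_classes = 6, 4                               # hypothetical sizes
x_cont = torch.randn(n_rows, 3)                        # continuous features
categories = torch.randint(0, n_classes, (n_rows,))    # one categorical feature

one_hot = F.one_hot(categories, num_classes=n_classes)  # entries in {0, 1}
x_discr = (2 * one_hot - 1).to(x_cont.dtype)            # recentred to {-1, +1}

# Continuous features first, discrete ones after, concatenated on the last axis.
xi = torch.cat([x_cont, x_discr], dim=-1)               # shape (6, 3 + 4)
```

Casting the discrete part to the dtype of the continuous part keeps the whole input in one common floating type.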

Constructor method for mixed features sampler#
import torch
import torch.distributions as dst
from skwdro.base.samplers.torch import LabeledSampler, IsOptionalCovarianceSampler


class MixedFeaturesSampler(LabeledSampler, IsOptionalCovarianceSampler):
    """
    This class samples both continuous and discrete features of the design space.
    The inputs ``xi`` are assumed to follow a layout in which the continuous
    features are arranged at the beginning of the features vector and the
    discrete ones follow, with the labels treated separately as the
    ``xi_labels`` variable.
    The concatenation in ``xi`` of the features must be done on the last axis,
    the categorical variables are encoded in {-1, +1} with -1 representing the negation
    of the class at a given index while +1 represents a realisation of the class.
    Any value between -1 and 1 may also be used to represent an unsure class. To allow
    this, all the tensors must be encoded as one common floating type (e.g. float32).
    """
    data_s: dst.MultivariateNormal
    labels_s: dst.TransformedDistribution
    discrete_features_s: dst.TransformedDistribution

    def __init__(
        self,
        xi, xi_labels,
        n_continuous_features,
        n_discrete_features,
        # Parameter of the Bernoulli laws driving the class switches
        p,
        *,
        # reusing the same trick for covariance matrices as in the logreg
        sigma=None,
        tril=None,
        prec=None,
        cov=None,
        seed=None,
    ):
        assert 0. <= p <= 1.
        self.p = p

        self.n_continuous_features = n_continuous_features
        self.n_discrete_features = n_discrete_features
        assert xi.size(-1) == n_continuous_features + n_discrete_features

        # Split the inputs into their continuous and discrete parts
        xi_cont, xi_discr = torch.split(
            xi, [n_continuous_features, n_discrete_features], dim=-1
        )

        # See the source code of IsOptionalCovarianceSampler to see how
        # to specify covariance matrices
        covar = self.init_covar(n_continuous_features, sigma, tril, prec, cov)

        # Recycle code from usual logreg samplers
        super().__init__(
            # Continuous part of the input sampler
            dst.MultivariateNormal(
                loc=xi_cont,
                **covar  # type: ignore
            ),
            # Labels sampler: the affine map sends a Bernoulli draw of 1
            # to xi_labels and a draw of 0 to -xi_labels
            dst.TransformedDistribution(
                dst.Bernoulli(p),
                dst.transforms.AffineTransform(
                    loc=-xi_labels,
                    scale=2 * xi_labels
                )
            ),
            seed
        )

        # Discrete part of the input sampler
        # Implements a random switch of the class indicator for each class of
        # each discrete feature, with the same affine map as for the labels.
        self.discrete_features_s = dst.TransformedDistribution(
            dst.Bernoulli(p),
            dst.transforms.AffineTransform(
                loc=-xi_discr,
                scale=2 * xi_discr
            )
        )
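The Bernoulli-plus-affine construction used twice in this constructor can be inspected in isolation. The sketch below uses plain PyTorch on a tiny hypothetical tensor of \(\{-1, +1\}\) indicators and measures how often a value survives unchanged: a Bernoulli draw of 1 is mapped to the original value and a draw of 0 to its opposite, so in this construction the Bernoulli parameter is the probability of keeping the value.

```python
import torch
import torch.distributions as dst

torch.manual_seed(0)
# Two reference points with three class indicators each, encoded in {-1, +1}.
xi_discr = torch.tensor([[1., -1., -1.], [-1., 1., -1.]])
p = 0.9

flip = dst.TransformedDistribution(
    # One Bernoulli per entry; probabilities are given explicitly so that the
    # batch shape matches xi_discr.
    dst.Bernoulli(torch.full_like(xi_discr, p)),
    # b = 1 -> -xi + 2*xi = xi (kept), b = 0 -> -xi (switched).
    dst.transforms.AffineTransform(loc=-xi_discr, scale=2 * xi_discr),
)

zeta_discr = flip.sample(torch.Size((1000,)))         # shape (1000, 2, 3)
kept_fraction = (zeta_discr == xi_discr).float().mean()
```

With this parametrization, `kept_fraction` concentrates around `p`, which is worth keeping in mind when choosing the value passed to the sampler.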

The method that we should add to this class is the sampling procedure. It is composed of the sampling of the continuous input variable, its categorical part, and the perturbation of the labels.

Sampling strategy for mixed-features regression#
    def sample_labels(self, n_sample):
        """
        Samples the target labels through Bernoulli swaps.
        Overrides w/ ``sample`` to prevent ``rsample`` from crashing since Bernoulli
        isn't reparametrizable.
        """
        zeta_labels = self.labels_s.sample(torch.Size((n_sample,)))
        assert isinstance(zeta_labels, torch.Tensor)
        return zeta_labels

    def sample_discrete_features(self, n_sample):
        """
        Samples the categorical (discrete) input features through Bernoulli swaps.
        Uses ``sample`` to prevent ``rsample`` from crashing since Bernoulli
        isn't reparametrizable.
        """
        zeta_discrete = self.discrete_features_s.sample(torch.Size((n_sample,)))
        assert isinstance(zeta_discrete, torch.Tensor)
        return zeta_discrete

    def sample(self, n_samples):
        """
        Override of LabeledSampler's method.
        This is the function that will be called by the library internally to sample
        the conditional distribution of inputs ``(zeta, zeta_labels)``.
        """
        zeta_cont = self.sample_data(n_samples)
        zeta_discr = self.sample_discrete_features(n_samples)
        # Match the dtype and device of the continuous part before concatenating
        zeta = torch.cat([zeta_cont, zeta_discr.to(zeta_cont)], dim=-1)
        zeta_labels = self.sample_labels(n_samples)
        return zeta, zeta_labels
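The docstrings above mention that Bernoulli laws cannot be sampled with ``rsample``. This can be verified directly on torch.distributions objects, as in the following small sketch (the distributions and their parameters are arbitrary and only serve the illustration).

```python
import torch
import torch.distributions as dst

normal = dst.MultivariateNormal(torch.zeros(2), covariance_matrix=torch.eye(2))
flips = dst.TransformedDistribution(
    dst.Bernoulli(torch.tensor(0.9)),
    dst.transforms.AffineTransform(loc=-1.0, scale=2.0),
)

# The Gaussian supports reparametrized sampling, the transformed Bernoulli does not.
normal_ok = normal.has_rsample
flips_ok = flips.has_rsample

try:
    flips.rsample(torch.Size((4,)))
    rsample_failed = False
except NotImplementedError:
    rsample_failed = True

# Plain sampling (without gradients through the draw) works for both.
draws = flips.sample(torch.Size((4,)))
```

This is why the continuous part can go through the default data sampling path while the labels and discrete features need their own ``sample``-based methods.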

Finally, we tackle the mandatory part: resetting the sampler’s moments xi and xi_labels.

Reset method for mixed-features sampler#
    def reset_mean(self, xi, xi_labels):
        # Rebuild all the distributions around the new reference points,
        # reusing the Cholesky factor of the current covariance matrix.
        self.__init__(
            xi, xi_labels,
            self.n_continuous_features, self.n_discrete_features,
            self.p,
            tril=self.data_s._unbroadcasted_scale_tril
        )

Testing the snippets above on some fake data:

>>> # Gen some fake data
>>> xi_c = torch.randn((100, 3))
>>> xi_d = torch.randint(0, 2, (100, 10))
>>> xi_d[xi_d == 0] = -1
>>> xi_l = torch.randint(0, 2, (100, 1))
>>> xi_l[xi_l == 0] = -1
>>> xi = torch.cat((xi_c, xi_d), dim = -1)
>>> #
>>> # Test the sampler
>>> s = MixedFeaturesSampler(xi, xi_l, 3, 10, 0.1, sigma=0.1)
>>> print(s.sample(10)[0].shape)
torch.Size([10, 100, 13])
>>> s.reset_mean(torch.cat((xi_c, xi_d), dim = -1)*0.1, xi_l)
>>> print(s.sample(1)[0])
tensor([[[-0.0778,  0.2271, -0.3122,  ..., -0.1000, -0.1000, -0.1000],
         [-0.1395,  0.1225, -0.1504,  ..., -0.1000,  0.1000, -0.1000],
         [ 0.0092,  0.0816, -0.0500,  ..., -0.1000, -0.1000, -0.1000],
         ...,
         [-0.0248,  0.0565,  0.1841,  ..., -0.1000, -0.1000, -0.1000],
         [ 0.1627, -0.0517,  0.0814,  ..., -0.1000,  0.1000, -0.1000],
         [ 0.1598, -0.1213, -0.1780,  ..., -0.1000, -0.1000,  0.1000]]])

References#