Train the network with the following objective:
min_θ max_{d(x,x’)<ε, y’≠y} [ L(θ, x’, y) − L(θ, x’, y’) ]
The idea is to minimize the maximal difference between L(θ, x’, y) and L(θ, x’, y’). If L(θ, x’, y) − L(θ, x’, y’) < 0 for every perturbed x’ and every wrong label y’, the correct label always has the lowest loss, so the classification stays correct on the entire ε-ball.
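Concretely, with cross-entropy loss this condition says the correct label has strictly the lowest loss at x’. A toy check of the margin condition (logits are made up for illustration):

```python
import numpy as np

# Toy check of the margin condition under cross-entropy loss.
# The logits are hypothetical values standing in for a network's output at x'.
logits = np.array([3.0, 0.5, -1.0])
log_probs = logits - np.log(np.sum(np.exp(logits)))
losses = -log_probs                  # cross-entropy loss for each candidate label
y = 0                                # true label

# L(θ, x', y) - L(θ, x', y') for every wrong label y':
margins = losses[y] - np.delete(losses, y)
print(np.all(margins < 0))           # True => x' is classified correctly
```

If all margins are negative, the argmin over per-label losses is the true label, i.e. the prediction at x’ is correct.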
Instead of under-approximating the inner max problem through attacking (as in adversarial training), we over-approximate it using abstract interpretation, which gives a sound upper bound on the inner max.
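As one concrete instance of abstract interpretation, the box (interval) domain can be propagated through the network to bound the logits over the whole ε-ball; the weights below are made-up placeholders for a tiny 2-layer ReLU network, not the method's actual model:

```python
import numpy as np

# Interval bound propagation (box abstract domain) through a tiny ReLU network.
# Hypothetical weights; in practice these are the parameters being trained.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[1.0, 0.5], [-1.0, 1.5]])
b2 = np.array([0.0, 0.0])

def affine_bounds(lo, hi, W, b):
    # For z = Wx + b, split W into positive/negative parts for sound bounds.
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def ibp(x, eps):
    lo, hi = x - eps, x + eps                        # L_inf ball around x
    lo, hi = affine_bounds(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)    # ReLU is monotone
    return affine_bounds(lo, hi, W2, b2)             # bounds on the logits

lo, hi = ibp(np.array([1.0, 0.0]), eps=0.1)
# Upper bound on logit_{y'} - logit_y over the whole ball (y = 0, y' = 1):
margin_ub = hi[1] - lo[0]
print(margin_ub < 0)   # True => class 0 is certified for every x' in the ball
```

Because the bound over-approximates the reachable outputs, a negative upper bound certifies the sample; the (differentiable) bound can then be plugged in as the training loss in place of the exact inner max.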
Repeat the process for every sample in the training set => ensure the bounded maximal difference is negative for every sample
Limitation: Not scalable