the d function can be learned by a Siamese network
input: x
network: cnn, cnn, cnn, fc, fc (?)
which outputs a vector f(x) of 128 elements
you feed images x1 and x2 into the same network (same params) and get f(x1) and f(x2)
then d(x1, x2) = || f(x1) - f(x2) ||^2 (the squared L2 norm is the usual form, e.g. in FaceNet)
(norm of the difference of f(x1) and f(x2))
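a minimal sketch of the distance above, in Python with NumPy. The real f would be the conv + fc stack; here a single random linear layer W stands in for it, and W and the 32x32 input size are assumptions made only for illustration. The point is that both images go through the same f with the same weights, and d is the squared norm of the difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the shared network f: one linear layer mapping a
# flattened 32x32 image to a 128-element vector (not the conv + fc stack).
W = rng.standard_normal((128, 32 * 32)) / 32.0

def f(x):
    """Embed one 32x32 image as a 128-element vector."""
    return W @ x.reshape(-1)

def d(x1, x2):
    """Squared L2 norm of the difference between the two embeddings."""
    diff = f(x1) - f(x2)
    return float(diff @ diff)

x1 = rng.random((32, 32))
x2 = rng.random((32, 32))
print(f(x1).shape)  # (128,)
print(d(x1, x1))    # 0.0
print(d(x1, x2))    # some positive number
```

with random weights d carries no meaning yet; it only becomes useful once the params are trained.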
learn params so that
if x1, x2 are the same person, then d is small
if x1, x2 are different people, then d is large
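one way to turn these two conditions into a training objective is a contrastive-style loss. The notes do not say which loss is used, so this is an assumption, and the function name, the margin parameter, and its value are all hypothetical. Same-person pairs are penalized for any distance; different-person pairs are penalized only while d is below the margin.

```python
import numpy as np

def contrastive_loss(dist, same, margin=1.0):
    """Simplified contrastive-style loss (an assumed choice, not from the notes).

    dist: d(x1, x2) values for a batch of pairs
    same: 1 if the pair is the same person, 0 otherwise
    """
    dist = np.asarray(dist, dtype=float)
    same = np.asarray(same, dtype=float)
    pull = same * dist                                     # same person: want d small
    push = (1.0 - same) * np.maximum(0.0, margin - dist)   # different: want d >= margin
    return float(np.mean(pull + push))

# First pair in each batch is "same person", second is "different".
good = contrastive_loss([0.1, 2.0], [1, 0])  # same pair close, different pair far
bad = contrastive_loss([2.0, 0.1], [1, 0])   # same pair far, different pair close
print(good < bad)  # True
```

minimizing this over the params of f pushes d down for same-person pairs and up for different-person pairs, which is the behavior described above.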