Train the network with the following objective:
min_θ max_{d(x,x’)<ε, y’≠y} [ L(θ, x’, y) − L(θ, x’, y’) ]
The idea is to minimize the maximal difference between L(θ, x’, y) and L(θ, x’, y’). If L(θ, x’, y) − L(θ, x’, y’) < 0 for every perturbed x’ and every wrong label y’, the correct label always has the lowest loss, so the classification stays correct on the entire ε-ball.
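Concretely, with cross-entropy loss this condition says the correct label has strictly the lowest loss at x’. A toy check of the margin condition (logits are made up for illustration):

```python
import numpy as np

# Toy check of the margin condition under cross-entropy loss.
# The logits are hypothetical values standing in for a network's output at x'.
logits = np.array([3.0, 0.5, -1.0])
log_probs = logits - np.log(np.sum(np.exp(logits)))
losses = -log_probs                  # cross-entropy loss for each candidate label
y = 0                                # true label

# L(θ, x', y) - L(θ, x', y') for every wrong label y':
margins = losses[y] - np.delete(losses, y)
print(np.all(margins < 0))           # True => x' is classified correctly
```

If all margins are negative, the argmin over per-label losses is the true label, i.e. the prediction at x’ is correct.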
Instead of under-approximating the inner max problem through attacking (as in adversarial training), we over-approximate it using abstract interpretation, which gives a sound upper bound on the inner max.
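As one concrete instance of abstract interpretation, the box (interval) domain can be propagated through the network to bound the logits over the whole ε-ball; the weights below are made-up placeholders for a tiny 2-layer ReLU network, not the method's actual model:

```python
import numpy as np

# Interval bound propagation (box abstract domain) through a tiny ReLU network.
# Hypothetical weights; in practice these are the parameters being trained.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[1.0, 0.5], [-1.0, 1.5]])
b2 = np.array([0.0, 0.0])

def affine_bounds(lo, hi, W, b):
    # For z = Wx + b, split W into positive/negative parts for sound bounds.
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def ibp(x, eps):
    lo, hi = x - eps, x + eps                        # L_inf ball around x
    lo, hi = affine_bounds(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)    # ReLU is monotone
    return affine_bounds(lo, hi, W2, b2)             # bounds on the logits

lo, hi = ibp(np.array([1.0, 0.0]), eps=0.1)
# Upper bound on logit_{y'} - logit_y over the whole ball (y = 0, y' = 1):
margin_ub = hi[1] - lo[0]
print(margin_ub < 0)   # True => class 0 is certified for every x' in the ball
```

Because the bound over-approximates the reachable outputs, a negative upper bound certifies the sample; the (differentiable) bound can then be plugged in as the training loss in place of the exact inner max.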
Repeat the process for every sample in the training set => ensure the bounded maximal difference is negative for every sample
Limitation: Not scalable