Shao_Detecting_and_Grounding_Multi-Modal_Media_Manipulation

Self-attention layer

Abstract

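The paper poses a new research problem: how to detect and ground manipulations in multi-modal media

What does "multi-modal" mean here?

What background does grounding require?

Builds a new dataset

Proposes a model named HAMMER

What is this: an attack method or a detection method?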

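Goal: fully capture the fine-grained interactions between the different modalities

Implementation

What does all of this actually mean?

Shallow manipulation reasoning

Deep manipulation reasoning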

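A special form of contrastive learning is applied between the image encoder and the text encoder, so that both become more sensitive to the subtle differences introduced by manipulation. This yields basic, manipulation-sensitive image and text representations.

What is this special contrastive learning?

A multi-modal aggregation module lets the image and text representations interact deeply and fuses them densely. This yields higher-level multi-modal representations that are rich in manipulation information.

Detection and grounding modules are designed at different levels. The shallow ones ground simple manipulations, while the deep ones carry out more complex reasoning. The whole model therefore forms a hierarchical architecture that reasons about manipulation step by step to perform detection and grounding.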

Related Work

DeepFake Detection

Multi-Modal Misinformation Detection

Multi-Modal Media Manipulation Dataset

3.1. Source Data Collection

3.2. Multi-Modal Media Manipulation

Face Swap (FS) Manipulation (whole-face replacement)

Face Attribute (FA) Manipulation (facial attribute replacement)

Text Swap (TS) Manipulation (whole-text replacement)

Text Attribute (TA) Manipulation (partial text attribute replacement)

Combination and Perturbation

3.3. Dataset Statistics

Experiments

5.1. Benchmark for DGM4

Comparison with multi-modal learning methods.

Comparison with deepfake detection and sequence tagging methods

5.2. Experimental Analysis

Ablation study of two modalities.

Ablation study of losses.

Efficacy of LPAA.

Details of manipulation type detection.

Visualization of manipulation detection and grounding.

Visualization of attention map

Introduction

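Problems caused by current technology

Poses a new research problem: in the multi-modal setting, manipulation must be both detected and grounded

Main contributions

Poses a new problem

Builds a new dataset

Builds a new model

Shallow

Deep

Multiple

Existing work

Spatial

Frequency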

Keep the appropriate pairs to form the source pool $O = \{p_o \mid p_o = (I_o, T_o)\}$ for manipulation.

What is the significance of this formula?

SimSwap

InfoSwap

Generating the fake face

For each original image $I_o$, we choose one of the two approaches to swap the largest face $I_o^f$ with a random source face $I_{celeb}^f$ from the CelebA-HQ dataset [14], producing a face swap manipulation sample $I_s$.

Generating the grounding annotation

The MTCNN bbox of the swapped face, $y_{box} = \{x_1, y_1, x_2, y_2\}$, is then saved as the annotation for grounding.

Why were these two methods chosen to generate the fake faces?

First, a CNN is used to obtain the original facial expression

GAN-based methods, HFGI [47] and StyleCLIP [28], are then used to obtain the target expression

After obtaining the manipulated face $I_{emo}^f$, we re-render it back into the original image $I_o$ to obtain the manipulated sample $I_a$. The bbox $y_{box}$ is also provided.

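An NER model is used to extract the name of the main person

A semantic representation is computed for each text, and only texts $T_o'$ that are not semantically similar to the original text $T_o$ are selected. This ensures that the retrieved text differs in meaning from the original before the subsequent swap.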
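
The dissimilarity filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes cosine similarity as the measure, and the threshold `max_similarity=0.5`, the function names, and the two-dimensional toy embeddings are all hypothetical. In practice the embeddings would come from a sentence encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def select_dissimilar(original_emb, candidates, max_similarity=0.5):
    """Keep candidate texts whose embedding is NOT close to the original.

    candidates is a list of (text, embedding) pairs.
    max_similarity is an assumed threshold, not a value from the paper.
    """
    return [
        text
        for text, emb in candidates
        if cosine_similarity(original_emb, emb) < max_similarity
    ]


# Toy embeddings standing in for real sentence embeddings.
original = [1.0, 0.0]
candidates = [
    ("near paraphrase", [0.9, 0.1]),
    ("different story", [0.0, 1.0]),
]
selected = select_dissimilar(original, candidates)
```

With these toy vectors only the second candidate survives, because the first one points in almost the same direction as the original.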

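The tokens are annotated with an $M$-dimensional one-hot vector $y_{tok} = \{y_i\}_{i=1}^{M}$, where $y_i \in \{0, 1\}$ indicates whether the $i$-th token has been manipulated. Every token in $T_s$ is labeled this way: $y_i = 1$ if the $i$-th token was modified and $y_i = 0$ otherwise. The annotation records which parts of the text were manipulated.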
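
To make the shape of $y_{tok}$ concrete, here is a small sketch that builds the per-token labels for a manipulated caption. Deriving the labels by diffing against the original tokens with `difflib.SequenceMatcher` is an assumption made for illustration; the notes do not say how the labels are produced. The function name and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def token_labels(original_tokens, manipulated_tokens):
    """Return one 0/1 label per manipulated token (1 = changed)."""
    # Start by assuming every token was manipulated.
    labels = [1] * len(manipulated_tokens)
    matcher = SequenceMatcher(
        a=original_tokens, b=manipulated_tokens, autojunk=False
    )
    # Tokens inside a matching block are unchanged, so reset them to 0.
    for block in matcher.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            labels[j] = 0
    return labels


original = "the president gave a wonderful speech".split()
manipulated = "the president gave a terrible speech".split()
y_tok = token_labels(original, manipulated)
```

Here `y_tok` has one entry per token of the manipulated caption, and only the replaced word is marked.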

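A RoBERTa [24] model is used to split the captions into positive, negative and neutral sentiment corpora: $\{O_+, O_-, O_{neu}\}$.

All sentiment words in the original text $T_o$ are replaced with opposite-sentiment text generated by a B-GST model trained on the corpora $\{O_+, O_-\}$, yielding $T_a$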

Combination

Combine the obtained manipulation samples $I_s$, $I_a$, $T_s$ and $T_a$ with the original $(I_o, T_o)$ pairs

This forms a multi-modal manipulated media pool with full manipulation types: $P = \{p_m \mid p_m = (I_x, T_y),\ x, y \in \{o, s, a\}\}$. Each pair is annotated with:

a fine-grained manipulation type annotation $y_{mul}$

the aforementioned annotations $y_{box}$ and $y_{tok}$

a binary label $y_{bin}$

$y_{mul} = \{y_j\}_{j=1}^{4}$ is a 4-dimensional vector denoting whether the $j$-th manipulation type (i.e., FS, FA, TS, TA) appears in $p_m$

$y_{bin}$ describes whether the image-text pair $p_m$ is real or fake
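
The combination step and its two pair-level labels can be illustrated with a short sketch. It enumerates the nine $(I_x, T_y)$ combinations and derives $y_{mul}$ and $y_{bin}$ for each. The encoding $y_{bin} = 1$ for "fake", the function name, and the single-letter codes are assumptions made for this example.

```python
# Order follows the notes: FS, FA, TS, TA.
MANIPULATION_TYPES = ("FS", "FA", "TS", "TA")
KINDS = ("o", "s", "a")  # original, swap, attribute


def make_labels(image_kind, text_kind):
    """Return (y_mul, y_bin) for one image-text pair (I_x, T_y)."""
    if image_kind not in KINDS or text_kind not in KINDS:
        raise ValueError("kinds must be one of 'o', 's', 'a'")
    y_mul = [
        int(image_kind == "s"),  # FS: face swap
        int(image_kind == "a"),  # FA: face attribute
        int(text_kind == "s"),   # TS: text swap
        int(text_kind == "a"),   # TA: text attribute
    ]
    # Assumed encoding: 1 = fake (any manipulation present), 0 = real.
    y_bin = int(any(y_mul))
    return y_mul, y_bin


# All nine combinations of image kind and text kind.
pool_labels = {(x, y): make_labels(x, y) for x in KINDS for y in KINDS}
```

Only the `("o", "o")` pair is labeled real; a pair such as `("s", "a")` sets both the FS and TA entries, which is why $y_{mul}$ is multi-label rather than a single class index.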

Perturbation

Applied at random to 50% of the samples

JPEG compression

Gaussian Blur
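
The perturbation step can be sketched in plain Python. The block below reads the 50% note as a per-sample probability, which is an assumption, and implements only a 3x3 Gaussian blur on a small grayscale grid. JPEG compression is left out because it needs an image library. `gaussian_blur`, `maybe_perturb`, and the kernel weights are illustrative choices, not taken from the paper.

```python
import random

# 3x3 approximation of a Gaussian kernel; the weights sum to 16.
GAUSSIAN_3X3 = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def gaussian_blur(image):
    """Blur a grayscale image (list of rows), clamping at the borders."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), height - 1)
                    cc = min(max(c + dc, 0), width - 1)
                    total += GAUSSIAN_3X3[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = total / 16.0
    return out


def maybe_perturb(image, rng, probability=0.5):
    """Blur the image with the given probability.

    Returns (image, was_perturbed).
    """
    if rng.random() < probability:
        return gaussian_blur(image), True
    return image, False


rng = random.Random(0)  # seeded so that runs are repeatable
impulse = [[0.0, 0.0, 0.0], [0.0, 16.0, 0.0], [0.0, 0.0, 0.0]]
blurred = gaussian_blur(impulse)
sample, was_perturbed = maybe_perturb(impulse, rng)
```

Blurring spreads the single bright pixel over its neighbours, which is the kind of low-level change that can mask manipulation traces.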

HAMMER

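Overall design

Hierarchy

Deep

($\mathcal{L}_{TMG}$) grounds the manipulated text tokens

($\mathcal{L}_{MLC}$) detects the fine-grained manipulation type

($\mathcal{L}_{BIC}$) detects the binary class (real or fake)

Uses the deeper multi-modal information produced by the Multi-Modal Aggregator

Combined objective

$\mathcal{L} = \mathcal{L}_{MAC} + \mathcal{L}_{IMG} + \mathcal{L}_{MLC} + \mathcal{L}_{BIC} + \mathcal{L}_{TMG}$

Shallow

Bounding box grounding

($\mathcal{L}_{IMG}$)

Semantic alignment

($\mathcal{L}_{MAC}$) aligns the image and text embeddings semantically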

Components

grounding heads

Token Detector $D_t$

BBox Detector $D_v$

Multi-Label Classifier $C_m$

Binary Classifier $C_b$

dedicated manipulation detection

Multi-Modal Aggregator F

encoders

Image Encoder $E_v$

Text Encoder $E_t$

4.2. Deep Manipulation Reasoning

Fine-Grained Manipulation Type Detection and Binary Classification.

Manipulated Text Token Grounding.

4.1. Shallow Manipulation Reasoning

Manipulation-Aware Contrastive Learning.

Manipulated Image Bounding Box Grounding

encoder

Image encoder

Feed-forward network

Self-attention layer

Text encoder

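Converts the image into a sequence of embeddings that capture information from different parts of the image

[CLS] token

Computes the similarity between image and text

$v_{pat} = \{v_1, \dots, v_N\}$

$v_i$ is the feature representation of the $i$-th image patch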
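
As a rough illustration of how an image becomes a sequence of $N$ patch inputs, the sketch below cuts a small grayscale grid into non-overlapping square patches and flattens each one. Non-overlapping square patches are an assumption borrowed from ViT-style encoders. The function name is hypothetical, and the learned projection that would turn each flattened patch into an embedding $v_i$ is omitted.

```python
def split_into_patches(image, patch_size):
    """Split a 2-D grid into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy image with pixel values 0..15, cut into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
```

For an $H \times W$ image and patch size $P$ this gives $N = (H / P) \cdot (W / P)$ patches, so the toy image yields four.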

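The [CLS] token is typically used to capture a summary of the whole text sequence

$t_{tok} = \{t_1, \dots, t_M\}$ denotes the embeddings of the $M$ text tokens.

Image and text embeddings are aligned through cross-modal contrastive learning

Serves as the semantic information of the whole image

Push away not only mismatched pairs but also pairs that carry manipulation traces

Loss computation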

image-to-text contrastive loss

text-to-image contrastive loss

Manipulation-Aware Contrastive Loss
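
The two directional losses listed above can be illustrated with a generic symmetric InfoNCE sketch. This is the standard cross-modal contrastive loss, not the paper's exact formulation. The temperature `0.07`, the function names, and the toy embeddings are assumptions. In the manipulation-aware setting described in these notes, manipulated pairs would additionally be treated as negatives so that they are pushed apart as well.

```python
import math


def _normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0.0 else list(vector)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss, where queries[i] is the positive match of keys[i]."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, query in enumerate(q):
        logits = [_dot(query, key) / temperature for key in k]
        # Numerically stable log-sum-exp over all keys.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits)
        )
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image losses."""
    image_to_text = info_nce(image_embs, text_embs, temperature)
    text_to_image = info_nce(text_embs, image_embs, temperature)
    return 0.5 * (image_to_text + text_to_image)


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

When each image embedding points in the same direction as its paired text embedding the loss is close to zero, and it grows when the pairs are mismatched. That gap is the signal that pulls matched pairs together and pushes the others apart.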