Focal Loss

Loss function

dense object detectors

factor (1 − p_t)^γ

standard cross entropy criterion

more focus on hard, misclassified examples

easy background examples

object detectors

two-stage approach

sparse set of candidate object locations

one-stage detectors

regular, dense sampling of possible object locations

have the potential to be faster and simpler

have trailed the accuracy of two-stage detectors thus far

extreme foreground-background class imbalance

why?

extreme foreground-background class imbalance encountered during training of dense detectors is the central cause

Class imbalance

Address

reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples

down-weight well-classified examples

focus training on a sparse set of hard examples

prevents the vast number of easy negatives from overwhelming the detector during training
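
Sketch of the reshaped loss (not the paper's code): a single-example, binary version of cross entropy scaled by the factor (1 − p_t)^γ. The function names and the default gamma = 2.0 are assumptions chosen for illustration; p is the predicted probability of the foreground class and y is the 0/1 label.

```python
import math


def p_t(p, y):
    """Probability the model assigns to the true class (y is 0 or 1)."""
    return p if y == 1 else 1.0 - p


def cross_entropy(p, y):
    """Standard binary cross entropy for one example."""
    return -math.log(p_t(p, y))


def focal_loss(p, y, gamma=2.0):
    """Cross entropy scaled by (1 - p_t)^gamma.

    gamma=2.0 is an illustrative default (assumption);
    gamma=0 reduces to plain cross entropy.
    """
    pt = p_t(p, y)
    return -((1.0 - pt) ** gamma) * math.log(pt)


# Easy background example: label 0, foreground probability 0.01 (p_t = 0.99).
easy_ratio = focal_loss(0.01, 0) / cross_entropy(0.01, 0)

# Hard, misclassified foreground example: label 1, probability 0.1 (p_t = 0.1).
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)

print(f"easy example keeps {easy_ratio:.6f} of its cross-entropy loss")
print(f"hard example keeps {hard_ratio:.2f} of its cross-entropy loss")
```

With gamma = 2, the easy negative's loss is scaled by (0.01)^2 = 1e-4, while the hard example keeps 0.81 of its loss, so the sum over many easy negatives no longer dominates.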

Evaluate

To evaluate the effectiveness of our loss

RetinaNet

a simple dense detector

able to match the speed of previous one-stage detectors

while surpassing the accuracy of all existing state-of-the-art two-stage detectors

state-of-the-art

two-stage, proposal-driven mechanism

first stage generates a sparse set of candidate object locations

As popularized in the R-CNN framework,

second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network

Through a sequence of advances....papers

COCO benchmark

this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark

A question

Could a simple one-stage detector achieve similar accuracy?

One-stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios
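
Rough sketch of what "regular, dense sampling" means in practice: one candidate box per (grid location, scale, aspect ratio) combination. The image size, stride, scales, and ratios below are hypothetical values picked for illustration, not settings taken from the paper, and dense_candidates is an illustrative helper name.

```python
from itertools import product


def dense_candidates(image_w, image_h, stride, scales, aspect_ratios):
    """Enumerate (cx, cy, w, h) boxes on a regular grid.

    Every grid cell gets one box per (scale, ratio) pair.
    ratio is height / width; each box keeps area = scale ** 2.
    """
    boxes = []
    for cy in range(stride // 2, image_h, stride):
        for cx in range(stride // 2, image_w, stride):
            for scale, ratio in product(scales, aspect_ratios):
                w = scale / ratio ** 0.5
                h = scale * ratio ** 0.5
                boxes.append((cx, cy, w, h))
    return boxes


# Hypothetical settings: 640x480 image, one box centre every 16 pixels.
boxes = dense_candidates(
    640, 480, stride=16,
    scales=(32, 64, 128),
    aspect_ratios=(0.5, 1.0, 2.0),
)
print(len(boxes))
```

Even this small single-level grid yields 40 × 30 × 9 = 10,800 candidates, almost all of which cover background; this is where the easy negatives come from.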

Recent work

YOLO

SSD

promising results, yielding faster detectors with accuracy within 10-40% relative to state-of-the-art two-stage methods

This paper pushes the envelope further

present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) or Mask R-CNN variants of Faster R-CNN

R-CNN

Feature Pyramid Network (FPN)

Mask R-CNN

Faster R-CNN

To achieve this result, we identify class imbalance during training as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier

Address in R-CNN-like detectors

two-stage cascade & sampling heuristics

The proposal stage

DeepMask

RPN

.... rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples

Selective Search

EdgeBoxes

Second classification stage

sampling heuristics

fixed foreground-to-background ratio (1:3)

online hard example mining (OHEM)

are performed to maintain a manageable balance between foreground and background
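
Minimal sketch of the two sampling heuristics named above, under simplifying assumptions: labels are 0 for background and positive for foreground, and the "OHEM" helper only keeps the k highest-loss examples (the core idea, without the rest of the published procedure). Function names, batch size, and the synthetic proposal counts are hypothetical.

```python
import random


def sample_fixed_ratio(labels, batch_size, fg_fraction=0.25, seed=0):
    """Pick minibatch indices with a fixed foreground share.

    fg_fraction=0.25 corresponds to a 1:3 foreground-to-background ratio.
    If too few foreground examples exist, background fills the remainder.
    """
    rng = random.Random(seed)
    fg = [i for i, y in enumerate(labels) if y > 0]
    bg = [i for i, y in enumerate(labels) if y == 0]
    n_fg = min(len(fg), int(batch_size * fg_fraction))
    n_bg = min(len(bg), batch_size - n_fg)
    return rng.sample(fg, n_fg) + rng.sample(bg, n_bg)


def hardest_examples(losses, k):
    """Simplified hard example mining: indices of the k largest losses."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:k]


# Synthetic stand-in for ~2k proposals: 50 foreground, 1950 background.
labels = [1] * 50 + [0] * 1950
batch = sample_fixed_ratio(labels, batch_size=128)
n_fg = sum(1 for i in batch if labels[i] > 0)
print(n_fg, len(batch) - n_fg)

print(hardest_examples([0.1, 2.0, 0.5, 1.2], k=2))
```

Both heuristics act on which examples reach the loss; the focal loss instead keeps every example and changes how much each one contributes.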