Algorithms
for AI chip companies

Efficient Deep Learning
(EDL)

Learning Theory
to address the loss of accuracy

Hardware Software Co-design
(HSC)

balance between
computation and I/O operations

compress bitwidth of data

simplify network architectures

for distributed settings

AllReduce algorithm
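A minimal sketch of the ring variant of all-reduce (reduce-scatter followed by all-gather), simulated for N workers in plain Python; the chunking scheme and function names are illustrative, not any specific library's API:

```python
# Minimal sketch of ring all-reduce (sum) for N simulated workers.
# Each worker holds a vector; after all-reduce every worker holds the
# element-wise sum, with each worker sending O(vector size) data total.

def ring_allreduce(vectors):
    n = len(vectors)                     # number of workers
    length = len(vectors[0])
    # Split each worker's vector into n contiguous chunks.
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(v) for v in vectors]

    # Phase 1: reduce-scatter. At step s, worker r sends chunk (r - s)
    # around the ring; the receiver accumulates it into its own copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            lo, hi = bounds[c]
            sends.append(((r + 1) % n, c, data[r][lo:hi]))
        for dst, c, chunk in sends:      # apply all sends "simultaneously"
            lo, hi = bounds[c]
            for k, val in enumerate(chunk):
                data[dst][lo + k] += val

    # Phase 2: all-gather. Worker r now owns fully reduced chunk
    # (r + 1) mod n; circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            lo, hi = bounds[c]
            sends.append(((r + 1) % n, c, data[r][lo:hi]))
        for dst, c, chunk in sends:
            lo, hi = bounds[c]
            data[dst][lo:hi] = chunk
    return data
```

The ring topology is what makes this bandwidth-optimal: every link carries the same amount of traffic regardless of worker count.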

tricks for large-batch training,
such as modified optimisers, regularisers, normalisation, etc.

why

how

Existing hardware is not well suited to the models

Existing models utilise hardware unevenly

Specific hardware-oriented model design

Solutions from hardware-side

General solutions from software-side

variety of other optimisers, such as evolutionary methods, swarm methods, interior-point methods, etc.

gradient-based optimisers
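A minimal sketch of the workhorse gradient-based optimiser, SGD with momentum, minimising a toy quadratic; the learning rate and momentum values are illustrative:

```python
# Minimal sketch of SGD with momentum, minimising f(x) = (x - 3)^2.
# The velocity term accumulates an exponential average of past
# gradients, damping oscillation along steep directions.

def sgd_momentum(grad_fn, x0, lr=0.1, momentum=0.9, steps=200):
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad_fn(x)
        v = momentum * v - lr * g   # update velocity from the gradient
        x = x + v                   # take the momentum-smoothed step
    return x

# Gradient of (x - 3)^2 is 2(x - 3); the minimiser is x = 3.
x_star = sgd_momentum(lambda x: 2.0 * (x - 3.0), x0=0.0)
```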

Comparison between Training and Inference

Bandwidth between Ext-mem and host-mem

Bandwidth between Ext-mem and L2-mem

Host-mem Memory Requirements

L2 Memory Requirements

Batch Size

Sensitivity to Precision Loss

knowledge distillation
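A minimal sketch of the core of knowledge distillation: the student is trained to match the teacher's temperature-softened class distribution via KL divergence. The logits and temperature below are illustrative:

```python
import math

def softmax(logits, T=1.0):
    # Temperature T > 1 softens the distribution, exposing the
    # teacher's "dark knowledge" about relative class similarity.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)
    # KL(p || q), scaled by T^2 so gradient magnitudes stay comparable
    # across temperatures (as in Hinton et al.)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In practice this term is combined with the ordinary cross-entropy against the hard labels, weighted by a mixing coefficient.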

compact network design

model compression

network pruning
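A minimal sketch of the simplest pruning criterion, magnitude-based weight pruning: drop the fraction of weights with smallest absolute value. The target sparsity is illustrative:

```python
# Minimal sketch of magnitude-based weight pruning: zero out the
# fraction `sparsity` of weights with smallest absolute value.
# Ties at the threshold may prune slightly more than the target.

def magnitude_prune(weights, sparsity=0.5):
    k = int(len(weights) * sparsity)   # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Unstructured pruning like this saves storage, but hardware usually only benefits from structured variants (whole channels or blocks) that preserve dense compute patterns.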

operation fusions
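One common fusion, sketched minimally below, is folding batch-norm parameters into the preceding linear/convolution layer at inference time, so the fused layer computes the same output with a single multiply-add (shown here for a scalar weight; parameter values are illustrative):

```python
import math

# Fold batch-norm into a preceding linear layer:
#   y = gamma * (w*x + b - mu) / sqrt(var + eps) + beta
#     = w' * x + b'
# with the folded parameters returned below.

def fold_bn(w, b, gamma, beta, mu, var, eps=1e-5):
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mu) * scale + beta
```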

Winograd method
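A minimal sketch of the smallest Winograd instance, F(2,3): two outputs of a 3-tap 1-D convolution from four inputs using four multiplications instead of six, trading multiplies for cheap additions:

```python
# Winograd F(2,3): compute y0 = d0*g0 + d1*g1 + d2*g2 and
# y1 = d1*g0 + d2*g1 + d3*g2 with 4 multiplications instead of 6.

def winograd_f23(d, g):
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # The filter-side transforms (g0+g1+g2)/2 etc. can be precomputed
    # once per filter, so only the four products below remain per tile.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The 2-D version applied to 3x3 convolutions (F(2x2, 3x3)) reduces the multiply count per output tile from 36 to 16, which is why it is attractive on multiply-limited hardware.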

asynchronous BP

polyhedral optimisation

BP re-forwarding

model quantisation
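A minimal sketch of asymmetric affine quantisation to uint8, where real ≈ scale * (q - zero_point); the ranges below are illustrative:

```python
# Asymmetric uint8 affine quantisation: real ≈ scale * (q - zero_point).

def quant_params(xmin, xmax, qmin=0, qmax=255):
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)   # range must contain 0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)       # so 0.0 maps exactly
    return scale, zero_point

def quantise(x, scale, zp, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale + zp)))

def dequantise(q, scale, zp):
    return scale * (q - zp)
```

Pinning an exact representation for 0.0 matters because zero-padding and ReLU outputs are so common; any rounding error there would bias every padded convolution.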

binary/ternary nets

mixed precision training
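A minimal sketch of why mixed-precision training needs loss scaling: small gradients underflow to zero in fp16, but survive if the loss (and hence every gradient) is multiplied by a scale factor before the backward pass and divided out afterwards. Python's struct 'e' format round-trips a value through IEEE fp16; the gradient value and scale factor are illustrative:

```python
import struct

def to_fp16(x):
    # Round-trip a float through IEEE half precision.
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-9                              # below fp16's representable range
naive = to_fp16(grad)                    # underflows to 0.0
scale = 65536.0
scaled = to_fp16(grad * scale) / scale   # survives: ~1e-9 after unscaling
```

In full mixed-precision schemes the master weights and the unscaling step stay in fp32, and the scale factor is adjusted dynamically when overflows are detected.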