Algorithms
for an AI-chip company
Efficient Deep Learning
(EDL)
Learning Theory
to address the loss of accuracy
Hardware Software Co-design
(HSC)
balance between
computations and I/O operations
compress bitwidth of data
simplify network architectures
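Balancing computation against I/O comes down to arithmetic intensity (FLOPs per byte moved) versus the chip's balance point. A minimal sketch under a simplifying assumption: a naive matmul where each operand and the result cross the memory interface exactly once; the peak-FLOPs and bandwidth numbers in the usage note are hypothetical.

```python
def arithmetic_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, assuming each
    operand and the result cross the memory interface exactly once."""
    flops = 2 * m * n * k                        # one multiply + one add per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

def bound_by(intensity, peak_flops, peak_bandwidth):
    """Roofline rule of thumb: compute-bound iff arithmetic intensity
    exceeds the balance point peak_flops / peak_bandwidth."""
    return "compute" if intensity >= peak_flops / peak_bandwidth else "memory"
```

For example, a 1024-cubed fp32 matmul has intensity of roughly 170 FLOPs/byte, so on a hypothetical 100 TFLOP/s, 1 TB/s chip (balance point 100) it is compute-bound, while small or skinny matmuls land on the memory-bound side.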
for distributed setting
AllReduce algorithm
tricks for large batch training,
such as modified optimisers, regularisers, normalisation, etc.
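The AllReduce step above can be sketched as a ring all-reduce (reduce-scatter followed by all-gather), simulated here in-process with plain lists standing in for workers; real implementations (e.g. NCCL, MPI) run the transfers concurrently and overlap them with backprop.

```python
def ring_allreduce(grads):
    """Sum identical-length gradient lists across n 'workers' using the
    ring algorithm: reduce-scatter, then all-gather. Each step moves
    only 1/n of the data per worker."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "pad gradients to a multiple of n"
    chunk = size // n
    bufs = [list(g) for g in grads]
    seg = lambda c: range(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: in step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which accumulates it.
    for s in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i - s) % n
            msgs.append(((i + 1) % n, c, [bufs[i][k] for k in seg(c)]))
        for dst, c, data in msgs:
            for j, k in enumerate(seg(c)):
                bufs[dst][k] += data[j]

    # All-gather: worker i now owns the fully reduced chunk (i + 1) mod n
    # and passes completed chunks around the ring.
    for s in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i + 1 - s) % n
            msgs.append(((i + 1) % n, c, [bufs[i][k] for k in seg(c)]))
        for dst, c, data in msgs:
            for j, k in enumerate(seg(c)):
                bufs[dst][k] = data[j]
    return bufs
```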
why
how
Existing hardware is not well suited to models
Existing models have imbalanced utilisation of hardware
Specific hardware-oriented model design
Solutions from hardware-side
General solutions from software-side
a variety of other optimisers, such as evolutionary methods, swarm methods, interior-point methods, etc.
gradient-based optimisers
Comparison between Training and Inference
Bandwidth between Ext-mem and host-mem
Bandwidth between Ext-mem and L2-mem
Host memory requirements
L2 memory requirements
Batch size
Sensitivity to Precision Loss
knowledge distillation,
compact network design
model compression
network pruning
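Of the compression techniques above, network pruning has the simplest core. A minimal sketch of unstructured magnitude pruning, assuming a flat list of weights:

```python
def magnitude_prune(weights, sparsity):
    """Unstructured magnitude pruning: zero out the `sparsity`
    fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:          # indices of the smallest-|w| entries
        pruned[i] = 0.0
    return pruned
```

In practice this is usually applied per layer and interleaved with fine-tuning, gradually raising the sparsity; the one-shot version here just illustrates the selection rule.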
operation fusions
Winograd method
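A minimal instance of the Winograd method, F(2,3): two outputs of a 1-D size-3 convolution computed with 4 multiplications instead of 6. The same transforms, applied to tiles, underlie the 2-D convolution speedups used on hardware.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): compute [d0*g0 + d1*g1 + d2*g2,
    d1*g0 + d2*g1 + d3*g2] with 4 multiplies instead of 6."""
    d0, d1, d2, d3 = d                 # 4 input samples
    g0, g1, g2 = g                     # 3 filter taps
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The filter-side factors (g0 + g1 + g2)/2 etc. depend only on the weights, so they are precomputed once per filter; the saving then comes from the cheap add/subtract input transform.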
asynchronous BP
polyhedral optimisation
BP re-forwarding
model quantisation
binary/ternary nets
mixed precision training
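A minimal sketch of the quantisation idea behind the three items above: symmetric per-tensor int8 quantisation, with the scale chosen from the tensor's maximum absolute value. The clipping and rounding choices here are one common scheme among several.

```python
def quantise_int8(xs):
    """Symmetric per-tensor int8 quantisation: scale so the
    largest |x| maps to 127, then round and clip."""
    scale = (max(abs(x) for x in xs) or 1.0) / 127.0   # avoid /0 on all-zero input
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats from int8 codes."""
    return [v * scale for v in q]
```

Binary/ternary nets push the same idea to 1-2 bits with per-channel scales, and mixed precision training keeps an fp32 master copy of the weights while running forward/backward in the low-precision format.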