DenseRAN for Offline Handwritten Chinese Character Recognition
Introduction
Handwritten Chinese character recognition
Challenges
Confusion between similar characters
Distinct handwriting styles across individuals
A large number of character classes
Type of data acquisition
online
offline
Input images
Gray-scale images
Traditional methods
Image normalization
Feature extraction
Dimensionality reduction
Classifier training
CNN
The first CNN applied to Chinese character recognition
multi-column deep neural network (MCDNN)
Limitations
Can only recognize Chinese characters that appeared in the training set; unable to recognize unseen characters
Treats each Chinese character as a whole, without considering the similarities and sub-structures among characters
Characteristics of Chinese characters
Mainly logographic
Composition
Basic radicals
Spatial structures
Radical-based Chinese character recognition
Related work
[10]: First detected radicals separately, then composed them into a character using a hierarchical radical matching method
[11]: Tried to over-segment characters into candidate radicals; the proposed method could only handle the left-right structure, and over-segmentation introduced many difficulties
[12]: A multi-label learning approach for radical-based Chinese character recognition; it turned a character class into a combination of several radicals and spatial structures
In general, these methods have difficulty segmenting characters into radicals and lack flexibility in analyzing the structures among radicals. More importantly, they usually cannot handle unseen Chinese character classes
This work
Each leaf node of the tree (in the third step) represents a radical, and each non-leaf node represents its internal structure
Name
Radical analysis network with densely connected architecture (DenseRAN)
Main idea
A handwritten Chinese character is successfully recognized when its caption matches the ground truth
Decompose a Chinese character into a caption that describes its internal radicals and the structures among them
Difference from prior work
Radical segmentation and structure detection are learned automatically by an attention-based encoder-decoder model
Reference: D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate" (https://arxiv.org/pdf/1409.0473.pdf)
Limitation
Based on the analysis of radicals and structures, DenseRAN can recognize unseen Chinese character classes only if their radicals have been seen in the training set
Overall architecture
Diagram
Steps
The raw input data are gray-scale images
First, a densely connected convolutional network (DenseNet) encodes the input image into high-level visual vectors
Reference: Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, "Densely Connected Convolutional Networks" (DenseNet)
Then an RNN with gated recurrent units (GRU) decodes the high-level representations into the output caption step by step
Reference: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling" (GRU)
A coverage-based spatial attention model built into the decoder detects the radicals and internal two-dimensional structures simultaneously
Reference: J. Zhang, J. Du, S. Zhang, D. Liu, Y. Hu, J. Hu, S. Wei, and L. Dai, "Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition" (attention)
Reference: J. Zhang, J. Du, and L. Dai, "Multi-scale attention with dense encoder for handwritten mathematical expression recognition" (attention)
Attention in Deep Learning
base
Chinese character decomposition
Each Chinese character can be naturally decomposed into a caption of radicals and spatial structures
Decomposition rules: A. Madlon-Kay, "cjk-decomp," https://github.com/amake/cjk-decomp
The character caption
Radicals
Spatial structures
single: the character is itself a radical
a: left-right structure
d: top-bottom structure
s: surround structure
sb: bottom-surround structure
sl: left-surround structure
st: top-surround structure
sbl: bottom-left-surround structure
stl: top-left-surround structure
str: top-right-surround structure
w: within structure
lock: lock structure
r: one radical repeated several times
Diagram
A pair of braces
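As an illustration of the caption format, the sketch below composes captions from a small decomposition table. The table and the `caption` helper are illustrative assumptions only; they do not reproduce the actual cjk-decomp data or identifiers.

```python
# Sketch: composing a DenseRAN-style caption from a hypothetical
# radical decomposition table. Structure codes follow the list above
# ('a' = left-right, 'd' = top-bottom, ...); the pair of braces
# delimits the radicals governed by each structure.
decomp = {
    "好": ("a", ["女", "子"]),   # left-right structure (illustrative entry)
    "李": ("d", ["木", "子"]),   # top-bottom structure (illustrative entry)
}

def caption(char):
    """Recursively expand a character into structure codes, braces, and radicals."""
    if char not in decomp:           # leaf node: the character is a radical
        return char
    struct, parts = decomp[char]
    return struct + "{" + "".join(caption(p) for p in parts) + "}"

print(caption("好"))   # a{女子}
print(caption("木"))   # 木 (already a radical)
```

A nested character would expand recursively, e.g. a top-bottom character whose lower part is itself left-right would yield a caption like `d{…a{……}}`.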
DenseRAN architecture in detail
Dense encoder
Basis
DenseNet
Reason for use
Has been proven to be a good feature extractor for various computer vision tasks (DenseNet)
Components
dense block
Characteristic
Each layer is directly connected to all subsequent layers
Formula
Let H_l(·) denote the convolution function of the l-th layer in a dense block; then the output of the l-th layer is x_l = H_l([x_0, x_1, ..., x_{l-1}])
[x_0, x_1, ..., x_{l-1}] denotes the concatenation of the output feature maps produced by layers 0, 1, ..., l-1 in the same block
DenseNet uses concatenation
ResNet uses addition
Growth rate (k)
Each H_l(·) produces k feature maps
To further improve computational efficiency, bottleneck layers are used in each dense block
A network with only bottleneck layers added is also called DenseNet-B
A 1x1 convolution is introduced before each 3x3 convolution to reduce the number of input feature maps to 4k
Reference
Dense blocks are typically separated by transition layers and pooling layers
For each composition layer, pre-activation batch normalization (BN) and ReLU are applied before a 3x3 convolution producing k output feature maps (BN-ReLU-Conv)
transition layer
A 1x1 convolutional layer with compression rate θ = 0.5
A network with only further compression added is also called DenseNet-C
0 < θ ≤ 1
If the number of input feature maps to the transition layer (the output feature maps of the preceding dense block) is n, the transition layer generates θn output feature maps
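The channel bookkeeping above (growth rate k, 4k bottleneck outputs, θn transition outputs) reduces to simple arithmetic. In this sketch the starting channel count and the values of k, D, θ are assumed examples, not the paper's exact configuration:

```python
# Sketch of DenseNet channel arithmetic (illustrative values):
# each of the D layers in a dense block appends k feature maps;
# each bottleneck 1x1 conv first reduces its concatenated input
# to 4k channels; the transition layer then compresses the block
# output by the factor theta.
k, D, theta = 24, 16, 0.5
channels = 64                      # channels entering the block (assumed)
bottleneck_out = 4 * k             # every 1x1 bottleneck outputs 4k maps

for _ in range(D):
    channels += k                  # each 3x3 conv appends k new maps
print(channels)                    # 64 + 16*24 = 448

channels = int(theta * channels)   # transition layer: theta * n maps
print(channels)                    # 224
```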
Input
224x224x3
The first convolutional layer has 64 kernels of size 7x7 with stride 2, applied to the input images, followed by a 2x2 max-pooling layer
Reference: J. Zhang, J. Du, and L. Dai, "Multi-scale attention with dense encoder for handwritten mathematical expression recognition" (Dense encoder)
Modifications
Instead of extracting features after a fully connected layer, the fully connected and softmax layers are discarded from the encoder, making it a fully convolutional network
Reason
This allows the decoder to selectively attend to certain parts of an image by choosing specific portions of the extracted visual features
The number of layers in each dense block is set to D = 16; each dense block contains D 1x1 convolutional layers, each followed by a 3x3 convolutional layer
Input
32x32x3
The pooling layers between dense blocks are discarded
Reason
With the input changed to 32x32, after so many pooling operations the final feature map would be about 2x2, too small to obtain good attention results
Reference
Batch normalization and ReLU are applied consecutively after each convolutional layer (Conv-BN-ReLU)
Formula
The encoder output can be represented as a three-dimensional array of size H x W x D
D: the number of channels of the output feature maps
Each element in the array is a D-dimensional vector corresponding to a local region of the image
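To make the "array of D-dimensional vectors" concrete, a minimal numpy sketch can flatten an H x W x D encoder output into L = H*W annotation vectors; the shapes below are assumed, illustrative values, not the paper's configuration:

```python
import numpy as np

# Sketch: the encoder output (H, W, D) viewed as L = H*W annotation
# vectors a_1 ... a_L, each a D-dimensional descriptor of one local
# image region. Shapes are illustrative placeholders.
H, W, D = 8, 8, 684
features = np.random.randn(H, W, D)    # stand-in for the encoder output

annotations = features.reshape(H * W, D)
print(annotations.shape)               # (64, 684)
```

The decoder's attention model then scores these L annotation vectors at every time step.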
GRU decoder with attention model
Formula
The decoder generates the caption of the input Chinese character. The output caption Y is represented by a sequence of one-of-K encoded symbols
K: the total number of symbols in the vocabulary, including basic radicals, spatial structures, and the pair of braces
C: the length of the caption
The coverage-based spatial attention model f_att is parameterized as a multi-layer perceptron
F: coverage vector, computed from the sum of past attention probabilities: F = Q * Σ_{l=1}^{t-1} α_l
Q: the filter producing the coverage features f_i
α_ti: the spatial attention coefficient of annotation a_i at time t: e_ti = ν_att^T tanh(W_att ŝ_t + U_att a_i + U_f f_i), α_ti = exp(e_ti) / Σ_k exp(e_tk)
With n' the attention dimension and q the number of feature maps of filter Q: ν_att ∈ R^{n'}, W_att ∈ R^{n'×n}, U_att ∈ R^{n'×D}, U_f ∈ R^{n'×q}
With weights α_ti, the context vector is computed as c_t = Σ_{i=1}^{L} α_ti a_i
The probability of each predicted word is computed from the context vector c_t, the current GRU hidden state s_t, and the previous word y_{t-1}: p(y_t | y_{t-1}, X) = softmax(W_o(E y_{t-1} + W_s s_t + W_c c_t))
E: embedding matrix
m and n: the dimensions of the embedding and the GRU decoder
GRU parser
The GRU parser employs two unidirectional GRU layers to compute the hidden state s_t: ŝ_t = GRU(y_{t-1}, s_{t-1}); c_t = f_att(ŝ_t, A); s_t = GRU(c_t, ŝ_t)
s_{t-1}: the hidden state at time t-1
ŝ_t: the GRU hidden-state prediction at time t
Because the length L of the annotation sequence is fixed while the length C of the caption is variable
DenseRAN addresses this problem by computing an intermediate fixed-size vector c_t at each time step
c_t is a dynamic representation of the relevant part of the Chinese character image at time t
A unidirectional GRU and the context vector c_t are used to generate the caption step by step
Reference: Yoshua Bengio, Patrice Simard, and Paolo Frasconi, "Learning long-term dependencies with gradient descent is difficult" (unidirectional GRU)
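One attention step of the form described above can be sketched in numpy. All dimensions, weight matrices, and inputs below are assumed random placeholders, not the paper's trained parameters:

```python
import numpy as np

# Sketch of one coverage-based attention step:
#   e_ti  = v_att^T tanh(W_att s_hat + U_att a_i + U_f f_i)
#   alpha = softmax(e_t) over the L annotation positions
#   c_t   = sum_i alpha_ti * a_i
# All values are illustrative placeholders.
L, D, n, n_prime, q = 64, 684, 256, 512, 128

rng = np.random.default_rng(0)
A = rng.standard_normal((L, D))        # annotation vectors a_i
F = rng.standard_normal((L, q))        # coverage features f_i
s_hat = rng.standard_normal(n)         # hidden-state prediction s_hat_t

v_att = rng.standard_normal(n_prime)
W_att = rng.standard_normal((n_prime, n))
U_att = rng.standard_normal((n_prime, D))
U_f = rng.standard_normal((n_prime, q))

# Energy for every position i, then softmax, then weighted sum.
e = np.tanh(W_att @ s_hat + A @ U_att.T + F @ U_f.T) @ v_att   # shape (L,)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                   # attention coefficients alpha_ti
c_t = alpha @ A                        # context vector, shape (D,)

print(c_t.shape)                       # (684,)
```

The coverage features F let the model discount regions it has already attended to, so a radical is not parsed twice.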