Tesseract, Training data generation, Always leverage any priori knowledge…
Tesseract
psm 4 -> assume a single column of text of variable sizes; text is grouped row-wise (useful when OCRing a receipt)
psm 5 -> assume a single uniform block of vertically aligned text (like psm 4, but for rotated images)
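A minimal sketch of how a page segmentation mode could be passed to Tesseract through pytesseract. It assumes the pytesseract package and the Tesseract binary are installed; the helper names `psm_config` and `ocr_with_psm` are hypothetical and only used for illustration.

```python
def psm_config(psm):
    """Build the Tesseract option string for a page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return f"--psm {psm}"


def ocr_with_psm(image, psm):
    """Run OCR with the given psm; needs pytesseract and the Tesseract binary."""
    import pytesseract  # imported lazily so psm_config works without it
    return pytesseract.image_to_string(image, config=psm_config(psm))


receipt_options = psm_config(4)   # single column of variable-size text
vertical_options = psm_config(5)  # single block of vertically aligned text
```

`ocr_with_psm(image, 4)` would then be a reasonable first attempt on a receipt image, with other modes tried if the output looks wrong.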
Multi-Column Table OCR
grouping can be done by Hierarchical Agglomerative Clustering (HAC)
- start with each data point as its own cluster
- merge clusters whose distance is < T (T is a predefined threshold)
- repeat until no new cluster can be formed
AgglomerativeClustering in the scikit-learn library
if no distance threshold is specified, HAC keeps merging until everything ends up in a single cluster (a single column)
the larger the distance between text in cells, the larger the distance threshold should be
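The HAC steps above can be sketched in plain Python on the x-coordinates of detected text boxes. This is a naive single-linkage version written for illustration, not scikit-learn's implementation; the function name `cluster_columns` and the sample coordinates are made up. With scikit-learn, the equivalent idea would be `AgglomerativeClustering(n_clusters=None, distance_threshold=T)`.

```python
def cluster_columns(x_coords, threshold):
    """Naive single-linkage HAC over 1-D x-coordinates."""
    # every observation starts as its own cluster
    clusters = [[x] for x in x_coords]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the two nearest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break  # no pair is closer than T, so no new cluster can be formed
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return sorted((sorted(c) for c in clusters), key=lambda c: c[0])


# hypothetical left edges of text boxes from a three-column table
xs = [12, 15, 210, 214, 11, 405, 208, 410]
columns = cluster_columns(xs, threshold=25)
everything = cluster_columns(xs, threshold=float("inf"))
```

With a threshold of 25 the sample points fall into three columns; with an unbounded threshold they all collapse into one cluster, which is the behaviour the note above warns about.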
tesseract is not an off-the-shelf solution; we need to apply feature extraction using machine learning and deep learning techniques
pytesseract.image_to_data(image, output_type=Output.DICT, config=options)
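A small sketch of how the dictionary returned by image_to_data might be consumed. The keys used here (`text`, `conf`, `left`, `top`, `width`, `height`) are the ones pytesseract returns with `Output.DICT`; the helper name `confident_words`, the 50 cut-off, and the sample values are assumptions, and the sample dict is a hand-made stand-in so the snippet runs without Tesseract.

```python
def confident_words(results, min_conf=50):
    """Yield (text, (x, y, w, h)) for non-empty words at or above min_conf."""
    for i, text in enumerate(results["text"]):
        conf = float(results["conf"][i])  # conf may arrive as str, int, or float
        if text.strip() and conf >= min_conf:
            box = (results["left"][i], results["top"][i],
                   results["width"][i], results["height"][i])
            yield text, box


# hand-made stand-in for the dict returned by image_to_data with Output.DICT
sample = {
    "text": ["", "Total", "12.50", "~"],
    "conf": ["-1", "96", "91", "12"],
    "left": [0, 10, 120, 200],
    "top": [0, 40, 40, 40],
    "width": [300, 60, 55, 8],
    "height": [80, 18, 18, 18],
}
words = list(confident_words(sample))
```

Entries with empty text and a confidence of -1 describe layout blocks rather than words, which is why they are skipped.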
blacklist: list of characters that must under no circumstances be included in the output, e.g. --blacklist "*#"
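One way a blacklist could be handed to Tesseract is through its `tessedit_char_blacklist` config variable. The sketch below only builds the option string; the helper name `blacklist_config` is hypothetical, and characters such as quotes or spaces would need extra escaping that is not handled here.

```python
def blacklist_config(chars, psm=None):
    """Build a Tesseract option string that blacklists the given characters."""
    options = f"-c tessedit_char_blacklist={chars}"
    if psm is not None:
        options += f" --psm {psm}"
    return options


options = blacklist_config("*#", psm=4)
```

The resulting string can be passed as the `config` argument of pytesseract calls such as `image_to_string` or `image_to_data`.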
by default, tesseract expects a page full of text
whenever inappropriate results are obtained, try adjusting the psm
tesseract works best when there is very clean segmentation of text from the background (so we train domain specific classifiers and detectors)
by adding padding to ROI, we can expand bounding box co-ordinates and improve recognition.
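The padding idea can be sketched as a small helper that grows a bounding box by a fraction of its size and clips it to the image. The name `pad_box`, the 5% default, and the sample numbers are assumptions for illustration; the right padding amount depends on the images.

```python
def pad_box(x, y, w, h, image_w, image_h, pad_frac=0.05):
    """Expand (x, y, w, h) by pad_frac of its size on every side.

    Returns corner coordinates (x0, y0, x1, y1) clipped to the image bounds.
    """
    dx = int(w * pad_frac)
    dy = int(h * pad_frac)
    x0 = max(0, x - dx)
    y0 = max(0, y - dy)
    x1 = min(image_w, x + w + dx)
    y1 = min(image_h, y + h + dy)
    return x0, y0, x1, y1


padded = pad_box(100, 50, 200, 40, image_w=640, image_h=480, pad_frac=0.1)
near_edge = pad_box(5, 2, 100, 100, image_w=640, image_h=480, pad_frac=0.1)
```

With an image held as a NumPy array, the padded ROI would be `image[y0:y1, x0:x1]`, which is then passed to Tesseract instead of the tight crop.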
Training data generation
different combinations of scanned images (si) and computer generated images (cgi) to prepare the training data