Combining Multiple Segmentations in Scene Text Recognition

Lukáš Neumann and Jiří Matas
Centre for Machine Perception, Department of Cybernetics
Czech Technical University, Prague
1. End-to-End Scene Text Recognition Problem Introduction
2. The TextSpotter System
3. Character Detection as Extremal Region (ERs) Selection
4. Line formation & Character Recognition
5. Character Ordering
6. Optimal Sequence Selection
7. Experiments
Neumann, Matas, ICDAR 2013
Input: Digital image (BMP, JPG, PNG) / video (AVI)
Output: Set of words in the image
word = (horizontal) rectangular bounding box, text content
Lexicon-free method
Example: Bounding Box=[240;1428;391;1770], Content="TESCO"
1. Multi-scale Character Detection [1] with Gaussian Pyramid (new)
2. Text Line Formation [2]
3. Character Recognition [3]
4. Optimal Sequence Selection (new)
[1] L. Neumann, J. Matas, “Real-time scene text localization and recognition”, CVPR 2012
[2] L. Neumann, J. Matas, “Text localization in real-world images using efficiently pruned exhaustive search”, ICDAR 2011
[3] L. Neumann, J. Matas, “A method for text localization and recognition in real-world images”, ACCV 2010
(Figure: Input image (PNG, JPEG, BMP) → 1D projection <0;255> (grey scale, hue, …) → Extremal regions with threshold θ (θ = 50, 100, 150, 200))





Let image I be a mapping I: Z² → S
Let S be a totally ordered set, e.g. <0, 255>
Let A be an adjacency relation (e.g. 4-neighbourhood)
Region Q is a contiguous subset w.r.t. A
(Outer) Region Boundary ∂Q is the set of pixels adjacent to but not belonging to Q
Extremal Region is a region for which there exists a threshold θ that separates the region and its boundary:
$\exists \theta : \forall p \in Q, \forall q \in \partial Q : I(p) < \theta \le I(q)$
(Figure: example region with θ = 32)
Assuming a character is an ER, 3 parameters still have to be determined:
1. Threshold
2. Mapping to a totally ordered set (colour space projection)
3. Adjacency relation
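The extremal region condition above can be checked directly on a small image. The following is a minimal sketch, assuming a greyscale NumPy array as the mapping I, a boolean mask as the region Q, and the 4-neighbourhood as the adjacency relation. The function names are hypothetical and not taken from the TextSpotter system, and the sketch does not verify that Q is contiguous.

```python
import numpy as np

def outer_boundary(mask):
    """Pixels that are 4-adjacent to the region but do not belong to it."""
    padded = np.pad(mask, 1, constant_values=False)
    neighbour_in_region = (
        padded[:-2, 1:-1] | padded[2:, 1:-1] |   # pixel above / below
        padded[1:-1, :-2] | padded[1:-1, 2:]     # pixel left / right
    )
    return neighbour_in_region & ~mask

def separating_threshold(image, mask):
    """Return some theta with I(p) < theta <= I(q) for all p in Q and
    q in the outer boundary of Q, or None if no such threshold exists."""
    boundary = outer_boundary(mask)
    if not mask.any() or not boundary.any():
        return None
    max_inside = int(image[mask].max())
    min_boundary = int(image[boundary].min())
    return min_boundary if max_inside < min_boundary else None

# Dark 3x3 blob on a bright background: any theta in (10, 200] separates it.
image = np.full((5, 5), 200, dtype=np.uint8)
image[1:4, 1:4] = 10
print(separating_threshold(image, image < 50))
```

Only the largest valid threshold is returned here; every value between the maximum inside the region (exclusive) and the minimum on the boundary (inclusive) satisfies the same condition.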



Character boundaries are often fuzzy
It is very difficult to determine the threshold value locally; the typical document processing pipeline (image → binarization → OCR) leads to inferior results
Thresholds that most probably correspond to a character segmentation are selected using a CSER classifier [1]; multiple hypotheses are generated for each character
[1] L. Neumann and J. Matas, “Real-time scene text localization and
recognition”, CVPR 2012



p(r|character) estimated at each threshold for each region
Only regions corresponding to local maxima selected by the detector
Incrementally computed descriptors used for classification [1]
◦ Aspect ratio
◦ Compactness
◦ Number of holes
◦ Horizontal crossings
Trained AdaBoost classifier with decision trees calibrated to output probabilities
Linear complexity, real-time performance (300ms on an 800x600px image)
[1] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012
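The selection of local maxima can be illustrated on a single region tracked across increasing thresholds. This is a minimal sketch, assuming the classifier output p(r|character) is already available as a list of values, one per threshold. The function name, the `p_min` cut-off and the sample probabilities are hypothetical and chosen only for illustration.

```python
def local_maxima(thresholds, probs, p_min=0.5):
    """Return the thresholds at which probs has a strict local maximum
    that is also at least p_min (plateaus are ignored in this sketch)."""
    selected = []
    last = len(probs) - 1
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("-inf")
        right = probs[i + 1] if i < last else float("-inf")
        if p >= p_min and p > left and p > right:
            selected.append(thresholds[i])
    return selected

# One region followed over seven thresholds (made-up probabilities).
thresholds = [30, 60, 90, 120, 150, 180, 210]
probs = [0.1, 0.4, 0.9, 0.6, 0.2, 0.7, 0.3]
print(local_maxima(thresholds, probs))
```

Keeping every local maximum rather than only the global one is what yields multiple segmentation hypotheses for the same character.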




Color space projection maps a color image into a totally ordered set
Trade-off between recall and speed (although can be easily parallelized)
Standard channels (R, G, B, H, S, I) of RGB / HSI color space
85.6% characters detected in the Intensity channel, combining all channels increases the recall to 94.8%
(Figure: Source Image; Intensity Channel, where no threshold exists for the letter “A”; Red Channel)
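The six projections can be computed from an RGB image as follows. This is a minimal sketch, assuming the common textbook RGB-to-HSI conversion and a rescaling of every channel to <0, 255>. The exact conversion used by the authors is not stated on the slides, and the function name is hypothetical.

```python
import numpy as np

def channel_projections(rgb):
    """Map an HxWx3 uint8 RGB image to six single-channel uint8 images."""
    f = rgb.astype(np.float64) / 255.0
    r, g, b = f[..., 0], f[..., 1], f[..., 2]

    intensity = (r + g + b) / 3.0
    minimum = np.minimum(np.minimum(r, g), b)
    # Saturation is defined as 0 for black pixels (intensity == 0).
    saturation = np.where(
        intensity > 0, 1.0 - minimum / np.maximum(intensity, 1e-12), 0.0)

    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt(np.maximum((r - g) ** 2 + (r - b) * (g - b), 0.0))
    angle = np.arccos(np.clip(num / np.maximum(den, 1e-12), -1.0, 1.0))
    hue = np.where(b > g, 2.0 * np.pi - angle, angle) / (2.0 * np.pi)
    hue = np.where(den > 0, hue, 0.0)  # hue is undefined for grey pixels

    def to_u8(x):
        return np.round(x * 255.0).astype(np.uint8)

    return {"R": rgb[..., 0], "G": rgb[..., 1], "B": rgb[..., 2],
            "H": to_u8(hue), "S": to_u8(saturation), "I": to_u8(intensity)}

# One red, one grey and one blue pixel.
pixels = np.array([[[255, 0, 0], [128, 128, 128], [0, 0, 255]]], dtype=np.uint8)
for name, channel in channel_projections(pixels).items():
    print(name, channel.tolist())
```

Each returned channel is a separate totally ordered projection, so the extremal region detector can be run on each one independently, which is also why the step parallelizes easily.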



Pre-processing with a Gaussian pyramid alters the adjacency relation
At each level of the pyramid only a certain interval of character stroke widths is amplified
Not a major overhead: each level is 4 times faster to process than the previous one, so total processing takes about 4/3 of the first level (1 + 1/4 + 1/4² + … = 4/3)
(Figure: characters formed of multiple small regions; multiple characters joined together)
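The 4/3 bound follows from the geometric series 1 + 1/4 + 1/16 + …, because every level has a quarter of the pixels of the previous one. The sketch below builds a pyramid and measures the total pixel count relative to the first level. It assumes a 5-tap binomial kernel as an approximation of the Gaussian filter and a hypothetical `min_size` stopping rule; neither is taken from the slides.

```python
import numpy as np

# 5-tap binomial kernel, a common approximation of a Gaussian filter.
KERNEL = np.array([1, 4, 6, 4, 1], dtype=np.float64) / 16.0

def blur(img):
    """Separable smoothing with reflected borders; output has the input shape."""
    padded = np.pad(img, 2, mode="reflect")
    rows = np.apply_along_axis(
        lambda v: np.convolve(v, KERNEL, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda v: np.convolve(v, KERNEL, mode="valid"), 0, rows)

def gaussian_pyramid(img, min_size=8):
    """Repeatedly smooth and subsample by 2 until the next level would
    have a side shorter than min_size."""
    levels = [img.astype(np.float64)]
    while min(levels[-1].shape) // 2 >= min_size:
        levels.append(blur(levels[-1])[::2, ::2])
    return levels

def relative_cost(levels):
    """Total number of pixels over all levels, relative to the first level."""
    return sum(level.size for level in levels) / levels[0].size

levels = gaussian_pyramid(np.random.default_rng(0).random((64, 64)))
print([level.shape for level in levels], relative_cost(levels))
```

With a finite number of levels the relative cost stays strictly below 4/3, assuming the per-level processing time is proportional to the number of pixels.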



Regions are agglomerated into text line hypotheses by exhaustive search [1]
Each segmentation (region) is labeled by a FLANN classifier trained on synthetic data [2]
Multiple mutually exclusive segmentations with different label(s) are present in each text line hypothesis
(Figure: text line hypothesis with alternative region labels: P, A, ilI, n, m, f, f, n)
[1] L. Neumann, J. Matas, “Text localization in real-world images using efficiently pruned exhaustive search”, ICDAR 2011
[2] L. Neumann, J. Matas, “A method for text localization and recognition in real-world images”, ACCV 2010



Region A is a predecessor of a region B if A immediately precedes B in a text line
Approximated by a heuristic function based on text direction and mutual overlap
The relation induces a directed graph for each text line


The final region sequence of each text line is selected as an optimal path in the graph, maximizing the total score
Unary terms
◦ Text line positioning (prefers regions which “sit nicely” in the text line)
◦ Character recognition confidence

Binary terms (region pair compatibility score)
◦ Threshold interval overlap (prefers that neighboring regions have similar thresholds)
◦ Language model transition probability (2nd order character model)
(Figure: example text line “Accommodation”)
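Selecting the maximum-score path in the predecessor graph can be done with dynamic programming over a topological order. The following is a minimal sketch, assuming the graph is acyclic, that unary and binary scores are simply summed, and that a valid sequence runs from a region without predecessors to a region without successors. The function name and all scores are hypothetical; the slides do not specify how the terms are weighted.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def best_path(unary, binary):
    """unary: {region: score}; binary: {(a, b): score} for each edge a -> b.
    Returns (regions on the best path, total score)."""
    preds = {node: [] for node in unary}
    has_successor = set()
    for a, b in binary:
        preds[b].append(a)
        has_successor.add(a)

    best, back = {}, {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        if preds[node]:
            prev = max(preds[node], key=lambda a: best[a] + binary[(a, node)])
            best[node] = best[prev] + binary[(prev, node)] + unary[node]
            back[node] = prev
        else:
            best[node], back[node] = unary[node], None

    end = max((n for n in unary if n not in has_successor), key=best.get)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = back[node]
    return path[::-1], best[end]

# Two competing segmentations of the middle of a word: "i" + "n" versus "m".
unary = {"P": 0.9, "i": 0.4, "n": 0.5, "m": 0.8, "e": 0.9}
binary = {("P", "i"): 0.6, ("i", "n"): 0.7, ("n", "e"): 0.8,
          ("P", "m"): -0.5, ("m", "e"): 0.3}
print(best_path(unary, binary))
```

In this toy example the path through "i" and "n" wins although "m" has the higher individual recognition confidence, because the pairwise terms favour it. This mirrors the idea of postponing the segmentation decision until sequence-level context is available.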
ICDAR 2011 Dataset – Text Localization
pipeline   recall   precision   f      time
SM+SS      45.9     69.8        55.4   1.87s
SM+MS      55.5     75.2        63.8   2.35s
SWT+SS     38.0     66.0        48.0   0.60s
SWT+MS     41.0     80.0        54.0   0.84s
MLM+SS     62.1     85.9        72.0   2.52s
MLM+MS     67.5     85.4        75.4   3.10s
Single Maximum (SM): Segmentation with the highest CSER score
Multiple Local Maxima (MLM): Segmentations which correspond to local maxima of the CSER score
Stroke Width Transform (SWT): Reimplementation of character detection based on Epshtein et al. [1]
SS = Single Scale
MS = Multiple Scales (Gaussian pyramid)
[1] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes
with stroke width transform”, CVPR 2010
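The f column in the localization results above appears to be consistent with the harmonic mean of recall and precision. The sketch below recomputes it from the reported recall and precision values, under the assumption that f = 2·r·p / (r + p). The evaluation protocol may aggregate differently, so small deviations from the reported numbers are expected.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)

# (recall, precision, reported f) copied from the localization results above.
ROWS = {
    "SM+SS": (45.9, 69.8, 55.4),
    "SM+MS": (55.5, 75.2, 63.8),
    "SWT+SS": (38.0, 66.0, 48.0),
    "SWT+MS": (41.0, 80.0, 54.0),
    "MLM+SS": (62.1, 85.9, 72.0),
    "MLM+MS": (67.5, 85.4, 75.4),
}

for name, (recall, precision, reported) in ROWS.items():
    computed = f_measure(recall, precision)
    print(f"{name}: computed {computed:.1f}, reported {reported}")
```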
ICDAR 2011 Dataset – Text Localization
pipeline                               recall   precision   f
Proposed method                        67.5     85.4        75.4
Shi’s method [1]                       63.1     83.3        71.8
Kim’s method [2] (ICDAR 2011 winner)   62.5     83.0        71.3
Neumann & Matas [3]                    64.7     73.1        68.7
Yi’s method [4]                        58.1     67.2        62.3
TH-TextLoc System                      57.7     67.0        62.0
[1] C. Shi, C. Wang, B. Xiao, Y. Zhang, and S. Gao, “Scene text detection
using graph model built upon maximally stable extremal regions”, Pattern
Recognition Letters, 2013
[2] A. Shahab, F. Shafait, and A. Dengel, “ICDAR 2011 robust reading
competition challenge 2: Reading text in scene images”, ICDAR 2011
[3] L. Neumann and J. Matas, “Real-time scene text localization and
recognition”, CVPR 2012
[4] C. Yi and Y. Tian, “Text string detection from natural scenes by
structure-based partition and grouping”, Image Processing, 2011
[5] S. M. Hanif and L. Prevost, “Text detection and localization in complex scene images using constrained adaboost algorithm”, ICDAR 2009
ICDAR 2011 Dataset – End-to-End Text Recognition
pipeline                          recall   precision   f
Proposed method                   37.8     39.4        38.5
Neumann & Matas (CVPR 2012) [1]   37.2     37.1        36.5
Percentage of words correctly recognized without any error – case-sensitive comparison (ICDAR 2003 protocol)
[1] L. Neumann and J. Matas, “Real-time scene text localization and
recognition”, CVPR 2012
(Figure: sample recognition outputs: “chips”, “cut”, “CABOT”, “PLACF”, “FREEDON”)



Multi-scale processing / Gaussian Pyramid improves text localization results without a significant impact on speed
Combining several channels and postponing the decision about character detection parameters (e.g. binarization threshold) to a later stage improves localization and OCR accuracy
Method current state
◦ The method placed second in the ICDAR 2013 Text Localization competition, 1.4% worse than the winner (f-measure); unfortunately, end-to-end text recognition is not part of the competition
◦ Online demo available at http://www.textspotter.org/
◦ OpenCV implementation of the character detector in progress by the open source community

Future work
◦ OCR accuracy improvement
◦ Overcoming limitations of CC-based methods (e.g. non-linearity, non-robustness caused by a single pixel)
