Multi-Character Field Recognition for Arabic and Chinese Handwriting
Daniel Lopresti
Lehigh University
Bethlehem, PA 18015
lopresti@cse.lehigh.edu

George Nagy
Rensselaer Polytechnic Institute DocLab
Troy, NY 12180
nagy@ecse.rpi.edu

Sharad Seth
University of Nebraska
Lincoln, NE 68588
seth@cse.unl.ed

Xiaoli Zhang
Rensselaer Polytechnic Institute DocLab
Troy, NY 12180
zhangxl@rpi.edu
Abstract
Two methods, Symbolic Indirect Correlation (SIC) and
Style Constrained Classification (SCC), are proposed for
recognizing handwritten Arabic and Chinese words and
phrases. SIC reassembles variable-length segments of an
unknown query that match similar segments of labeled
reference words. Recognition is based on the
correspondence between the order of the feature vectors
and of the lexical transcript in both the query and the
references. SIC implicitly incorporates language context
in the form of letter n-grams. SCC is based on the notion
that the style (distortion or noise) of a character is a good
predictor of the distortions arising in other characters,
even of a different class, from the same source. It is
adaptive in the sense that with a long-enough field, its
accuracy converges to that of a style-specific classifier
trained on the writer of the unknown query. Neither SIC
nor SCC requires the query words to appear among the
references.
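The style-consistency idea behind SCC can be illustrated with a toy sketch. The model below is an assumption made purely for illustration and is not the classifier developed in this paper: it uses a single scalar feature, two classes, two writer styles with Gaussian class-conditional densities, and uniform class priors. All names (`MEANS`, `STYLE_PRIOR`, `field_likelihood`, `classify_field`) are hypothetical. The sketch compares labeling each character independently with labeling the whole field under the constraint that all characters share one unknown style.

```python
import math
from itertools import product

# Hypothetical toy model (not from the paper): 1-D feature, two classes,
# two writer styles. Style 1 shifts every class mean by +1.5.
MEANS = {0: {"a": 0.0, "b": 2.0}, 1: {"a": 1.5, "b": 3.5}}
STYLE_PRIOR = {0: 0.5, 1: 0.5}
SIGMA = 0.5
CLASSES = ("a", "b")


def gauss(x, mu, sigma=SIGMA):
    """Density of a 1-D Gaussian with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def field_likelihood(features, labels):
    """p(features | labels), marginalized over ONE style shared by the field."""
    total = 0.0
    for style, prior in STYLE_PRIOR.items():
        lik = prior
        for x, c in zip(features, labels):
            lik *= gauss(x, MEANS[style][c])
        total += lik
    return total


def classify_singlets(features):
    """Label each character on its own (style mixture per character)."""
    return tuple(
        max(CLASSES, key=lambda c: field_likelihood([x], [c])) for x in features
    )


def classify_field(features):
    """Label the whole field jointly, assuming a single common style."""
    return max(
        product(CLASSES, repeat=len(features)),
        key=lambda labels: field_likelihood(features, labels),
    )


# A two-character field written in style 1 with true labels ("a", "b").
query = [1.8, 3.6]
print(classify_singlets(query))
print(classify_field(query))
```

In this toy setting the first feature value is ambiguous in isolation, because it lies close to the mean of "b" in style 0. Once the second character makes style 1 far more likely for the whole field, the joint labeling resolves the first character accordingly. The exhaustive search over label tuples is exponential in field length and is used here only to keep the sketch short.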
1 Introduction
From the perspective of character recognition, Arabic and
Chinese are at the opposite ends of the spectrum. The
former has a small alphabet with word-position dependent
allographs, is quasi-cursive, and has “diacritics”,
ascenders and descenders. The latter has an indefinitely
large number of classes (of which only the first ~20,000
have been coded), essentially word-level symbols (many
with a radical-based substructure), and fixed-pitch block
characters. Arabic strokes can be approximated by arcs of
circles, while most Chinese strokes are straight, with a
~1:7 range in width (like brush strokes), and a flourish at