The complexity of machine recognition of the Arabic language due to its cursive
nature is well known. Urdu is a popular language written in an Arabic-based
script, but it uses a special calligraphic style of writing known as Nastaliq.
The calligraphic nature of Nastaliq and other linguistic properties of Urdu
introduce many further complexities which must be kept in mind when developing
an OCR system. This paper introduces the complexities and open issues which are
unique to the Urdu language and the Nastaliq style of writing from an OCR point
of view.


Optical Character Recognition (OCR) is a branch
of Pattern Recognition which is used to recognize text, normally in the
form of digitally scanned images or live text drawn by a user through
some digital input device. The Urdu language possesses some of the properties
which are considered most challenging in the character recognition world. The
most common of them is cursiveness, which it inherits from Arabic. The
complexities of recognizing Urdu script are much greater than those of Arabic
script and thus require much more attention and out-of-the-box thinking. This
paper is written specifically to introduce all the complexities, to the best of
my knowledge, which are unique to the recognition of Urdu Nastaliq script.

Cursiveness is inherent to Urdu: characters are joined with each other when
written and take on a new shape. This characteristic, shared with Arabic, makes
it very difficult for the machine to segment each character separately and
recognize it. Not every character in Urdu and Arabic connects with the other
characters, and some connect only from one side. Some of the characters in the
character set are also used as diacritic marks. These include Toy (ط) and
Hamza (ء). Separate diacritics are also used in Urdu, as in Arabic, such as
zer (ـِ), zaber (ـَ), pesh (ـُ), shadd (ـّ), etc., but they are much less common
than in Arabic text. Dots are also very common and significant. In Urdu a
character may contain up to three dots above, below or inside it. 17 out of
38 characters in Urdu have dots: 10 have 1 dot, 2 have 2 dots and 5 have
3 dots. Characters in Urdu may also overlap each other vertically.
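
The joining behaviour described above can be illustrated with a minimal Python
sketch that splits a word into its connected groups of characters. The function
name `split_into_connected_groups` and the two letter sets are assumptions made
for illustration only: the sets list common letters that connect only to the
preceding character (or to none), they are not an exhaustive table, and the
sketch assumes a single word without diacritics.

```python
# Letters that join only to the preceding letter, so the connected group
# always ends after them. Illustrative subset, not an exhaustive table.
RIGHT_JOINING = {
    "\u0627",  # ALEF
    "\u0622",  # ALEF WITH MADDA ABOVE
    "\u062F",  # DAL
    "\u0688",  # DDAL
    "\u0630",  # THAL
    "\u0631",  # REH
    "\u0691",  # RREH
    "\u0632",  # ZAIN
    "\u0698",  # JEH
    "\u0648",  # WAW
    "\u06D2",  # YEH BARREE
}

# Letters that connect to neither neighbour.
NON_JOINING = {"\u0621"}  # HAMZA


def split_into_connected_groups(word):
    """Split one word (no spaces, no diacritics) into connected groups."""
    groups = []
    current = ""
    for ch in word:
        if ch in NON_JOINING:
            # A non-joining letter closes any open group and stands alone.
            if current:
                groups.append(current)
            groups.append(ch)
            current = ""
            continue
        current += ch
        if ch in RIGHT_JOINING:
            # The next letter cannot attach to this one, so close the group.
            groups.append(current)
            current = ""
    if current:
        groups.append(current)
    return groups


# The word "Pakistan": PEH ALEF KEHEH SEEN TEH ALEF NOON
pakistan = "\u067E\u0627\u06A9\u0633\u062A\u0627\u0646"
print(split_into_connected_groups(pakistan))
```

Under these assumptions the seven characters of the example word fall into
three connected groups, which is why segmentation cannot simply treat each
character as a separate shape.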

Urdu is written in the Nastaliq style, unlike Arabic/Persian,
which are written in the Naskh style. Nastaliq is a calligraphic style known
for its beauty which originated from the combination of two styles, Naskh and
Taliq. A less elaborate version of the style is used for writing printed Urdu.
The credit for computerizing Nastaliq goes to Mirza Ahmed Jameel, who created
20,000 Nastaliq ligatures in 1980, ready to be used in computers for printing.
He called it Noori Nastaliq. Many people followed and created their own
Nastaliq style fonts, among which Jameel Noori Nastaliq, Alvi Nastaliq and Faiz
Lahori Nastaliq are popular. All the Nastaliq fonts fulfill the basic
characteristics of the Nastaliq writing style.

Urdu Optical Character Recognition can be divided
into two major subcategories: offline recognition and online recognition.



Offline recognition means attempting to recognize
text which is already present in the form of printed or handwritten material.
Thus offline recognition can be further divided into two categories:
recognition of printed text and recognition of handwritten text.



Online recognition refers to real-time recognition
as the user moves the pen to write something. Thus online recognition only
involves handwritten text. Online recognition is considered less complex than
offline recognition because in online recognition the
temporal information of pen traces is available, which is not the case in
offline recognition. Most of the people who worked
in Urdu character recognition only attempted to recognize isolated characters.
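
The difference in available information between online and offline recognition
can be shown with a small Python sketch. The data layout is an assumption made
for illustration: an online sample is a list of strokes, each a time-ordered
list of hypothetical `(x, y, t)` pen samples, while the offline view is a plain
binary image. The helper `rasterize` is likewise hypothetical.

```python
def rasterize(strokes, width, height):
    """Offline view: mark every visited pixel, discarding time and order."""
    image = [[0] * width for _ in range(height)]
    for stroke in strokes:
        for x, y, _t in stroke:
            image[y][x] = 1
    return image


# Online view: the same horizontal line drawn in two opposite directions.
left_to_right = [[(0, 1, 0.00), (1, 1, 0.01), (2, 1, 0.02)]]
right_to_left = [[(2, 1, 0.00), (1, 1, 0.01), (0, 1, 0.02)]]

# The pen traces differ, but the resulting images are identical.
print(left_to_right != right_to_left)
print(rasterize(left_to_right, 3, 3) == rasterize(right_to_left, 3, 3))
```

Both comparisons evaluate to true: the two pen traces are distinguishable as
online data, but once rasterized the stroke direction and timing can no longer
be recovered from the image alone.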

Two major approaches to the recognition of
complete Urdu text are found in the literature: segmentation based and
segmentation free.


