Optical Character Recognition
Optical Character Recognition is still an area that needs extensive research. This article gives a brief overview of what Optical Character Recognition is, the problems involved in it, and the strides technology has made towards perfecting it.
Optical Character Recognition (OCR) is a method of making printed, typewritten or handwritten data readable by a computer. The aim of OCR is generally to store the data in a digital format in which it can be edited on a machine and, most importantly, searched by keyword. OCR generally involves a machine deciphering the data, converting it into a machine-readable format and storing it on the machine, which is usually a computer.
How Optical Character Recognition is done
The first step in Optical Character Recognition is to scan and process the document. A layer of OCR text (the OCR symbols) is then added behind each image in the scanned document. To make sure that the characters are recognized properly, another filter may be used in conjunction with the first.
With the filters in place, individual characters are identified using a dictionary built into the software. Each pattern found on the page is matched against the patterns already in the dictionary to work out which character it most likely stands for. The result is then converted into readable text, and this text is what the user sees as the output of Optical Character Recognition.
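The dictionary matching described above can be illustrated with a very small sketch. It assumes each character has already been isolated as a tiny black-and-white grid, and it compares that grid pixel by pixel against a few stored patterns. The 3x5 templates, the names TEMPLATES, similarity and recognize, and the scoring rule are all made up for illustration; real OCR software uses far larger dictionaries and more robust matching.

```python
# Hypothetical 3x5 glyph dictionary ('#' = ink, '.' = blank paper).
# A real OCR dictionary would hold many more, much finer patterns.
TEMPLATES = {
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "T": ["###",
          ".#.",
          ".#.",
          ".#.",
          ".#."],
}


def similarity(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            if g == t:
                same += 1
    return same / total


def recognize(glyph, templates=TEMPLATES):
    """Return (character, score) for the dictionary pattern closest to glyph."""
    best = max(templates, key=lambda ch: similarity(glyph, templates[ch]))
    return best, similarity(glyph, templates[best])


# A slightly smudged "T": one pixel of the top bar is missing.
noisy_t = ["##.",
           ".#.",
           ".#.",
           ".#.",
           ".#."]

char, score = recognize(noisy_t)
print(char, round(score, 3))
```

Because the best match is chosen by score rather than by exact equality, a slightly damaged character can still be identified, which is also why a badly smudged one can be mistaken for a different letter.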
If the document is too smudged, high-end techniques such as multi-light image capture might be employed. This also helps when the document has shadows on it along page folds.
Problems in Optical Character Recognition today
The benefits of OCR are clear, but a lot of advancement is still needed in the field. OCR is not yet a perfect science, and scanned documents commonly come out of Optical Character Recognition with errors. There are several reasons why perfection in OCR is proving elusive:
- People have hugely different styles of writing. On top of that, most people do not write with the same speed, compactness and density of ink all the time. Usually there is no remotely similar pattern to be discerned between the writing styles of two different people, which makes it very difficult for any software to recognize common patterns. Today, OCR works much better for discrete handwritten characters than for cursive handwriting. The more the letters run together, the more difficult it is for OCR to identify them.
- OCR works well only if the letters are clearly discernible. This depends on many things, from the color and cleanliness of the paper the text is printed on to the age of the paper. It is very difficult to identify the symbols on dirty or smudged paper, or on paper that has aged.
- Another, smaller problem is unevenness of the paper that carries the text. The paper might be creased or, if it is a page of a book, it is very difficult to identify the letters near the central area of the book, where the inward slope of the page creates shadows.
- The major shortcoming so far is the lack of a common language that all forms of OCR can use to recognize patterns in the text to be identified. Most OCR methods today rely on sets of codified symbols to recognize characters, and whatever success OCR has achieved so far is due to the establishment of these symbolic patterns.
Because Optical Character Recognition has not yet achieved perfection, users must be prepared for errors. That is why OCR is normally followed by a human review.
Since OCR deals with vastly different kinds of material, its success varies widely from field to field. Let us take the fields up one by one.
- OCR in Text Identification - Among written scripts, recognition of Latin script has been honed to near perfection, with an error rate of only about 1% in Latin character recognition. But Latin characters are simpler (with fewer strokes, curves and lines) than many other characters used around the world. Scripts such as Chinese characters are very difficult for OCR. Printed text is recognized better than handwritten text.
- OCR in Music Identification - The music industry has tried removing the lines from sheet music to make it suitable for OCR, with a fair degree of success. Handwritten music, however, is very difficult to recognize. Photoscore Ultimate 5 from Neuratron is named as the only software application that attempts it, and its output is not even close to perfect.
- OCR in Magnetic Ink Character Identification - Magnetic ink character identification is very important in banks, where checks need to be processed. Special fonts such as E-13B and CMC-7 are used for magnetic ink character identification, a process that is used globally today. This kind of identification reproduces the original characters with a high degree of accuracy.
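The article does not say how the 1% error rate quoted above for Latin characters is measured. One common way to put a number on OCR quality is the character error rate: the fewest single-character insertions, deletions and substitutions needed to turn the OCR output into the correct text, divided by the length of the correct text. The sketch below works under that assumption; the function names edit_distance and character_error_rate and the sample strings are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1,          # delete ch_a
                           cur[j - 1] + 1,       # insert ch_b
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the correct (reference) text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Illustrative strings: the OCR output has misread "ni" as "m".
reference = "recognition"
ocr_output = "recogmtion"
rate = character_error_rate(reference, ocr_output)
print(f"{rate:.1%}")
```

Under this measure, a 1% error rate corresponds to roughly one wrong character in every hundred, which on a full page of text still leaves a noticeable number of mistakes for a human reviewer to catch.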
