Optical Character Recognition for Coptic (OCR)
This page describes how you can convert scanned documents (for example, older books) into text or Word files using free tools (OCR = Optical Character Recognition).
There are many very good and accurate commercial OCR tools, such as FineReader from ABBYY or OmniPage from Nuance, but they all lack a convenient way of adding new Unicode languages. They are also relatively expensive.
One popular free OCR engine that supports Unicode is tesseract-ocr. There are several free front ends (GUIs) for Windows, such as VietOCR and Paperfile FreeOCR.
On this page I describe in more detail how you can OCR Coptic documents, that is, convert Coptic images (TIFF, image PDFs) into Unicode Coptic text.
You have 3 options:
- Use tesseract directly (applies to Linux and Windows). You will mainly have to enter all commands manually at a command prompt. The generated text can then be opened and edited in any Unicode editor.
- Use VietOCR as a front end (GUI for Windows).
- Use Paperfile FreeOCR as a front end (GUI for Windows).
A few general notes:
- To achieve good results, scan the documents at a resolution of at least 300 dpi and save the scans in black-and-white TIFF format. I also recommend using an image program to de-noise, deskew and clean the scanned image.
- Scan only documents containing pure Coptic text. Recognition quality for Coptic texts set in old fonts can be very poor, depending on the trained data.
- The overall performance cannot keep up with commercial tools, but you will get an output file in Coptic Unicode.
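The first option (running tesseract directly) could be scripted roughly as follows. This is only a sketch: it assumes the tesseract program is on your PATH, that Coptic trained data is installed under the language code "cop", and the file names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="cop"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    # "cop" is an assumed language code; it must match the installed trained data.
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_to_text(image_path, output_base, lang="cop"):
    """Run tesseract on one image and return the recognized text."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name; the text is read as UTF-8.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling ocr_to_text("page1.tif", "page1") would then run tesseract on the hypothetical scan page1.tif and read back page1.txt, which can be opened in any Unicode editor.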
Tesseract
You can download tesseract at: http://code.google.com/p/tesseract-ocr/. Follow the instructions there to install it. Out of the box, tesseract cannot recognize Coptic; you must train it first. This means that you need some sample TIFF pages with Coptic text, and you must then "tell" tesseract which letter is found where in the image. The training process is described in more detail on the tesseract home page. The more samples with different fonts tesseract learns, the better the overall recognition quality. Training it for just one font will lead to almost perfect recognition quality, but only for that font. The training process is somewhat tedious. There are some graphical tools that simplify it, such as:
- jTessBoxEditor (my favorite, from the VietOCR developer).
- bbtesseract, with a very clear GUI and options.
- Tessboxer.
Coptic trained data that I have generated:
- Tesseract version 2 (right click on link to download) - trained with several fonts.
- Tesseract version 3 (right click on link to download) - trained with several fonts.
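To illustrate what "telling" tesseract which letter is found where means: during training, each letter of a sample page is listed in a so-called box file, one line per letter, with the letter followed by its bounding box. The sketch below parses such a line. It assumes the usual layout "symbol left bottom right top [page]" with coordinates counted from the bottom-left corner of the image; the sample numbers are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str   # the letter as Unicode text
    left: int
    bottom: int
    right: int
    top: int
    page: int     # 0 if the line carries no page number


def parse_box_line(line):
    """Parse one box-file line such as 'ⲁ 36 92 53 111 0'."""
    parts = line.split()
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 fields, got %d" % len(parts))
    left, bottom, right, top = (int(p) for p in parts[1:5])
    page = int(parts[5]) if len(parts) == 6 else 0
    return Box(parts[0], left, bottom, right, top, page)


# Illustrative values only: the Coptic letter alfa with an invented bounding box.
box = parse_box_line("\u2c81 36 92 53 111 0")
width = box.right - box.left
height = box.top - box.bottom
```

The graphical tools listed above essentially let you correct these lines with the mouse instead of editing the numbers by hand.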
VietOCR.NET
Follow these steps if you want to use my trained data with tesseract and VietOCR.NET:
- Download and install VietOCR.NET.
- Open the file "ISO639-3.xml", which is installed (normally at C:\Program Files\VietUnicode\VietOCR.NET\Data), with any editor and add the following line: <entry key="cop">Coptic</entry>
- Download the Coptic files I have generated (see above) and copy them into the directory: C:\Program Files\VietUnicode\VietOCR.NET\tessdata
- Now, if you start VietOCR.NET, you should be able to select Coptic as the language. Use the Font menu to select a Coptic Unicode font.
- Open a scanned document (tiff format, black/white, no compression), mark the region you would like to recognize and then press "OCR". You can use this sample for testing.
- If you are not satisfied with the results, you can optimize the recognition for your font. Follow the instructions on the tesseract home page for training your own data. CAUTION: it is time consuming!
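If you prefer not to edit "ISO639-3.xml" by hand, the added line can also be inserted with a small script. The following is a minimal sketch, assuming the file is a flat list of <entry> elements inside a single root element; the function name is my own, and you should back up the file before overwriting it.

```python
def add_language_entry(xml_text, key="cop", name="Coptic"):
    """Return xml_text with <entry key="cop">Coptic</entry> added before the closing root tag."""
    if 'key="%s"' % key in xml_text:
        # The language is already registered; leave the text unchanged.
        return xml_text
    close_at = xml_text.rfind("</")  # start of the last closing tag (assumed to be the root)
    if close_at == -1:
        raise ValueError("no closing tag found")
    entry = '<entry key="%s">%s</entry>\n' % (key, name)
    return xml_text[:close_at] + entry + xml_text[close_at:]


# Small stand-in for the real file content (assumed structure).
sample = '<properties>\n<entry key="vie">Vietnamese</entry>\n</properties>\n'
updated = add_language_entry(sample)
```

For the real file you would read its text, pass it through add_language_entry and write the result back to the same path.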
FreeOCR
Working with the FreeOCR front end is similar to working with VietOCR above. You can get FreeOCR at: Paperfile FreeOCR. The Coptic files must be unzipped to C:\WINDOWS\tessdata.
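Unzipping the Coptic files can be done with any archive program. As a sketch, the same step in a script might look like this, assuming the trained data was downloaded as a zip archive (the archive name below is hypothetical):

```python
import zipfile
from pathlib import Path


def unpack_trained_data(archive_path, tessdata_dir):
    """Extract all files from the archive into tessdata_dir and return their names."""
    target = Path(tessdata_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

For FreeOCR this would be called as unpack_trained_data("coptic.zip", r"C:\WINDOWS\tessdata"), where coptic.zip stands for whatever the downloaded archive is called.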
Good luck!
Moheb Mekhaiel
