Two bigrams based language model for auto correction of Arabic OCR errors

Habeeb, Imad Q. and Mohd Yusof, Shahrul Azmi and Ahmad, Faudziah (2014) Two bigrams based language model for auto correction of Arabic OCR errors. International Journal of Digital Content Technology and its Applications (JDCTA), 8 (1). pp. 72-80. ISSN 2233-9310

Official URL: http://www.aicit.org/jdcta/global/paper_detail.htm...

Abstract

In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. In this paper, a language model based on two bigrams that uses Wikipedia's database is presented. The method can perform automatic detection and correction of non-word errors in Arabic OCR text, and automatic detection of real-word errors. The method consists of two parts: extracting the context information from Wikipedia's database, and implementing the automatic detection and correction of incorrect words. This method can be applied to any language with little modification. The experimental results show successful extraction of context information from Wikipedia's articles. Furthermore, they show that using this method can reduce the error rate of Arabic OCR text.
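
The abstract does not give the scoring details, so the following is only a minimal sketch of how a correction step driven by two bigrams might look, not the authors' implementation. It assumes that candidates for a non-word are vocabulary entries within a small edit distance, and that each candidate is scored by summing the counts of its left bigram (previous word, candidate) and right bigram (candidate, next word). The function names, the `max_dist` parameter, and the toy English corpus standing in for Wikipedia text are all hypothetical.

```python
from collections import Counter


def build_model(sentences):
    """Count words and adjacent word pairs (bigrams) in tokenized sentences."""
    vocab = Counter()
    bigrams = Counter()
    for words in sentences:
        vocab.update(words)
        bigrams.update(zip(words, words[1:]))
    return vocab, bigrams


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def correct(words, vocab, bigrams, max_dist=2):
    """Replace each out-of-vocabulary word with the candidate that is best
    supported by its left and right bigrams (assumed scoring, see lead-in)."""
    out = list(words)
    for i, w in enumerate(words):
        if w in vocab:
            continue  # only non-word errors are corrected in this sketch
        left = out[i - 1] if i > 0 else None
        right = words[i + 1] if i + 1 < len(words) else None
        candidates = [c for c in vocab if edit_distance(w, c) <= max_dist]
        if not candidates:
            continue  # nothing close enough; leave the word unchanged

        def score(c):
            # Sum of both bigram counts, with word frequency as tie-breaker.
            return (bigrams[(left, c)] + bigrams[(c, right)], vocab[c])

        out[i] = max(candidates, key=score)
    return out


# Toy corpus standing in for context extracted from Wikipedia (assumption).
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "cat", "ate", "the", "fish"]]
vocab, bigrams = build_model(corpus)
print(correct(["the", "cot", "sat"], vocab, bigrams))
```

In this sketch the non-word "cot" has several close candidates ("cat", "sat", "on", "mat"), and the surrounding context decides between them: only "cat" is supported by both the left bigram ("the", "cat") and the right bigram ("cat", "sat"). Detecting real-word errors, which the abstract also mentions, would need an additional check on in-vocabulary words and is not shown here.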

Item Type:	Article
Subjects:	Q Science > QA Mathematics > QA76 Computer software
Divisions:	School of Computing
Depositing User:	Dr. Shahrul Azmi Mohd. Yusof
Date Deposited:	11 Nov 2014 09:03
Last Modified:	15 May 2016 01:07
URI:	https://repo.uum.edu.my/id/eprint/12602
