Abstract:
Letter-sequence statistics are used to solve many text-processing tasks. One of the most difficult of these tasks is automatic error correction. Errors appear in a text when it is either typed or scanned. Four types of errors are usually found in typed text: a missing letter, an extra letter, a transposition of two letters, and a wrong letter. Scanned text contains entirely different types of errors. In general, error types vary and depend on document quality, font type, and the text recognition program. A dictionary is usually used to find erroneous words. However, many text elements, such as personal names, abbreviations, abridgements, and firm names, cannot be found in a dictionary and do not need to be corrected, so a block that identifies these elements is necessary. We detected erroneous words by their entropy, computed with a statistical letter-trigram model. We found that almost all words with an entropy higher than 4.5 are erroneous. After analyzing the most frequent errors, we built a confusion table that is used to determine the correct word. The word with minimal entropy is considered to be correct.
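
The detection and correction steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the entropy formula, so the sketch assumes the average negative log2 probability per letter trigram with add-one smoothing. The boundary symbols `^` and `$`, the `alphabet_size` value, the representation of the confusion table as `(wrong, right)` string pairs, and all function names are hypothetical. The 4.5 threshold is taken from the text, but it depends on the entropy definition and training corpus and would need recalibration for this sketch. The block that skips names and abbreviations is omitted.

```python
import math
from collections import Counter


def train_trigram_counts(words):
    """Count letter trigrams and their two-letter contexts in a word list."""
    tri, bi = Counter(), Counter()
    for w in words:
        padded = "^^" + w.lower() + "$"  # assumed boundary symbols
        for i in range(len(padded) - 2):
            tri[padded[i:i + 3]] += 1
            bi[padded[i:i + 2]] += 1
    return tri, bi


def word_entropy(word, tri, bi, alphabet_size=28):
    """Assumed definition: mean -log2 P(c3 | c1 c2), add-one smoothed."""
    padded = "^^" + word.lower() + "$"
    n = len(padded) - 2
    total = 0.0
    for i in range(n):
        t = padded[i:i + 3]
        p = (tri[t] + 1) / (bi[t[:2]] + alphabet_size)
        total -= math.log2(p)
    return total / n


def candidates(word, confusions):
    """Apply each (wrong, right) confusion pair once at every position."""
    out = {word}
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            out.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return out


def correct(word, tri, bi, confusions, threshold=4.5):
    """Keep low-entropy words; otherwise pick the minimal-entropy candidate."""
    if word_entropy(word, tri, bi) <= threshold:
        return word
    return min(candidates(word, confusions),
               key=lambda c: word_entropy(c, tri, bi))


# Toy usage: "rn" misread for "m" is a typical scanning confusion.
toy_corpus = ["model", "modern", "mode", "more", "corn", "form"]
tri, bi = train_trigram_counts(toy_corpus)
print(correct("rnodel", tri, bi, [("rn", "m")], threshold=4.0))
```

On such a tiny corpus the entropy values are compressed, which is why the usage line passes a lower threshold than 4.5. With a realistic training corpus the threshold would be chosen empirically, as the abstract describes.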