TICCLops / tesseract-ocr

Online Text Induced Corpus Clean-up

TICCL (Text Induced Corpus Clean-up) is a system designed by Dr. Martin Reynaert that searches a text collection for all existing variants of every word occurring in it. TICCL builds word frequency lists that record, for each word type, how often it occurs in the corpus. These lists can then be used to detect and correct spelling mistakes in the texts automatically.
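
To make the idea of a word frequency list concrete, here is a minimal Python sketch. It is not TICCL's actual algorithm, only a simplified illustration: it counts word types in a tiny corpus and treats a rare word as a likely variant of a much more frequent word that is one edit away. The function names (word_frequencies, edit_distance, suggest_corrections) and the thresholds max_distance and min_ratio are assumptions made for this example.

```python
from collections import Counter
import re


def word_frequencies(text):
    """Count how often each word type occurs (lowercased, letters only)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest_corrections(freqs, max_distance=1, min_ratio=3):
    """Map each rare word to a much more frequent word close to it.

    Illustrative heuristic only: a candidate must occur at least
    min_ratio times as often and lie within max_distance edits.
    """
    suggestions = {}
    for word, count in freqs.items():
        candidates = [
            (other_count, other)
            for other, other_count in freqs.items()
            if other != word
            and other_count >= min_ratio * count
            and edit_distance(word, other) <= max_distance
        ]
        if candidates:
            # Prefer the most frequent candidate.
            suggestions[word] = max(candidates)[1]
    return suggestions


corpus = "the house the house the house the hovse and the garden"
freqs = word_frequencies(corpus)
print(freqs.most_common(2))
print(suggest_corrections(freqs))
```

On this toy corpus the sketch flags "hovse" as a likely variant of "house", because "house" occurs three times as often and differs by a single character. Comparing every pair of words, as done here, would be far too slow for a real corpus.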

We developed the TICCLops online interface for the members of the Dutch CLARIN community. It combines TICCL with tesseract-ocr so that users can transform scanned images of documents into automatically corrected, machine-readable texts.