EE Seminar: Word-Spotting applications for historical documents
Speaker: Adi Silberpfennig
M.Sc. student under the supervision of Prof. Lior Wolf
Sunday, February 5th, 2017 at 15:30
Room 011, Kitot Bldg., Faculty of Engineering
Word-Spotting applications for historical documents
Historical documents have been undergoing large-scale digitization in recent years, making massive image collections available online. Optical character recognition (OCR) quality for historical manuscripts and for documents printed in old typefaces is still lacking. As an alternative, or as a complement, one can perform an image-based search.
In this talk, we present a simple and efficient pipeline for word spotting in historical documents and show how it is used in several applications.
First, we propose an effective unsupervised pipeline for improving OCR output. It uses a baseline OCR engine as a black box, together with a dataset of unlabeled document images. Given a new document to be analyzed, the black-box recognition engine is applied first. For each recognition result, word spotting is carried out within the dataset, and the spotting results are then used to improve the OCR output.
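The flow of such a pipeline can be sketched roughly as follows. This is only an illustration, not the method presented in the talk: the abstract does not say how the spotting results are used, so the majority vote below is an assumption, as are the names `improve_ocr`, `spot` and `min_votes`. The sketch also assumes that the black-box OCR engine has already been run on the unlabeled dataset, so that each spotted word image comes with its own OCR reading.

```python
from collections import Counter


def improve_ocr(ocr_words, spot, min_votes=2):
    """Correct OCR readings by voting over visually similar words.

    ocr_words: list of (image_id, ocr_text) pairs for the new document.
    spot:      hypothetical callable; given an image_id, it returns the
               OCR readings of the dataset word images that word
               spotting matched to that image.
    min_votes: assumed threshold; a reading replaces the original only
               if it collects at least this many votes.
    """
    corrected = []
    for image_id, text in ocr_words:
        # Count the readings of the spotted matches, plus the original.
        votes = Counter(spot(image_id))
        votes[text] += 1
        candidate, count = votes.most_common(1)[0]
        # Keep the original reading unless the vote is strong enough.
        corrected.append(candidate if count >= min_votes else text)
    return corrected


# Toy stand-in for a word spotting engine (illustrative data only).
matches = {"w1": ["the", "the", "the"], "w2": []}
result = improve_ocr([("w1", "tbe"), ("w2", "house")],
                     lambda image_id: matches[image_id])
print(result)  # ['the', 'house']
```

Here the misread "tbe" is replaced because three visually similar word images were all read as "the", while "house" has no matches and is left unchanged.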
We also present an image-based approach for retrieving related articles in a newspaper. Given a dictionary, a synthetic image is generated for every word in it, and each of these word images is treated as a query. A set of unlabeled documents is first fed into the word spotting engine. Then, based on the spotting results, a normalized TF-IDF vector representation is computed for every document, and article retrieval is performed by a nearest-neighbor search.
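The retrieval step can be illustrated with a minimal sketch. It assumes that word spotting has already produced, for every document, the list of dictionary words found in it. The standard TF-IDF weighting (raw count times log inverse document frequency), the L2 normalization, and the use of cosine similarity for the nearest-neighbor search are assumptions made for illustration, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(doc_spots, vocabulary):
    """Build L2-normalized TF-IDF vectors from spotting results.

    doc_spots:  one list per document, holding the dictionary words
                that word spotting found in that document.
    vocabulary: ordered list of dictionary words (vector dimensions).
    """
    n_docs = len(doc_spots)
    # Document frequency: in how many documents each word was spotted.
    df = Counter()
    for spots in doc_spots:
        df.update(set(spots))
    vectors = []
    for spots in doc_spots:
        tf = Counter(spots)
        vec = [tf[w] * math.log(n_docs / df[w]) if df[w] else 0.0
               for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vectors


def nearest_neighbor(query_index, vectors):
    """Return the index of the document most similar to the query.

    Vectors are unit length, so the dot product is cosine similarity.
    """
    query = vectors[query_index]
    best, best_sim = None, -1.0
    for i, vec in enumerate(vectors):
        if i == query_index:
            continue
        sim = sum(a * b for a, b in zip(query, vec))
        if sim > best_sim:
            best, best_sim = i, sim
    return best


# Toy spotting results for four articles (illustrative data only).
doc_spots = [
    ["war", "army", "army", "peace"],
    ["army", "war", "general"],
    ["wheat", "market", "price"],
    ["market", "price", "price", "trade"],
]
vocabulary = sorted({w for spots in doc_spots for w in spots})
vectors = tfidf_vectors(doc_spots, vocabulary)
print(nearest_neighbor(0, vectors))  # 1
```

In this toy example the two military articles retrieve each other, as do the two market articles, because they share spotted words that are rare in the rest of the collection.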
Finally, we present an operational word spotting engine. In collaboration with the Friedberg Genizah Project, we developed a real-time word spotting engine that is incorporated into a large-scale collection of historical manuscripts, the Cairo Genizah.