Table of Contents
INTRODUCTION
SUBJECT THEORY
Segmentation
Alignment
HTR systems
Related Work
Machine learning for image segmentation
DATA & SOFTWARE
Complications with the data
Software & Python packages
Ethics and conflict of interest
ALIGNMENT ALGORITHM DEVELOPMENT
Page image preprocessing
Line segmentation
Word segmentation
Interactive correction
Self learning
User interface
PERFORMANCE EXPERIMENTS
RESULTS
Performance of the algorithm
Visualisation of the algorithm pipeline
DISCUSSION
Performance analysis
The importance of ground truth quality
CLOSING REMARKS
Conclusion
Future work
REFERENCES
SUPPLEMENTARY FILES
Abbreviations
GMM Gaussian Mixture Model
GT Ground Truth
HPP Horizontal Projection Profiling
HTR Handwritten Text Recognition
IoU Intersection over Union
ROI Region Of Interest
Introduction
Segmenting a document image into text lines or words and then aligning them to create an annotation is an important preprocessing step in many document understanding tasks. Handwritten text found in historical records, such as tables, cannot be automatically transcribed by traditional Optical Character Recognition systems. Handwritten Text Recognition (HTR) is used instead, and it has shown exceptional performance in tasks such as text-line segmentation, keyword spotting, and character recognition, among others (De Gregorio et al. 2023). HTR models usually require large amounts of annotated data for learning. Such annotations can be obtained through a process called alignment, which uses manual transcriptions of parts of the document and links the corresponding text images to the transcribed words. This project started as part of an effort to digitize historical document images.
This thesis aims to develop a pipeline that goes from a raw historical document image to segmented words, and then aligns the segmented words to a transcription or ground truth (GT), with interactive self-learning built into the algorithms. A first version of a suitable user interface is also developed. The research objectives were:
Develop an interactive algorithm for the alignment of historical document images
Select appropriate methods for all parts of the pipeline
Integrate self-learning in the algorithm
Evaluate the performance of the algorithm
The key questions to be answered were: What factors affect the segmentation and alignment? How can self-learning be integrated?
The algorithm created for this thesis is intended to work on one image at a time and was primarily tested on the Labour’s Memory data set (Chapter 3). For the algorithm to work as intended, the images must fulfil some conditions. The text lines in the document images need to be approximately straight. The document images can contain only a limited amount of noise; with too much noise, the algorithms will not perform as intended. The words in the image can overlap only to a limited extent, although the characters within a word may overlap, since the algorithm only needs to segment the image into words. The algorithms are intended to segment only lines and words; symbols such as [ , : ; ” - ) ( ] are not individually segmented, as they are not words.
Figure 1: An example image, taken from the Labour’s Memory dataset
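The requirement that text lines be approximately straight can be illustrated with a minimal sketch of horizontal projection profiling (HPP), which appears in the list of abbreviations. The sketch assumes a binarized page where ink pixels are 1 and background pixels are 0; the function name `segment_lines_hpp`, the `min_ink` threshold, and the toy images are hypothetical and are not taken from the thesis implementation.

```python
import numpy as np

def segment_lines_hpp(binary, min_ink=1):
    """Return (top, bottom) row ranges of text lines in a binary image.

    binary: 2-D array with 1 = ink and 0 = background (an assumption).
    min_ink: hypothetical threshold on ink pixels per row.
    """
    profile = binary.sum(axis=1)      # ink pixels in each row
    is_text = profile >= min_ink      # rows that contain enough ink
    lines, start = [], None
    for y, on in enumerate(is_text):
        if on and start is None:
            start = y                 # a text line begins
        elif not on and start is not None:
            lines.append((start, y))  # a text line ends (bottom is exclusive)
            start = None
    if start is not None:
        lines.append((start, len(is_text)))
    return lines

# Toy page with two straight text lines separated by empty rows.
page = np.zeros((12, 20), dtype=int)
page[2:4, 3:17] = 1
page[7:10, 2:15] = 1
lines = segment_lines_hpp(page)

# Same page, but the first line drifts downwards and fills the gap.
slanted = np.zeros((12, 20), dtype=int)
for x in range(20):
    slanted[2 + x // 3, x] = 1
slanted[7:10, 2:15] = 1
merged = segment_lines_hpp(slanted)
```

On the straight toy page the profile has an empty gap between the two lines, so they are separated. On the slanted page the gap disappears and both lines are returned as a single range. Heavy noise has a similar effect, because stray ink fills the rows between lines.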