Semi-automatic Segmentation & Alignment of Handwritten

Date: 07.09.2023

Table of Contents

INTRODUCTION
SUBJECT THEORY
1. Segmentation
2. Alignment
3. HTR systems
4. Related Work
5. Machine learning for image segmentation
DATA & SOFTWARE
1. Complications with the data
2. Software & Python packages
3. Ethics and conflict of interest
ALIGNMENT ALGORITHM DEVELOPMENT
1. Page image preprocessing
2. Line segmentation
3. Word segmentation
4. Interactive correction
5. Self learning
6. User interface
PERFORMANCE EXPERIMENTS
RESULTS
1. Performance of the algorithm
2. Visualisation of the algorithm pipeline
DISCUSSION
1. Performance analysis
2. The importance of ground truth quality
CLOSING REMARKS
1. Conclusion
2. Future work

REFERENCES
SUPPLEMENTARY FILES

Abbreviations

GMM Gaussian Mixture Model
GT Ground Truth
HPP Horizontal Projection Profiling
HTR Handwritten Text Recognition
IoU Intersection over Union
ROI Region Of Interest

Introduction

The process of segmenting a document image into text lines or words and then aligning them to create an annotation is an important preprocessing step in many document-understanding tasks. Handwritten text found in historical records, such as tables, cannot be automatically transcribed by traditional Optical Character Recognition systems. Handwritten Text Recognition (HTR) is used instead, and it has shown exceptional performance in fields such as text-line segmentation, keyword spotting, and character recognition, among others (De Gregorio et al. 2023). HTR models usually require large amounts of annotated data for learning. Such annotations can be obtained through a process called alignment, which uses manual transcriptions of parts of the document and links the corresponding text images to the transcribed words. This project started as part of an effort to digitize historical document images.

This thesis aims to develop a pipeline that goes from a raw historical document image to segmented words and then aligns the segmented words to a transcription, or ground truth (GT), with interactive self-learning built into the algorithms. The start of a suitable user interface is also developed. The research objectives were:

Develop an interactive algorithm for the alignment of historical document images

Select appropriate methods for all parts of the pipeline

Integrate self-learning in the algorithm

Evaluate the performance of the algorithm

The key questions to be answered were: What factors affect the segmentation and alignment? How can self-learning be integrated?

The algorithm created for this thesis is intended to work on one image at a time and was primarily tested on the Labour’s Memory data set (Chapter 3). For the algorithm to work as intended, the images must fulfil some conditions. The text lines in the document images need to be approximately straight. The document images can contain only a limited amount of noise; with too much noise, the algorithms will not perform as intended. The words in the image can overlap only to a limited extent, although the characters within a word may overlap, since the algorithm only needs to segment the image into words. The algorithms are intended to segment only lines and words; symbols such as [ , : ; ” - ) ( ] are not individually segmented, as they are not words.

Figure 1: An example image, taken from the Labour’s Memory dataset
