The main data set used in this project is Labour’s Memory, the people’s movement archive for Uppsala County, also known as the Labour’s movement archive. Labour’s Memory consists of digitised and annotated documents from the period 1892–1985. The data set contains 1836 .jpg images divided into 31 folders, each corresponding to the department or area that the images relate to. Each document image has three additional files connected to it. The first is a .txt file containing the raw digitised transcription with correct line breaks. The second is a .xml file containing metadata and information about the image, including coordinates for the text region and each text line, as well as a digitised transcription. The third file contains information about when and how the document was processed for OCR, together with coordinates for the page, text lines, baselines, words, and spaces.
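The file layout described above can be sketched as a small indexing routine. This is a minimal illustration, not the project's actual loader: it assumes that the .txt and .xml files share the base name and folder of their image, which the text does not state explicitly. The function names are hypothetical, and the third (OCR processing) file is left out because its name and extension are not given.

```python
from pathlib import Path


def _existing(path):
    """Return the path if the file exists, otherwise None."""
    return path if path.is_file() else None


def collect_documents(root):
    """Group every .jpg page with its companion .txt and .xml files.

    Assumption: companion files sit next to the image and differ only
    in their extension (page1.jpg, page1.txt, page1.xml).
    """
    documents = []
    for image_path in sorted(Path(root).rglob("*.jpg")):
        documents.append({
            # The folder name stands for the department or area.
            "folder": image_path.parent.name,
            "image": image_path,
            "transcript": _existing(image_path.with_suffix(".txt")),
            "metadata": _existing(image_path.with_suffix(".xml")),
        })
    return documents
```

Entries whose "transcript" or "metadata" value is None point to pages with missing companion files, which is worth checking before any evaluation run.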
The data sets accessed during the project are listed in Table 1. Labour’s Memory and Demokrati 100 are written in Swedish. In contrast, the IAM data set is written in English and is publicly available for non-commercial research purposes only. It is provided by the Research Group on Computer Vision and Artificial Intelligence INF, University of Bern (Marti & Bunke 2002). Labour’s Memory and IAM were used for evaluation and testing during development, while Demokrati 100 was only used for internal testing of the algorithm during development.
Complications with the data
One crucial factor to consider is the complexity of the given data in the Labour’s Memory
data set. Two images from this data set can look very different depending on the period
Table 1: Information about the data sets used in the project.
Data set | Pages (no.) | Ground Truth | Format
Labour’s Memory | 1836 | transcript & word/line boxes | JPG
IAM | 1539 | transcript & word/line boxes | PNG
Demokrati 100 | 4487 | transcript & word/line boxes | JPG
Figure 3: Two example images taken from the Labour’s Memory data set, which differ in style. (a) Year 1967. (b) Year 1899.
it was created in. The overall quality of the documents is very heterogeneous: although most images contain some form of noise, the amount varies considerably (see Figure 3). Some documents are handwritten, while others are written using a typewriter. Handwritten documents are often very challenging to preprocess because handwriting varies so much between writers. Some documents have holes from a hole puncher; if these holes are not removed during the preprocessing steps, they will most likely affect the positioning of the bounding boxes further down the pipeline. Additionally, in some documents, characters overlap between lines. Characters such as ’f’ and ’g’ commonly extend into an adjacent text line, which makes line and word separation more complex. Figure 3 shows two examples of images from the Labour’s Memory data set with varying amounts of noise and in different styles.
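To make the line overlap problem concrete, the sketch below flags consecutive text lines whose bounding boxes intersect. It is only an illustration under stated assumptions: the box format (x_min, y_min, x_max, y_max) in pixels, with y growing downwards, and the function name are hypothetical and are not taken from the data set's .xml files.

```python
def overlapping_line_pairs(line_boxes):
    """Return index pairs of vertically adjacent lines whose boxes intersect.

    Each box is (x_min, y_min, x_max, y_max); this format is an
    assumption made for the example.
    """
    # Order the lines from top to bottom by their upper edge.
    ordered = sorted(range(len(line_boxes)), key=lambda i: line_boxes[i][1])
    pairs = []
    for upper, lower in zip(ordered, ordered[1:]):
        ux0, uy0, ux1, uy1 = line_boxes[upper]
        lx0, ly0, lx1, ly1 = line_boxes[lower]
        # The upper line reaches below the top of the lower line.
        vertical = uy1 > ly0
        # The two lines share some horizontal extent.
        horizontal = ux0 < lx1 and lx0 < ux1
        if vertical and horizontal:
            pairs.append((upper, lower))
    return pairs


boxes = [(0, 0, 100, 30), (0, 25, 100, 55), (0, 60, 100, 90)]
print(overlapping_line_pairs(boxes))  # [(0, 1)]
```

In the example, the first line ends at y = 30 while the second starts at y = 25, so the pair is reported; the third line starts below the second and is not.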