To store the resulting alignment information in a structured way, the XML (eXtensible Markup Language) file format is used. One of the more common XML schemas for document layout analysis is the "Analyzed Layout and Text Object" (ALTO) schema. It was initially developed by the METAe project group as a standard for describing word positions and physical layout (Library of Congress 2016). This schema is also the basic XML structure used in this project. The transcript for the page, for each line, and for each word is stored and numbered in order, along with the bounding-box coordinates of every word (Figure 12).
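As a rough illustration of this kind of structure, the following sketch builds a minimal ALTO-style tree with Python's standard library. The element and attribute names (Layout, Page, PrintSpace, TextBlock, TextLine, String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) come from the ALTO schema, but the function name `build_alto`, the ID scheme, and the input format are assumptions made for the example. The namespace declaration and the page-level and line-level transcripts are omitted for brevity, so this is not the project's actual output format.

```python
import xml.etree.ElementTree as ET


def build_alto(lines):
    """Build a minimal ALTO-style tree (hypothetical helper).

    lines: list of text lines, each a list of (word, x, y, width, height).
    """
    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page", ID="page_1")
    space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(space, "TextBlock", ID="block_1")

    for li, words in enumerate(lines, start=1):
        # The line box is taken as the extent of all its word boxes.
        x1 = min(x for _, x, _, _, _ in words)
        y1 = min(y for _, _, y, _, _ in words)
        x2 = max(x + w for _, x, _, w, _ in words)
        y2 = max(y + h for _, _, y, _, h in words)
        line_el = ET.SubElement(
            block, "TextLine", ID=f"line_{li}",
            HPOS=str(x1), VPOS=str(y1),
            WIDTH=str(x2 - x1), HEIGHT=str(y2 - y1),
        )
        # Words are numbered in reading order within each line.
        for wi, (text, x, y, w, h) in enumerate(words, start=1):
            ET.SubElement(
                line_el, "String", ID=f"line_{li}_word_{wi}", CONTENT=text,
                HPOS=str(x), VPOS=str(y), WIDTH=str(w), HEIGHT=str(h),
            )
    return alto


if __name__ == "__main__":
    root = build_alto([[("Hello", 10, 20, 50, 15), ("world", 70, 22, 48, 14)]])
    print(ET.tostring(root, encoding="unicode"))
```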
Interactive correction
Each image processed by the algorithm requires different parameter settings to obtain a decent result. Two parameters were found to be significant. The first is the threshold value for the binarisation in the noise removal step of preprocessing, which strongly affects the result because it decides how much information to keep or remove depending on pixel intensity. The second is the minimum size of a gap for it to be considered a blank space in the word segmentation; this value decides which combined components are considered words. In the algorithm, these parameters can be chosen interactively during the process using a window and a slider. As the slider is adjusted, and with it the parameter in question, the window shows the effect that the chosen value has on the image.
Figure 13: Visualisation of the three types of segmentation errors. a) Oversegmentation. b) Undersegmentation. c) Correction.
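To make the role of the gap parameter concrete, the sketch below merges connected components along one text line into words whenever the horizontal gap between them is smaller than the chosen minimum. The function name `group_into_words` and the representation of components as (x1, x2) extents are assumptions made for illustration; the slider window itself is left out.

```python
def group_into_words(components, min_gap):
    """Merge components of one text line into words (hypothetical helper).

    components: list of (x1, x2) horizontal extents of connected components.
    min_gap: minimum gap width that counts as a blank space between words.
    Returns a list of (x1, x2) word extents, ordered left to right.
    """
    words = []
    for x1, x2 in sorted(components):
        if words and x1 - words[-1][1] < min_gap:
            # Gap too small to be a blank space: extend the current word.
            words[-1] = (words[-1][0], max(words[-1][1], x2))
        else:
            # Gap large enough (or first component): start a new word.
            words.append((x1, x2))
    return words


if __name__ == "__main__":
    comps = [(0, 10), (12, 20), (30, 40), (41, 50)]
    for gap in (1, 5, 15):
        print(gap, group_into_words(comps, gap))
```

A small value splits the line into many boxes, while a large value merges neighbouring words, which is why the value has to be tuned per image.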
As the segmentation is not always completely correct, it is important to be able to correct the errors. Three types of error generally occur (Figure 13). Oversegmentation is when there are too many bounding boxes, for example when one word is split across two boxes. Undersegmentation is when there are too few bounding boxes, meaning that two words are contained in a single box or that a word is missed entirely. The third type is when a box needs to be adjusted so that it correctly encapsulates the intended word.
For this reason, a method was developed for resolving segmentation errors. The method allows the user to remove and add both text-line and word bounding boxes. A box is removed by clicking on it to mark it. A box is added using the package opencvdragrect (Chavan 2021), which allows rectangles to be drawn with a click-and-drag motion.
Since it can be difficult to encapsulate a word perfectly when drawing a new bounding box, a method developed by Vats & Hast (2017) was used. This method uses the connected components in the selected box to choose the corrected coordinates. If the IoU (intersection over union) between a component and the selected box is greater than 0.01, the component is considered part of the word. The extent of every such component is checked in each direction, and the extreme values (x1, x2, y1, y2) are chosen as the corrected coordinates.
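The correction step described above can be sketched as follows. This is a minimal illustration and not the implementation by Vats & Hast. The function names `iou` and `refine_box` are assumptions, as is the box layout (x1, y1, x2, y2), which differs from the (x1, x2, y1, y2) ordering in the text. Returning the user's box unchanged when no component qualifies is also an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = min(ax2, bx2) - max(ax1, bx1)
    ih = min(ay2, by2) - max(ay1, by1)
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not overlap
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union


def refine_box(selected, components, min_iou=0.01):
    """Snap a hand-drawn box to the components it overlaps.

    selected: the user's box; components: boxes of connected components.
    """
    kept = [c for c in components if iou(c, selected) > min_iou]
    if not kept:
        return selected  # assumption: keep the user's box if nothing matches
    # Take the extreme coordinate in each direction over the kept components.
    return (
        min(c[0] for c in kept),
        min(c[1] for c in kept),
        max(c[2] for c in kept),
        max(c[3] for c in kept),
    )


if __name__ == "__main__":
    drawn = (10, 10, 60, 30)
    comps = [(8, 12, 30, 28), (32, 9, 65, 31), (200, 10, 220, 30)]
    print(refine_box(drawn, comps))
```

The low threshold of 0.01 means that almost any overlap with the drawn box is enough for a component to be included, so a roughly drawn rectangle still captures the whole word.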
All of the alignment algorithms listed in 4.3 depend on the existence of a GT to produce an output. If a GT does not exist for a page, it must be generated, so the option to transcribe the text manually on the fly is available. An image of the document page is shown, and the transcription is performed line by line. The current line to be transcribed is marked with its bounding box, and all words in the line are marked with their respective bounding boxes, as shown in Figure 20. Once a line has been transcribed, the algorithm moves to the next until the transcription is finished. Linear alignment (see 4.3.1) is then used to align the newly made transcript with the word bounding boxes.