Figure 7: A document image before (left) and after (right) applying the noise removal algorithm.
Line segmentation is the first stage of the segmentation process; its goal is to split the text in the document image into text lines. Two line segmentation methods were developed: GMM-based line segmentation (Method 1), which is heavily based on the work of McKeen (2021), and HPP-Closeness-based line segmentation (Method 2). The two methods share the same preprocessing steps. Both first apply Canny edge detection, which finds the edges in the image and serves mainly as preparation for the next step. Connected components with 8-way connectivity are then used to find components in the region of interest (ROI) of the image. Two pixels are considered 8-connected if (i) they are 8-adjacent and (ii) their grey levels satisfy a similarity criterion, here that they have equal intensity (Gonzalez & Woods 2008). The resulting connected components are then filtered to remove unwanted components caused by noise. The filters are based on area, height, horizontal ratio, vertical ratio, and how close a component is to other components in y-value.
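The component extraction and filtering step can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes a binary edge image is already available (for instance from a Canny detector), uses `scipy.ndimage.label` for the 8-connected labelling, and applies only two of the listed filters (area and height). The function name `filter_components` and the thresholds `min_area` and `max_height` are hypothetical.

```python
import numpy as np
from scipy import ndimage


def filter_components(binary, min_area=4, max_height=50):
    """Label 8-connected components and keep those passing simple size filters.

    `min_area` and `max_height` are hypothetical thresholds, not values
    taken from the text. Returns (top, bottom, left, right) boxes where
    bottom and right are exclusive.
    """
    # A 3x3 structuring element of ones makes diagonal neighbours connected,
    # which gives 8-way connectivity (the default in ndimage.label is 4-way).
    eight_connected = np.ones((3, 3), dtype=int)
    labels, _ = ndimage.label(binary, structure=eight_connected)

    boxes = []
    for index, box in enumerate(ndimage.find_objects(labels), start=1):
        rows, cols = box
        height = rows.stop - rows.start
        # Area is the number of pixels that carry this component's label.
        area = int(np.count_nonzero(labels[box] == index))
        if area >= min_area and height <= max_height:
            boxes.append((rows.start, rows.stop, cols.start, cols.stop))
    return boxes
```

The remaining filters (horizontal ratio, vertical ratio, and closeness in y-value) would be added as further conditions in the same loop.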
GMM-based line segmentation
In this method, several techniques are combined to obtain a robust segmentation. The filtered components are used to fit a Gaussian mixture model (GMM), which clusters the components by y-value. One parameter the GMM requires is the number of classes to cluster into, which should equal the number of lines in the document. This number is calculated from the horizontal projection profile (HPP) of the image. For every pixel row, the pixel intensities are summed and plotted, which creates a profile of peaks and valleys, as seen in Figure 8 (a). Each text line in the image produces a peak in the plot. The peaks are counted with the Python function find_peaks from the package scipy.signal, which gives the predicted number of lines in the document.
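A minimal sketch of the line-counting step is given below. It assumes a binarised image in which ink pixels are 1 and background pixels are 0; the function name `count_text_lines` is hypothetical. The default values of `height` and `distance` are the ones reported for the tested documents and would likely need adjusting for other image resolutions.

```python
import numpy as np
from scipy.signal import find_peaks


def count_text_lines(binary, height=16, distance=100):
    """Estimate the number of text lines from the horizontal projection profile.

    Assumes `binary` holds 1 for ink and 0 for background.
    Returns the number of peaks and the row index of each peak.
    """
    # Horizontal projection profile: one summed value per pixel row.
    hpp = binary.sum(axis=1)
    # Keep peaks that are at least `height` high and at least
    # `distance` rows apart from each other.
    peaks, _ = find_peaks(hpp, height=height, distance=distance)
    return len(peaks), peaks
```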
For find_peaks to return peaks that correspond to text lines, the parameters height and distance need to be set sensibly. The height is the minimum value a peak must reach to be counted; height=16 was found to work correctly in the cases that were tested. The distance is the minimum separation between two peaks for both to be counted; distance=100 was sufficient for the documents that were tested. Figure 8 (b) shows an example output from the GMM analysis. Once the y-values of the lines have been calculated, the connected components of the page are each assigned to a y-value cluster. The coordinates of the line bounding box are the extremes in all directions (min(top), max(bottom), min(left), max(right)) over all components assigned to that line. The use of the HPP to obtain the number of lines was developed during this thesis, while the other steps are heavily based on the work of McKeen (2021).
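The clustering and bounding-box steps can be sketched as below, using `GaussianMixture` from scikit-learn. This is an illustration under stated assumptions rather than the thesis code: the y-value of a component is taken to be the vertical centre of its box, the function name `line_boxes` is hypothetical, and `seed` is only there to make the fit repeatable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def line_boxes(components, n_lines, seed=0):
    """Cluster component boxes by y-value and return one box per text line.

    `components` is a sequence of (top, bottom, left, right) boxes.
    Assumption: the y-value of a component is its vertical centre.
    """
    comps = np.asarray(components)
    y_centres = ((comps[:, 0] + comps[:, 1]) / 2.0).reshape(-1, 1)

    # One Gaussian per expected text line, fitted on the 1-D y-values.
    gmm = GaussianMixture(n_components=n_lines, random_state=seed)
    assignment = gmm.fit_predict(y_centres)

    boxes = []
    for k in range(n_lines):
        members = comps[assignment == k]
        if len(members) == 0:
            continue  # a cluster can end up empty if n_lines is too large
        # The line box is the extreme of every side over all members.
        boxes.append((
            int(members[:, 0].min()),  # min(top)
            int(members[:, 1].max()),  # max(bottom)
            int(members[:, 2].min()),  # min(left)
            int(members[:, 3].max()),  # max(right)
        ))
    return sorted(boxes)  # top-to-bottom order
```

Here `n_lines` would be the peak count obtained from the horizontal projection profile.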
Figure 8: (a) Horizontal projection profile; each peak in the number of black pixels corresponds to a line of text. (b) Example of the output lines from the Gaussian mixture model.