Several Python packages and other software were used during the development of the algorithm (Table 2). For specific versions and further practical information, see the GitHub repository Text_alignment_and_segmentation.
Table 2: The software and Python packages used in the development of the algorithm.
Python Packages
anvil_uplink
bayesian_optimization
beautifulsoup4
matplotlib
numpy
opencv_python
Pillow
scikit_learn
scipy
screeninfo
xmltodict
Ethics and conflict of interest
This project involves no conflict of interest and does not require ethical approval.
There are, however, some ethical aspects to consider. Where the data comes from and what it contains are often significant from an ethical standpoint. The images in Labour’s Memory show old activity reports and annual reports, which are public documents that are not subject to any form of secrecy.
Alignment Algorithm Development
This chapter describes how the algorithm was implemented and developed, including the techniques used and the complications encountered. Two methods were developed, which will be referred to as Method 1 and Method 2. Figure 4 gives a brief overview of the algorithm pipeline. Note that the steps in 4.1.2, 4.2.1, and 4.3 are heavily based on work by McKeen (2021). For specific details about the algorithm, see the GitHub repository Text_alignment_and_segmentation.
Preprocessing is the first part of the pipeline, where the goal is to prepare the image for line segmentation. The preprocessing pipeline in this project includes page border removal, page hole removal, and noise removal.
Page border removal
Page border removal improves the quality of the image by removing anomalies found at the image’s border and framing the document more precisely. To accomplish this, the contours and edges of the image are used to find and sort the four document corners. A perspective transformation matrix is then computed between this original set of points and a new set of target points. The image is transformed with this matrix using the warpPerspective method from OpenCV, and the new image is returned with its corners equal to the corners of the paper. An example of the output can be seen in Figure 5. The code for this part was taken from Kumar (2020).
Figure 4: A brief overview of the algorithm pipeline from the raw image to alignment with parts included.
Figure 5: A document image before (left) and after (right) applying the page border removal algorithm.