Tezislər / Theses



BHOS Tezisler 2022 17x24sm

THE 3rd INTERNATIONAL SCIENTIFIC CONFERENCES OF STUDENTS AND YOUNG RESEARCHERS dedicated to the 99th anniversary of the National Leader of Azerbaijan Heydar Aliyev
structure, as a multitude of words can be generated by adding suffixes or other affixes to the 
words. Therefore, the traditional approach of dictionary lookup based on string similarity is not suitable 
for the Azerbaijani language. This paper discusses the implementation of a spell-check 
system that combines Levenshtein's algorithm with Recurrent Neural Networks. 
A character-based sequence-to-sequence model is developed, consisting of three main 
parts: an encoder, a decoder, and an attention mechanism. To train the model, a dataset 
containing 15,000 pairs of incorrect and correct sentences in Azerbaijani was collected using 
scraping techniques. After training on this dataset, the model was tested on 1,000 misspelled 
sentences and achieved 94% accuracy.
Nowadays, information transmitted through email, social media and 
chat applications contains many orthographic errors. Content with many 
grammatical and spelling mistakes is generally hard to read. 
A reliable spelling checker for users who write in Azerbaijani 
on social media and blogs can therefore eliminate the 
need to spend extra time manually checking the texts they have written. 
First, the typing behaviour of Azerbaijani users on social media was 
analysed to determine the most common spelling mistakes 
they make. The letters “ə ö ü ğ ı ç ş” were found to be the most frequently 
misspelled, since most Azerbaijani users type on an English keyboard. For 
example, the letter “ş” is usually typed as “w”, “s” or “sh”. 
Accordingly, when misspelled text is generated, a letter is not replaced with 
a random letter but with a letter it is most often confused with or mistyped as. 
The main challenge was that no dataset of labelled Azerbaijani sentences 
was available on data platforms such as Kaggle. 
The dataset was therefore created by applying web scraping techniques to 
collect content from Azerbaijani pages and blogs on social media 
platforms such as Facebook and Twitter. The correct version of each sentence 
was added manually in an adjacent field. To address data scarcity, data 
augmentation was then applied: wrong sentences were generated from correct 
ones by substituting letters with their most common misspellings, 
as discussed above. This expanded the dataset to cover the possible 
misspelled versions of words that follow from users' typing 
behaviour and improved the accuracy of the model.
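The substitution-based augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the “ş” entry of the confusion table comes from the text, and the other entries, the names `CONFUSIONS`, `corrupt` and `make_pairs`, and the `rate` and `copies` parameters are assumptions made for the example.

```python
import random

# Confusion table: each Azerbaijani letter maps to the ways it is commonly
# mistyped. Only the "ş" entry is taken from the text; every other entry is
# an illustrative assumption about typing on an English keyboard.
CONFUSIONS = {
    "ş": ["w", "s", "sh"],  # from the text
    "ə": ["e"],             # assumed
    "ö": ["o"],             # assumed
    "ü": ["u"],             # assumed
    "ğ": ["g"],             # assumed
    "ı": ["i"],             # assumed
    "ç": ["c", "ch"],       # assumed
}


def corrupt(sentence, rate=0.5, rng=None):
    """Return a misspelled copy of `sentence`.

    Each letter found in CONFUSIONS is replaced, with probability `rate`,
    by one of its common mistypings instead of by a random letter.
    """
    rng = rng or random.Random()
    out = []
    for ch in sentence:
        options = CONFUSIONS.get(ch)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(correct_sentences, copies=3, rate=0.5, seed=0):
    """Build (wrong, correct) training pairs from correct sentences."""
    rng = random.Random(seed)
    return [
        (corrupt(sentence, rate, rng), sentence)
        for sentence in correct_sentences
        for _ in range(copies)
    ]
```

Calling `make_pairs` on a list of correct sentences yields several differently corrupted variants of each one, which is one way a small manually labelled set could be expanded.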
Researchers have long worked on spell checking 
and have introduced traditional techniques such as dictionary lookup based on 
string similarity and distance metrics. The Levenshtein distance algorithm, 
which uses edit distance to measure the similarity 
between two strings, is utilized here. The most straightforward method of spell checking is 
to construct a dictionary containing all correctly written words and 
use a similarity algorithm to find the dictionary words most similar 
to the target word. The time cost of this approach must be considered, since 
each given word is compared with every word in the dictionary.
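The dictionary lookup just described can be sketched with a standard dynamic-programming Levenshtein distance. The function names `levenshtein` and `suggest`, the `max_distance` threshold and the tiny word list in the usage note are assumptions for illustration; the text does not specify how the authors implemented this step.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b.

    Counts the minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance of word, closest first.

    Every dictionary entry is compared with the input word, so the cost
    grows linearly with the size of the dictionary.
    """
    scored = [(levenshtein(word, entry), entry) for entry in dictionary]
    return [entry for dist, entry in sorted(scored) if dist <= max_distance]
```

For instance, `suggest("kitap", ["kitab", "kitabxana", "alma"])` compares the input against all three entries and keeps only `"kitab"`, which is one substitution away. The full scan over the dictionary is the time cost mentioned above.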


