THE 3rd INTERNATIONAL SCIENTIFIC CONFERENCES OF STUDENTS AND YOUNG RESEARCHERS dedicated to the 99th anniversary of the National Leader of Azerbaijan Heydar Aliyev
structure, as a multitude of word forms can be generated by adding suffixes or other affixes to
words. Therefore, the traditional approach of dictionary lookup based on string similarity is not well suited
to the Azerbaijani language. This paper discusses the implementation of a spell-checking
system that combines Levenshtein’s algorithm with Recurrent Neural Networks.
A character-based sequence-to-sequence model was developed, consisting of three main
parts: an encoder, a decoder, and an attention mechanism. To train the model, a dataset
containing 15,000 pairs of incorrect and correct sentences in Azerbaijani was collected using
web scraping techniques. After training on this dataset, the model was tested on 1,000 misspelled
sentences and achieved an accuracy of 94%.
Nowadays, information transmitted through emails, social media and
chat applications contains many orthographical errors. It is generally hard to
focus on reading content that has many grammatical and spelling mistakes.
Therefore, developing a reliable spelling checker for users who create
content in the Azerbaijani language on social media and blogs can eliminate the
need to spend extra time manually checking the texts they have written.
First, the typing behaviour of Azerbaijani users on social media was
analysed to determine the most common spelling mistakes
they make. It was found that “ə ö ü ğ ı ç ş” are the most frequently misspelled letters,
since most Azerbaijani people use an English keyboard. For
example, the letter “ş” is usually typed as “w”, “s” or “sh”.
Accordingly, instead of substituting a letter with a random one, each letter is replaced with
the form it is most often mistyped as.
The main challenge was that no dataset of
labelled Azerbaijani sentences was available on data platforms such as Kaggle.
Therefore, the dataset was created by applying web scraping techniques to
collect content from Azerbaijani pages and blogs on social media
platforms such as Facebook and Twitter. The correct counterparts of the collected sentences were manually added in
adjacent fields, and data augmentation was then applied to
address data scarcity by generating wrong sentences from correct
ones, substituting letters with their most common
misspellings as discussed above. This expanded the dataset so that it
covers the misspelled versions of words that are likely given people's typing
behaviour, and it improved the accuracy of the model.
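The augmentation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `CONFUSIONS`, `corrupt` and `make_pairs`, the `rate` and `copies` parameters, and every substitution other than “ş” → “w”, “s”, “sh” are assumptions made for illustration. Upper-case letters are left untouched for brevity.

```python
import random

# Hypothetical confusion map. Only the entry for "ş" is taken from the text;
# the remaining substitutes are assumed ASCII look-alikes, not reported data.
CONFUSIONS = {
    "ş": ["w", "s", "sh"],
    "ə": ["e"],
    "ö": ["o"],
    "ü": ["u"],
    "ğ": ["g"],
    "ı": ["i"],
    "ç": ["c", "ch"],
}


def corrupt(sentence, rate=0.5, rng=None):
    """Return a copy of `sentence` in which each confusable letter is
    replaced, with probability `rate`, by one of its common mistypings."""
    rng = rng or random.Random()
    out = []
    for ch in sentence:
        options = CONFUSIONS.get(ch)
        if options and rng.random() < rate:
            out.append(rng.choice(options))  # targeted, not random, substitution
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(correct_sentences, copies=3, rate=0.5, seed=0):
    """Build (wrong, correct) training pairs from correct sentences."""
    rng = random.Random(seed)
    return [
        (corrupt(sentence, rate, rng), sentence)
        for sentence in correct_sentences
        for _ in range(copies)
    ]


if __name__ == "__main__":
    for wrong, right in make_pairs(["şəhər gözəldir"], copies=3):
        print(wrong, "->", right)
```

Because substitutions are drawn only from the confusion map, the generated errors follow the observed typing behaviour rather than uniform noise, which is the point made in the text.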
Researchers have been working on spell checking for a long time
and have introduced traditional techniques such as dictionary lookup based on
string similarity and distance metrics. The Levenshtein distance algorithm,
which measures the similarity between two strings on the basis of edit distance,
is utilized here. The most straightforward method for spell checking is
to construct a dictionary containing all correctly written words and
use a similarity algorithm to find the dictionary words most similar
to the target word. The time cost of this approach must be considered, however, because it compares
each given word with every word in the dictionary.
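As an illustration of the dictionary lookup just described, the following minimal sketch computes the Levenshtein distance with the standard dynamic-programming recurrence and ranks dictionary words by that distance. The function names `levenshtein` and `closest_words`, the `limit` parameter and the sample word list are assumptions for illustration and are not taken from the system described in this paper.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, counting insertions,
    deletions and substitutions, each with cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def closest_words(word, dictionary, limit=3):
    """Return the `limit` dictionary words nearest to `word`.
    Every dictionary entry is compared, so the cost grows linearly
    with the dictionary size."""
    ranked = sorted(dictionary, key=lambda w: (levenshtein(word, w), w))
    return ranked[:limit]


if __name__ == "__main__":
    words = ["şəhər", "şəkər", "səhər"]  # hypothetical mini-dictionary
    print(levenshtein("weher", "şəhər"))  # 3
    print(closest_words("weher", words, limit=2))
```

In this example, “weher” is equally distant from “şəhər” and “səhər”, which shows why plain string similarity can be ambiguous when several letters of a word are mistyped at once. The full scan over the dictionary in `closest_words` is the time cost mentioned above.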