A comparative study of support vector machine and logistic regression article · January 021 citations reads 11 authors




1.4 Aim and Objectives
This report aims to provide a comprehensive comparison of the classification accuracy of the 
support vector machine (SVM) and the logistic regression (LR) applied to different data sets of 
different sample sizes and composition. 
The specific objectives are:
To find out which method is consistently better than the other. 
To find out if any of the classifiers consistently performs better than the other.
To find out if specific datasets would require the use of one particular classifier rather than the other.
To compare their accuracies. 
To draw conclusions based on the results of the comparison. 
2.0 Research Methodology 
In this section, we describe how this study has been designed to ensure that the results obtained 
are valid and reliable and address the research aim and objectives. The discussion covers the 
classification tools used in this study and the secondary datasets analysed. We also discuss the 
procedure for arriving at a valid conclusion from the data analysis. 




ORIENTAL JOURNAL OF SCIENCE & ENGINEERING VOL -2, ISS-1, FEB - 2021 
www.ojse.org
ojse©2019 
Page 89 
2.1 Classification Methods 
The classification methods considered in this study are: 
The support vector machine (SVM) and 
The logistic regression (LR). 
Both tools have been introduced in Section 1. We use them to carry out classification on 
eighteen different datasets and then determine whether one tool outperforms the other or whether 
both classification procedures perform comparably. 
3.0 Results from Data Analysis 
Table 4.1 presents the accuracy rates of the SVM with a linear kernel (SVM_LK), the SVM with a 
Gaussian kernel (SVM_GK) and logistic regression (LR) on each dataset. The accuracy rates were 
obtained using a two-part (hold-out) validation in which each dataset is split into a training set 
and a test set: 70% of each dataset was used to train each classifier and the remaining 30% was 
used as the test set. The table reports the results of applying the three classifiers (SVM_LK, 
SVM_GK and LR) to all the datasets employed in this analysis. The software used for this purpose 
is R, and the R programs written for this research are contained in the Appendices.
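The 70/30 split and the accuracy rate described above can be sketched as follows. The study's own programs are written in R and are given in the Appendices; this Python sketch is only an illustration. The function names `holdout_split` and `accuracy`, the fixed random seed and the toy values are assumptions and are not taken from the study.

```python
import random


def holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumed) for repeatability
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Fraction of test observations whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy illustration with hypothetical values (not data from the study).
train_rows, test_rows = holdout_split(list(range(20)))
toy_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

In this sketch each classifier would be fitted on `train_rows` only, and `accuracy` would be evaluated on the labels of `test_rows`.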
Based on Table 4.1, the performances of the classifiers appear to be very close on most 
datasets. In a few cases, the SVMs appeared to outperform LR. For instance, on the Bupa, 
Air quality and Infant mortality datasets, the SVMs clearly outperformed LR. Conversely, on the 
Cmc and Saheart datasets, LR outperformed the SVM using the linear kernel. The fact that the SVMs 
perform only marginally better than LR on several other datasets underscores how 
powerful both classifiers are on these datasets. The datasets on which the SVMs marginally 
outperformed LR are the Banana, Titanic, Heart disease, Monk, Tae, Pima, Haberman and 
Immunotherapy datasets. 
In general, one is left to assume that no classifier genuinely outperformed the others. However, to 
ascertain whether this assumption is true, an ANOVA test will be carried out. The test considers 
the following null and alternative hypotheses: 
$H_0$: The average performances of the three classifiers do not differ from each 
other ($\mu_{SVM\_LK} = \mu_{SVM\_GK} = \mu_{LR}$). 
$H_1$: $H_0$ is not true.
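The test statistic behind these hypotheses can be sketched as follows, assuming a one-way ANOVA with the classifier as the only factor. This is an illustrative Python sketch, not the R program used in the study. The function name `one_way_anova_f` and the toy values are assumptions. The p-value is left out because it requires the F distribution, which the Python standard library does not provide.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy illustration with hypothetical values (not accuracies from the study):
# one list per classifier.
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_between, df_within = one_way_anova_f(toy_groups)
```

A large F relative to the F distribution with `df_between` and `df_within` degrees of freedom would count as evidence against $H_0$. If each of the three classifiers contributes one accuracy per dataset on all eighteen datasets, these degrees of freedom would be 2 and 51.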
For this test to proceed, we must ensure that the assumptions underpinning the test are met. The 
first is the normality assumption. A Shapiro-Wilk normality test (Appendix C.1) gives p-values of 
0.9046 and 0.9119 for the SVMs and LR respectively, so the hypothesis that the data on the 
performances of the SVMs and LR are normally distributed cannot be rejected.
The second assumption to investigate is that of equal variances. To check whether this 
assumption holds, we carry out an F-test for the equality of two variances in R. With a p-value 
of 0.5585 (Appendix C.2), the null hypothesis that both variances are equal, given the 
datasets, cannot be rejected.
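The statistic used by an F-test for the equality of two variances is the ratio of the two sample variances. The Python sketch below is illustrative only and is not the R program of Appendix C.2. The function name `variance_ratio_f` and the toy values are assumptions, and the p-value is again omitted because it requires the F distribution.

```python
from statistics import variance


def variance_ratio_f(sample_a, sample_b):
    """Return (F, df_a, df_b), where F = var(sample_a) / var(sample_b)."""
    # statistics.variance computes the sample variance (divisor n - 1).
    f_stat = variance(sample_a) / variance(sample_b)
    return f_stat, len(sample_a) - 1, len(sample_b) - 1


# Toy illustration with hypothetical values (not accuracies from the study).
ratio, df_a, df_b = variance_ratio_f([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

A ratio close to 1 is consistent with equal variances. The decision itself is made by comparing the ratio with the F distribution on `df_a` and `df_b` degrees of freedom.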
Prior to this, however, the boxplot of Figure 4.1 shows that the SVM using the Gaussian 
kernel (SVM_GK) recorded the highest median, followed by the SVM using the linear kernel 



