ORIENTAL JOURNAL OF SCIENCE & ENGINEERING VOL -2, ISS-1, FEB - 2021
www.ojse.org
ojse©2019
2.1 Classification Methods
The classification methods considered in this study are:
The support vector machine (SVM) and
The logistic regression (LR).
Both tools have been introduced in section one. We aim to use them to carry out classification on
eighteen different datasets. In the end, we shall find out which tool outperforms the other, or whether
both classification procedures essentially tally in their performances.
3.0 Results from Data Analysis
Table 4.1 presents the accuracy rates of the SVM with a linear kernel (SVM_LK), the SVM with a
Gaussian kernel (SVM_GK) and logistic regression (LR) on each dataset. The accuracy rates were
obtained with a hold-out validation that splits each dataset into two parts (training and test sets):
70% of each dataset was used to train each classifier and the remaining 30% was used as the test set.
The table contains the results obtained from using the classifiers (SVM_LK, SVM_GK and LR) to carry
out classification on all the datasets employed in this analysis. The software used for this purpose
is R, and the R programs written for this research are contained in the Appendices.
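The 70%/30% split and the accuracy rate described above can be illustrated with a minimal sketch. It is written in Python rather than R purely for illustration and is not the authors' program (those are in the Appendices). The function names `train_test_split_70_30` and `accuracy_rate` and the `seed` parameter are hypothetical, and the classifier itself is left out.

```python
import random


def train_test_split_70_30(rows, seed=0):
    """Shuffle the rows and return (train, test) holding about 70% and 30% of them."""
    shuffled = list(rows)
    # A fixed seed (an assumption of this sketch) makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def accuracy_rate(true_labels, predicted_labels):
    """Fraction of test observations whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

In this sketch, each classifier would be fitted on the `train` part only, and `accuracy_rate` would be evaluated on its predictions for the `test` part.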
Based on Table 4.1, the performances of the classifiers appear to be neck and neck on most
datasets. In a few cases, the SVMs appeared to outperform LR. For instance, on the Bupa, Air quality
and Infant mortality datasets, the SVMs clearly outperformed LR. Conversely, on the Cmc and Saheart
datasets, LR outperformed the SVM using the linear kernel. The fact that the SVMs perform only
marginally better than LR on several other datasets underscores how powerful both classifiers are
on these datasets. The datasets on which the SVMs marginally outperformed LR include Banana,
Titanic, Heart disease, Monk, Tae, Pima, Haberman and Immunotherapy.
In general, one is left to assume that no classifier genuinely outperformed the others. However, to
ascertain whether this assumption is true, an ANOVA test will be carried out. The test will take into
consideration the following null and alternative hypotheses:
$H_0$: The average performances of the three classifiers do not differ from each other
($\mu_{\mathrm{SVM\_LK}} = \mu_{\mathrm{SVM\_GK}} = \mu_{\mathrm{LR}}$).
$H_1$: $H_0$ is not true.
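As a rough illustration of how a one-way ANOVA compares the average performances of the classifiers, the sketch below computes the F statistic from between-group and within-group variation. It is a minimal Python sketch, not the R program used in the study; the function name `one_way_anova_f` is hypothetical and the numbers in the usage lines are made-up toy values, not results from Table 4.1. The p-value is not computed here because it requires the F distribution, which statistical software such as R provides.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of samples, one per classifier in this setting.
    """
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy values for illustration only (not accuracy rates from the study).
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

A large F relative to the F distribution with (`df_between`, `df_within`) degrees of freedom would count as evidence against the null hypothesis. With three classifiers and one accuracy rate per classifier on each of the eighteen datasets, the degrees of freedom would be 2 and 51.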
For this test to proceed, we must ensure that the assumptions underpinning it are met. The
first is the normality assumption. A Shapiro-Wilk normality test (Appendix C.1) gives p-values of
0.9046 and 0.9119 for the SVMs and LR respectively, so the hypothesis that the performance data of
the SVMs and LR are normally distributed cannot be rejected.
Another assumption to investigate is that of equal variances. To check whether this assumption
holds, we carry out an F-test for the equality of two variances in R. The test shows that, at a
p-value of 0.5585 (Appendix C.2), the null hypothesis that both variances are equal, given the
datasets, cannot be rejected.
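The statistic behind an F-test for the equality of two variances is the ratio of the two sample variances. The sketch below shows that computation in Python under stated assumptions: the function name `variance_ratio_f` is hypothetical, and the code is an illustration rather than the R code of Appendix C.2. It returns only the statistic and its degrees of freedom; the p-value comes from the F distribution, which R supplies.

```python
from statistics import variance


def variance_ratio_f(sample_a, sample_b):
    """Return (F, df_a, df_b), where F is the ratio of the two sample variances."""
    var_a = variance(sample_a)  # unbiased sample variance (divides by n - 1)
    var_b = variance(sample_b)
    if var_b == 0:
        raise ValueError("second sample has zero variance; the ratio is undefined")
    return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
```

A ratio close to 1 is consistent with equal variances, while a ratio far from 1 in either direction counts against them.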
Prior to this, however, the boxplot in Figure 4.1 shows that the SVM using the Gaussian
kernel (SVM_GK) recorded the highest median, followed by the SVM using the linear kernel