MACHINE LEARNING: AN EXPERIMENT COMPARING CLASSIFICATION VERSUS CLUSTERING ALGORITHMS

  • M. Loki Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
  • A. Mindila Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
  • W. Cheruiyot Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
Keywords: Classification, clustering, feature selection, machine learning

Abstract

There are four major ways to train a model in machine learning: supervised, unsupervised, semi-supervised, and reinforcement learning. The performance of data mining and analysis in intelligent systems depends on the algorithm performing the task. There is therefore great appeal in automatic approaches that can optimize the performance of any given learning algorithm for the problem at hand. This paper presents the results of an experimental study comparing various classification and clustering algorithms in the area of pattern recognition. Classification starts from a set of predefined classes and tries to establish which class a new object belongs to, while clustering tries to group a set of objects and find whether there is some relationship between them. In the context of machine learning, classification is supervised learning and clustering is unsupervised learning. The experiment compared the performance, accuracy, and model training time of the two kinds of learning algorithm. The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, in which, according to the banks' own mechanisms, 492 fraud cases were reported out of 284,807 transactions. The data used in this analysis was highly imbalanced and contained only numerical input variables, which were the result of a principal component analysis (PCA) transformation. The results indicate that the classification algorithms achieve a higher kappa and balanced accuracy at discriminating fraudulent cases than the clustering algorithms. This is attributed to the PCA transformation, which scales down the data and smooths it for analysis. However, this aspect leads to overtraining.
The classification algorithms were slower to interpret the numeric data, given that they were evaluated on a test set that was 30% of the size of the training set, whereas the clustering algorithms were applied to the training set.
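The abstract reports kappa and balanced accuracy rather than plain accuracy, which matters on data this imbalanced. The following is a minimal Python sketch of why, not the authors' code: the function names are hypothetical, and the confusion-matrix counts describe an imagined model that labels every transaction as legitimate, using only the class totals quoted above (492 fraud cases out of 284,807 transactions).

```python
# Illustrative only: hypothetical helper names and a hypothetical
# "always predict legitimate" model; not results from the study.

def accuracy(tp, fp, fn, tn):
    """Fraction of all transactions labelled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def balanced_accuracy(tp, fp, fn, tn):
    """Mean of sensitivity (fraud recall) and specificity (legitimate recall)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def cohen_kappa(tp, fp, fn, tn):
    """Agreement between predictions and labels, corrected for chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of predictions and labels.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (observed - expected) / (1 - expected)

# Class totals taken from the abstract.
total, fraud = 284_807, 492
legit = total - fraud

# Hypothetical model that never flags fraud: every fraud case is a
# false negative, every legitimate case a true negative.
counts = dict(tp=0, fp=0, fn=fraud, tn=legit)

print("accuracy:", round(accuracy(**counts), 4))
print("balanced accuracy:", balanced_accuracy(**counts))
print("kappa:", cohen_kappa(**counts))
```

Under these assumptions the trivial model scores roughly 99.8% accuracy, yet its balanced accuracy is 0.5 and its kappa is 0, which is why the two chance-aware measures are the more informative ones for fraud detection on this dataset.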

Published
2019-04-17