/MIL

BE Degree Project


Designing a Data Discretizer for Classification Problems

Project Group:
*	Project guide – Prof. Bikash K. Sarkar
*	Team members – 
1.	Gourab Mitra (BE/1232/08) [ gourab dot mitra at live dot com ]
2.	Shashidhar Sundareisan (BE/1343/08) [ shashidhar dot sun at gmail dot com ]

Manuscript at https://arxiv.org/abs/1710.05091

ABSTRACT
	Most real-world problems involve continuous attributes, each of which can take a large number of values. With the exponential growth of data in database systems, operating on continuous values is more complex than operating on discrete values. Converting input data sets with continuous attributes into data sets with discrete attributes is therefore necessary to reduce the range of values. This is the goal of data discretization.
	Several machine learning algorithms have been developed. However, many of them cannot handle continuous attributes, whereas all of them can operate on discretized attributes. Even if an algorithm can handle continuous attributes, its performance can be significantly improved by replacing the continuous attributes with their discretized values. Discretized attributes also need less memory space and processing time than their non-discretized form, and processing continuous attributes produces much larger rules. On the other hand, a disadvantage of discretizing a continuous value is, in some cases, the loss of information available in the original data. For example, two different values in the same discretization interval are considered equal, even though they may lie at opposite extremes of the interval. Such an effect can reduce the precision of a machine learning algorithm. This loss of information, in fact, increases the rate of misclassification.
	In general, most discretization processes lead to a loss of information and fail to reduce the number of values of the discretized attributes.
	The aim of the present study is to design an efficient supervised approach that achieves high classification accuracy while reducing both the loss of information and the number of intervals per attribute.
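
The information loss described in the abstract can be seen in a small sketch. The code below is not the supervised method developed in this project. It assumes a plain unsupervised equal-width discretizer, and the class name `EqualWidthDiscretizer`, the method `binIndex`, and the range and bin count are hypothetical, chosen only for illustration.

```java
// Illustrative sketch only: unsupervised equal-width binning.
// This is NOT the project's supervised discretizer; all names here are hypothetical.
final class EqualWidthDiscretizer {
    private final double min;
    private final double max;
    private final int bins;

    EqualWidthDiscretizer(double min, double max, int bins) {
        if (bins < 1 || !(max > min)) {
            throw new IllegalArgumentException("need bins >= 1 and max > min");
        }
        this.min = min;
        this.max = max;
        this.bins = bins;
    }

    /** Returns the zero-based index of the interval containing value, clamped to the valid range. */
    int binIndex(double value) {
        double width = (max - min) / bins;
        int index = (int) Math.floor((value - min) / width);
        return Math.max(0, Math.min(bins - 1, index));
    }

    public static void main(String[] args) {
        // Four intervals of width 25 over the assumed attribute range [0, 100].
        EqualWidthDiscretizer d = new EqualWidthDiscretizer(0.0, 100.0, 4);
        System.out.println("25.0 -> interval " + d.binIndex(25.0));
        System.out.println("49.9 -> interval " + d.binIndex(49.9));
        System.out.println("50.0 -> interval " + d.binIndex(50.0));
    }
}
```

With these assumed settings, 25.0 and 49.9 fall into the same interval and become indistinguishable after discretization, although they lie at opposite ends of it. The much closer pair 49.9 and 50.0 is split across two intervals. A supervised approach tries to place the cut points so that such merges and splits hurt classification as little as possible.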