Terminology:

Regression: Regression is a statistical approach for determining the relation between two attributes.

Linear Indetermination coefficient: Using the linear indetermination coefficient regression quality can be determined. Higher the indetermination coefficient lower the quality of regression.

Clustering: Clustering is a process of partitioning the data points into meaningful subclasses called clusters.

Introduction:

The aim of writing this report is to give a detailed description about a new clustering algorithm that has been introduced in order to deal with partitioning the dataset into clusters under the condition that, the regression indetermination coefficient for each cluster results a minimal value. I am also going to discuss the proposal and implementation of this algorithm. This clustering algorithm is developed with generic programming approach. We use generic algorithms because it provides the sub-optimal solutions. Most of the clustering algorithms try to minimize the inter clustering and intra clustering distances, However a few applications such as cellular network planning require to decrease the indetermination coefficient. Regression is one of the most widely used techniques for data analysis. It has a wide application range from experimental data to modern data mining. Below I am discussing the how the regression is helpful in dividing the data into clusters with an example.

Idea of considering regression as a clustering method with an Example:

The population of the fish has been considered during the ichthyology research. We assume all the fish belongs under one species because it is impossible to describe for the attributes like size, shape, color etc., of fish. Under the condition – “effect of water temperature on the average swimming speed of the fish” we acquire a set of data after several measurements. Now we try to delineate (Figure 1) the data obtained considering speed of fish on Y axis and temperature on X axis and build a function. One of the approaches is to build a linear regression function for the data. After acquiring the function we try to find the relation that exists between the average swimming speed and temperature. The linear regression function obtained in this scenario helps to claim that “as the temperature increases there is a decrease in the average speed of fish” though it results a high error rate because of the residuals.

However, there exists another approach called clustering. Having a closer look at the Figure 1, we can come up with an idea of splitting the data (fish population) into two clusters. Now we try to obtain a linear regression functions for two clusters separately until we get a satisfactory (minimized) regression indetermination coefficient instead of getting a rough regression function for the whole population. After performing the clustering we assume that the fish belong to different species which react differently at different temperatures. The regression functions formed have a regression indetermination coefficient which is nearly equivalent to zero which means that the these functions are a good fit model for this experimental data. Below I have delineated (Figure 2) how well the regression lines fits the two clusters formed.

The above discussed procedure works well when the data set is small, also to provide a good fit model with minimized linear regression indetermination coefficient and much better relationship that exists between two attributes. Unfortunately the datasets with which we deal in real world i.e., in the data mining applications it is rare to find a small data set, so that we can make the analysis manually. We cannot just predict the number of clusters that should be formed from the dataset just by visualizing, so we require this clustering algorithm in order to find the accurate relation between attributes. This algorithm considers error rates to measure the quality.

Proposal :

When the dataset is very huge it is manually impossible to perform visualization and decide the number of number of clusters to be formed as we saw in the example1, so we need an automated system that can predict the number of clusters. Another approach that can be followed is to perform the algorithm repeatedly as we perform in k means clustering and consider the best solution. Regression function parameters fails to determine the distance between the dataset items, but can be used to determine the fitness function which is used as cluster quality measure. To further solve the problem we apply genetic problem approach using the cluster quality measure.

Assumptions:

1. Let’s assume there are two attributes in the given input dataset and the number of clusters that we have to divide is two.

2. Consider that D is the dataset containing objects {x,y} which have to be clustered.

D = {d1,d2,d3,d4……,dn}

3. Assume U is the binary vector which represents the distribution followed by the objects in D and split them into two sets C1 and C2 clusters.

U = {U1,U2,U3,….Un}

If Ui is 0, di belongs to C1

If Ui is 1, di belongs to C2

4. Let R be a quadruple consisting the coefficients (a,b,c,d) of the formed linear regression functions.(fr1 and fr2) . We can also consider R as a binary vector, by changing the coefficient values to binary.

5.We denote P as a set of all possible vectors and name it objects population

6.We denote Q as the set of all vectors of R and name it function population.

Using the above mentioned assumptions I am explaining the algorithm in 2 steps.

Step1: First we populate the P and Q with random data.

Every vector of P is compared with binary vector U and grouped as cluster1(C1) and cluster2 (C2) based on the binary vector U. Regression functions (f1 and f2) are computed for both the clusters. Regression Indetermination coefficients(fi1 and fi2) of regression functions are computed respectively. Then we find the quality measure of each cluster with fit function, the cluster with maximum quality is considered. We perform genetic operations on P vector using the fit function. The best individuals are selected from the output of genetic operations and form the regression function(fr1,fr2) and the coefficients are sent into vector R. We insert the vector R values into Q vector.

Step2:

Using the above formed regression indetermination coefficient (fr1,fr2) and the dataset D we find the Euclidean distance and based on this distance we divide them into clusters C1 and C2 and now find the regression function(f1,f2) and regression indetermination coefficients (fi1,fi2) and use the fit function to determine the quality. And perform the genetic operation on Q vector and get the individual Ux using the C1 and C2 distributions for the best individuals from Q and insert Ux into P

Both step1 and step2 are performed repeatedly until we reach the satisfactory indetermination coefficients.

Finally this algorithm mainly competes two populations, i.e.,

Object population P: This contains the individuals evolving the distribution of dataset objects.

Function population Q: This contains evolving the regression functions.

The best individuals are transferred between these two populations after the conversion.

When we have multiple dimensions:

In a multi-dimensional space it is hard to recognize the linear patterns. In multiple regression function it forms a hyper plane which is hard to visualize. The dependent variable is a vector and the independent variable is a matrix. As the dimensions increase it is polynomially hard to implement using the algorithm. Instead regression approximation algorithm can be used when we are dealing with multiple dimensions.

Conclusion: The algorithm discussed above cannot be considered as universal clustering methodology. When dealing with classical regression if unsatisfactory results are obtained this algorithm can be used to show the correlation between attributes. This algorithm is still in it first stage of development. A better way should be found for the automatic selection of clusters. There has be enhancements performed to the generic programming approach and fitness function.