Definition of quality criterion to separate clusters

The quality criterion represents the statistical reliability of the cluster separation. The basic idea behind this criterion can be described as follows (Gerstengarbe & Werner, 1997):

Fig. 3. Schematic illustration of the clustering quality (red/blue: overlapping clusters; green: fully separated cluster)

After the local minimum has been reached, each cluster contains a generally different number of elements. Each element is defined by N parameters, i.e., it is located in an N-dimensional parameter space. As each cluster consists of a certain number of elements, each cluster forms a scatterplot of elements in this space. If the clustering leads to a local secondary minimum, overlaps occur between the scatterplots of individual clusters. The principle of this method is presented in Figure 3, which depicts the projection onto two parameters of the N-dimensional space.

The number of overlaps $O_{a,b}$ between the two clusters a and b, taken over the N parameters, can accordingly be defined as follows:

$$O_{a,b} = \sum_{l_a=1}^{L_a} \sum_{l_b=1}^{L_b} \sum_{i=1}^{N} O_{l_a,l_b,i}, \qquad a = 1, \dots, k; \quad b = 2, \dots, k \tag{6}$$

with

$$O_{l_a,l_b,i} = \begin{cases} 1, & p_{l_b,i} \ge p_{l_a,i} \\ 0, & p_{l_b,i} < p_{l_a,i} \end{cases} \tag{7}$$

under the additional condition

$$e_1 > e_2 > \dots > e_k \tag{8}$$

If $O_{a,b} = 0$, then the clusters a and b are completely separated from each other. The maximum possible number of overlaps is

$$O_{\max} = N L_a L_b \tag{9}$$

This number is reached if both clusters cover the same region within the N-dimensional space.
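The overlap count and its upper bound can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the sample data are hypothetical, and condition (8) is interpreted here as meaning that cluster a ranks above cluster b, so that an overlap is counted whenever a parameter value of an element of b is not below the corresponding value of an element of a.

```python
def count_overlaps(cluster_a, cluster_b):
    """Count overlaps between two clusters in the sense of equations (6) and (7).

    Each cluster is a list of elements; each element is a sequence of N
    parameter values. Assumption: cluster a ranks above cluster b, so an
    overlap is counted whenever p_b >= p_a for the same parameter.
    """
    overlaps = 0
    for elem_a in cluster_a:            # l_a = 1 .. L_a
        for elem_b in cluster_b:        # l_b = 1 .. L_b
            for p_a, p_b in zip(elem_a, elem_b):  # i = 1 .. N
                if p_b >= p_a:
                    overlaps += 1
    return overlaps


def max_overlaps(cluster_a, cluster_b):
    """Maximum possible number of overlaps, N * L_a * L_b (equation (9))."""
    n_params = len(cluster_a[0])
    return n_params * len(cluster_a) * len(cluster_b)


# Made-up illustrative data with N = 2 parameters per element.
cluster_a = [(5.0, 6.0), (6.0, 7.0)]
cluster_b = [(1.0, 2.0), (2.0, 1.0), (5.5, 0.5)]

print(count_overlaps(cluster_a, cluster_b))  # 1: only 5.5 >= 5.0 overlaps
print(max_overlaps(cluster_a, cluster_b))    # 12 = 2 * 2 * 3
```

In this toy example a single parameter value of cluster b reaches into the range of cluster a, so one overlap is counted out of twelve possible ones.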

Thus, by applying equations (6) to (9), the quality of the cluster separation can be determined statistically in the following steps:

• Calculation of the mean maximum possible number of overlaps $\bar{O}_{\max}$ as well as the mean actual number of overlaps $\bar{O}$ over all combinations of cluster pairs.

• Subsequently, a test is carried out to determine whether $\bar{O}$ and $\bar{O}_{\max}$ originate from the same population. Assuming a normal distribution, Student's t-test can be used. (Because of the necessary normalization of the parameters, a normal distribution is generally obtained.) The null hypothesis states that both mean values originate from the same population. The clusters can be separated only when the null hypothesis is rejected. Otherwise, the procedure is as follows:

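The ratio $v_{a,b}$ of the actual to the maximum possible number of overlaps is determined for each cluster pair:

$$v_{a,b} = \frac{O_{a,b}}{O_{\max}} \tag{10}$$

The $\chi^2$ test statistic is then calculated as

$$\chi^2 = \frac{(O_{a,b} - \bar{O})^2 \, (2\,O_{\max} - 1)}{(O_{a,b} + \bar{O}) \, (2\,O_{\max} - O_{a,b} - \bar{O})} \tag{11}$$

with the degree of freedom df = 1.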

The result of the test can be interpreted in the following way: If the calculated $\chi^2$ value is greater than a given threshold of significance, the frequency of overlaps exceeding the mean value $\bar{O}$ differs significantly from the X-value. The separation between the clusters is hence not statistically significant; in the opposite case, a statistically reliable separation exists.
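The pairwise test can be illustrated with a short Python sketch. It assumes that the test statistic has the standard fourfold form $\chi^2 = (O_{a,b} - \bar{O})^2 (2 O_{\max} - 1) / [(O_{a,b} + \bar{O})(2 O_{\max} - O_{a,b} - \bar{O})]$ with one degree of freedom; this form, the function names, the 5 % significance level and the input numbers are assumptions made for illustration only. The p-value uses the identity that, for df = 1, the upper tail of the $\chi^2$ distribution equals erfc(sqrt(x / 2)).

```python
import math

# Critical chi-square value for df = 1 at the 5 % level (assumed level).
CRITICAL_5_PERCENT_DF1 = 3.841


def chi_square_overlap(o_ab, o_mean, o_max):
    """Chi-square statistic (df = 1) comparing the overlap count of one
    cluster pair (o_ab) with the mean overlap count (o_mean), both taken
    relative to the maximum possible number of overlaps (o_max)."""
    numerator = (o_ab - o_mean) ** 2 * (2 * o_max - 1)
    denominator = (o_ab + o_mean) * (2 * o_max - o_ab - o_mean)
    if denominator == 0:
        # Both counts are 0 or both equal o_max: no difference to test.
        return 0.0
    return numerator / denominator


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with df = 1."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Made-up numbers: 2 overlaps for this pair, mean of 10, maximum of 24.
chi2 = chi_square_overlap(o_ab=2, o_mean=10, o_max=24)
exceeds_threshold = chi2 > CRITICAL_5_PERCENT_DF1
print(round(chi2, 3), round(p_value_df1(chi2), 4), exceeds_threshold)
```

The sketch only reports whether the statistic exceeds the chosen threshold; what that means for the separation of the cluster pair follows from the interpretation given in the text.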