So far, we have disparity measures for judging whether a group of centroids or regions is discriminative. However, we still do not have a search algorithm to find the set of points. As pointed out in Section 4.6, the problem is that the search space is exponential in the number of instantiated features. As previously discussed, two general approaches are possible: undirected and directed segmentation.
With undirected segmentation, any clustering or density estimation algorithm can be used. There are some good algorithms for clustering, especially if the number of clusters is known, such as K-means clustering. The K-means algorithm is shown in Figure 4.11. Alternatively, other algorithms, such as AutoClass [CS96] and SNOB [WD94], could be used.
The algorithm works by randomly selecting centroids, finding which elements are closest to each centroid, and then computing the mean of the points belonging to each centroid, which becomes the new centroid. Region membership is then checked again and new centroids are computed. This process continues until no points change their region membership.
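As an illustration, the following is a minimal sketch of this loop in Python (Euclidean distance and NumPy are assumed; the function and variable names are ours, not those used in Figure 4.11):
\begin{verbatim}
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch: points is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Randomly select k of the points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the region of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Stop when no point changes its region membership.
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # The mean of the points in each region becomes the new centroid.
        for j in range(k):
            members = points[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignment
\end{verbatim}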
The problems with the K-means approach are that: (a) we have to specify the number of clusters ahead of time; (b) it can be sensitive to the initial random choice of centroids; and (c) it does not make any use of class information. Note also that it exploits a special kind of locality or stability in the way it works: region membership is unlikely to change much between iterations, and the information is slowly refined to arrive at the final solution.
The K-means algorithm can be made to consider a range of numbers of clusters by the addition of a ``quality'' heuristic. The idea is relatively simple. Suppose we allow between 2 and 10 clusters. We first find the best possible clustering with 2 clusters, take a global measure of the quality of the clustering (e.g. the total distance of points from their centroids), then try again with 3 centroids, and so on, until we find the number of clusters that works best.
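A sketch of this outer loop, reusing the K-means sketch above and taking total distance from the centroids as the quality measure (the range of 2 to 10 clusters is only an example, as above), might look as follows:
\begin{verbatim}
import numpy as np

def total_distance(points, centroids, assignment):
    """Global quality measure: total distance of all points from their centroids."""
    points = np.asarray(points, dtype=float)
    return float(np.linalg.norm(points - centroids[assignment], axis=1).sum())

def choose_k(points, k_range=range(2, 11)):
    """Run K-means for each candidate number of clusters and keep the best result.
    Note: raw total distance always falls as k grows, so in practice the score is
    usually penalised for larger k (e.g. by adding a complexity term)."""
    best = None
    for k in k_range:
        centroids, assignment = kmeans(points, k)
        score = total_distance(points, centroids, assignment)
        if best is None or score < best[0]:
            best = (score, k, centroids, assignment)
    return best
\end{verbatim}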
Most importantly, it can be shown that, under the assumption that the data really is generated from centroids with a Gaussian distribution, the K-means algorithm will eventually converge to the correct answer. ``Correct'' in this case means the clustering that minimises the total distance of all points from the centroids.
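For concreteness, the quantity being minimised can be written (with notation that is ours) as
\[
J \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \left\| x_i - \mu_j \right\|^2
\]
where $C_j$ is the set of points currently assigned to centroid $\mu_j$; the squared Euclidean distance is the form usually minimised by K-means, and dropping the square gives the plain total distance referred to above.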
However, K-means does not make use of a disparity measure, so we would like to adapt it to take advantage of one. The problem is that we then no longer have this easy-to-define and ``smooth'' criterion. The K-means algorithm is an expectation-maximisation style algorithm; it gradually moves towards a better solution by using the previous iteration to ``inform'' the next iteration. However, in the case of directed segmentation, the solution space is not smooth: the disparity measure can vary irregularly, since it involves the choice of different centroids. There is no easy ``pattern'' in the solution space to exploit.
The easiest algorithm to implement would be a ``random search'' algorithm, shown in Figure 4.12. While it may at first seem that using a random search algorithm is not productive, work by Ho [HSV92] in the field of ordinal optimisation shows that random search is an effective means of finding near-optimal solutions. Random search was also used by Srinivasan [Sri00] in his Aleph ILP system, where it was shown to perform well even compared to complex search methods.
Ordinal optimisation can be applied to our problem as follows. Let us assume that our objective is to find a subset of points that is in the top $\alpha$ per cent of possible subsets (i.e. whose disparity measure ranks it in the top $\alpha$ per cent). If we test $n$ possible subsets randomly, then the probability of getting at least one instance in the top $\alpha$ per cent can be shown to be:
\[
P \;=\; 1 - \left(1 - \frac{\alpha}{100}\right)^{n}
\]
For example, if $\alpha = 1$ and $n = 1000$, then the probability of finding an instance in the top 1 per cent of possible cases is $1 - 0.99^{1000} \approx 0.99996$. In other words, if we try 1000 instances, then there is a better than 99.995 per cent chance that we will get one instance in the top 1 per cent. Hence by softening the goal (trying to find one of the top $\alpha$ per cent rather than the single best subset of points) we can be highly confident of getting a solution in the top percentile.
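A quick numerical check of this figure (plain Python; the function name is ours):
\begin{verbatim}
# Probability of at least one of n random samples landing in the top
# alpha fraction of all possible subsets: 1 - (1 - alpha)^n.
def prob_in_top(alpha, n):
    return 1.0 - (1.0 - alpha) ** n

print(prob_in_top(0.01, 1000))  # ~0.999957, i.e. better than 99.995 per cent
\end{verbatim}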
The random search algorithm of Figure 4.12 assumes a number of auxiliary functions. In particular, the function that evaluates a candidate set of centroids takes the centroids, a set of points and their class labels; using this information, it creates a contingency table as mentioned above, and then applies whatever disparity measure is required.
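As a concrete illustration, a sketch of such a directed random search is given below. The names are ours rather than the auxiliary functions of Figure 4.12, and the ``disparity'' helper uses a chi-squared statistic purely as an example of a disparity measure applied to the contingency table:
\begin{verbatim}
import numpy as np

def disparity(centroids, points, labels):
    """Build a contingency table of region membership versus class label and
    apply a disparity measure to it (a chi-squared statistic, as an example)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    region = dists.argmin(axis=1)
    classes = np.unique(labels)
    table = np.zeros((len(centroids), len(classes)))
    for r, c in zip(region, labels):
        table[r, np.searchsorted(classes, c)] += 1
    expected = (table.sum(axis=1, keepdims=True) *
                table.sum(axis=0, keepdims=True)) / table.sum()
    return float(((table - expected) ** 2 / np.maximum(expected, 1e-12)).sum())

def random_search(points, labels, k, n_trials=1000, seed=0):
    """Sample n_trials candidate sets of k centroids at random and keep the set
    with the best (largest) disparity measure."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    best_score, best_centroids = -np.inf, None
    for _ in range(n_trials):
        candidate = points[rng.choice(len(points), size=k, replace=False)]
        score = disparity(candidate, points, labels)
        if score > best_score:
            best_score, best_centroids = score, candidate
    return best_centroids, best_score
\end{verbatim}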
Random search is not a particularly efficient search algorithm compared to K-means. However, we will show that the advantage of using directed segmentation over undirected segmentation means that, in practice, directed segmentation is preferred.
Also note that the random search algorithm is highly parallelisable. The main for loop can be run in parallel across many different processors and/or computers, provided that the final if statement that updates the best instance seen so far is executed atomically. This is a significant advantage over the K-means algorithm, which is not easy to parallelise, as it is inherently iterative.
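For instance, the trials could be farmed out to a pool of worker processes, with only the final comparison performed serially. The sketch below reuses the illustrative ``disparity'' helper above and assumes Python's standard process pool purely for illustration:
\begin{verbatim}
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def one_trial(args):
    """Evaluate a single randomly chosen candidate set of centroids."""
    points, labels, k, seed = args
    rng = np.random.default_rng(seed)
    candidate = points[rng.choice(len(points), size=k, replace=False)]
    return disparity(candidate, points, labels), candidate

def parallel_random_search(points, labels, k, n_trials=1000, workers=8):
    """Farm the trials out to worker processes; only the final comparison
    (updating the best instance seen so far) is done serially."""
    points = np.asarray(points, dtype=float)
    jobs = [(points, labels, k, seed) for seed in range(n_trials)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_trial, jobs))
    return max(results, key=lambda result: result[0])
\end{verbatim}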
In either case, whether using undirected or directed segmentation, the centroids become synthetic features; in other words, we can now ask the question: ``does a particular stream have an instantiated feature lying in the region belonging to centroid X?'' This binary information can be used as a synthetic feature that can be passed on to a learner. We will explore this in practice in the next two chapters.
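As a final sketch, a binary feature vector of this kind could be computed as follows; here ``the region belonging to centroid X'' is interpreted as the nearest-centroid (Voronoi) region, which is an assumption on our part:
\begin{verbatim}
import numpy as np

def synthetic_features(stream_points, centroids):
    """Binary synthetic features for one stream: entry j is 1 if any of the
    stream's instantiated features (rows of stream_points) falls in the
    region of centroid j, i.e. has centroid j as its nearest centroid."""
    stream_points = np.asarray(stream_points, dtype=float)
    dists = np.linalg.norm(stream_points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    features = np.zeros(len(centroids), dtype=int)
    features[np.unique(nearest)] = 1
    return features
\end{verbatim}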