# Assignment 10: Parallel clustering

Due: 5:00pm, Friday, November 30. Value: 40 pts.

We studied the message-passing approach to concurrent computing, particularly with the actor model, but our examples have been “toy examples” without much sophistication. In this assignment we'll investigate how to adapt a more complex algorithm to message-passing.

## Clustering algorithm

In particular, we'll study an algorithm for clustering: Given a set of points pts that aren't evenly distributed, we want to partition the points into clusters so that each cluster is reasonably compact. One important application of clustering is in marketing, where one gets a massive amount of purchase data with the hope of clustering customers into target groups; another application is in search engines, where we may want to cluster documents to pick representatives of different categories.

Lloyd's algorithm is an popular process for computing clusters. Given the set P of points to be clustered and a number k of clusters to find, Lloyd's algorithm begins by randomly selecting k “centers.” From there the algorithm iterates the following refinement process: For each point from P, we associate the point with the center to which it is closest; then for each center we average the points associated with it, and we move the center to the average. We continue iterating through this the refinement process until it barely moves any of the centers (say, each moves at most 0.0001 away).

The following Scala method implements this algorithm for two-dimensional points.

```def cluster(pts: Array[(Double, Double)], numClusters: Int):         Array[(Double, Double)] = {     // pick initial centers randomly     val centers: Array[(Double, Double)] = new Array(numClusters)     for (i <- 0 until numClusters) {         centers(i) = pts(i)     }     var biggestMove = 1.0     while (biggestMove > 0.0001) {         val centerData: Array[(Int, Double, Double)] = new Array(numClusters)         for (i <- 0 until numClusters) {             centerData(i) = (0, 0.0, 0.0)         }         // Phase A: for each point, determine nearest center; update sums for center         for (pt <- pts) {             var bestCenter: Int = -1             var bestDist: Double = 1e100             for (i <- 0 until numClusters) {                 val dist = distSquared(pt, centers(i))                 if (dist < bestDist) {                     bestCenter = i                     bestDist = dist                 }             }             val (num, sumx, sumy) = centerData(bestCenter)             centerData(bestCenter) = (num + 1, sumx + pt._1, sumy + pt._2)         }         // Phase B: for each center, move to the average of associated points         biggestMove = 0.0         for (i <- 0 until numClusters) {             val (num, sumx, sumy) = centerData(i)             if (num > 0) {                 val old = centers(i)                 centers(i) = (sumx / num, sumy / num)                 val delta = distSquared(old, centers(i))                 if (delta > biggestMove) {                     biggestMove = delta                 }             }         }     }     return centers } ```

## A distributed version

Translating this algorithm to the actor model begins with reimagining its duties spread among several actors. We'll work with actors of three different types.

• The `ManagerActor` is a single actor that coordinates the work of the other actors. It sends `GroupClassify` messages to `GroupActor`s to kick off Phase A of each iteration, and it sends `CenterUpdate` messages to `CenterActor`s when it's time to initiate Phase B of each iteration.

• Each `GroupActor` is allocated a subset of the query points to be clustered, and the actor basically handles the algorithm's Phase A: When it is given a list of the algorithm's current centers (using a `GroupClassify` message), it determines which center is closest to each of its points, and it notifies each center's actor about the points closest to it. Once it has completed, it notifies the `ManagerActor` with a `ManagerGroupDone` message. (The manager counts these messages it receives, and once it receives as many as there are `GroupActor`s it proceeds to Phase B.)

• Each `CenterActor` manages an individual center from the algorithm. It receives `CenterAddPoint` messages from individual `GroupActor`s telling it about the points closest to it. Later, it will receive a `CenterUpdate` message from the `ManagerActor`, which tells it to update the center's position and then send back a `ManagerCenterDone` message. (Again, the manager counts how many such messages it receives, and once it receives as many as there are `CenterActor`s it proceeds back to Phase A — unless the maximum distance moved is below 0.0001.)

You can use the command “`scalac main.scala cluster.scala`” to compile your program. After compiling successfully, you can execute it with the command “`scala Main`”.