Assignment 10: Parallel clustering

Due: 5:00pm, Friday, November 30. Value: 40 pts.

We studied the message-passing approach to concurrent computing, particularly with the actor model, but our examples have been “toy examples” without much sophistication. In this assignment we'll investigate how to adapt a more complex algorithm to message-passing.

Clustering algorithm

In particular, we'll study an algorithm for clustering: Given a set of points pts that aren't evenly distributed, we want to partition the points into clusters so that each cluster is reasonably compact. One important application of clustering is in marketing, where one gets a massive amount of purchase data with the hope of clustering customers into target groups; another application is in search engines, where we may want to cluster documents to pick representatives of different categories.

Lloyd's algorithm is an popular process for computing clusters. Given the set P of points to be clustered and a number k of clusters to find, Lloyd's algorithm begins by randomly selecting k “centers.” From there the algorithm iterates the following refinement process: For each point from P, we associate the point with the center to which it is closest; then for each center we average the points associated with it, and we move the center to the average. We continue iterating through this the refinement process until it barely moves any of the centers (say, each moves at most 0.0001 away).

The following Scala method implements this algorithm for two-dimensional points.

def cluster(ptsArray[(DoubleDouble)], numClustersInt):
        Array[(DoubleDouble)] = {
    // pick initial centers randomly
    val centersArray[(DoubleDouble)] = new Array(numClusters)
    for (i <- 0 until numClusters) {
        centers(i) = pts(i)
    }

    var biggestMove = 1.0
    while (biggestMove > 0.0001) {
        val centerDataArray[(IntDoubleDouble)] = new Array(numClusters)
        for (i <- 0 until numClusters) {
            centerData(i) = (00.00.0)
        }

        // Phase A: for each point, determine nearest center; update sums for center
        for (pt <- pts) {
            var bestCenterInt = -1
            var bestDistDouble = 1e100
            for (i <- 0 until numClusters) {
                val dist = distSquared(ptcenters(i))
                if (dist < bestDist) {
                    bestCenter = i
                    bestDist = dist
                }
            }
            val (numsumxsumy) = centerData(bestCenter)
            centerData(bestCenter) = (num + 1sumx + pt._1sumy + pt._2)
        }

        // Phase B: for each center, move to the average of associated points
        biggestMove = 0.0
        for (i <- 0 until numClusters) {
            val (numsumxsumy) = centerData(i)
            if (num > 0) {
                val old = centers(i)
                centers(i) = (sumx / numsumy / num)
                val delta = distSquared(oldcenters(i))

                if (delta > biggestMove) {
                    biggestMove = delta
                }
            }
        }
    }
    return centers
}

A distributed version

Translating this algorithm to the actor model begins with reimagining its duties spread among several actors. We'll work with actors of three different types.

Your job

Two files are distributed as part of this assignment.

You can use the command “scalac main.scala cluster.scala” to compile your program. After compiling successfully, you can execute it with the command “scala Main”.

The distributed program's output first includes the centers expected based on how the points are generated; you can basically ignore these. Then the distributed program executes the sequential code provided above; the output will include the centers as it completes each iteration, followed by the final result. Finally, it calls your parallel version, and it displays the centers computed by that. If your implementation is correct, the final parallel result should match the final sequential result almost exactly. (There may be minor differences in the twelfth decimal place or later due to floating-point arithmetic issues.)