Final Review A: Questions

Rfa.1.

What are CUDA and OpenCL for?

Rfa.2.

Characterize some differences of a typical GPU architecture versus a typical CPU.

Rfa.3.

Using the MIPS ll and sc instructions, write code that atomically swaps the value in register $s0 with the value stored at the memory address found in $s1.

Rfa.4.

Suppose we want to copy the value at the memory address found in $s0 to the memory address found in $s1. The natural approach is to use “lw $t0, ($s0)” followed by “sw $t0, ($s1)”. But in a multi-threaded application, other threads' work may be inserted between these two instructions, and those threads could change the memory at $s0; thus, the sw would be working with an old value when it is finally executed.

Using the MIPS ll and sc instructions, write code that accomplishes this copy atomically — i.e., ensures that the store works exactly with the load.

Rfa.5.

Symmetric multiprocessors are simpler to implement than distributed shared memory. Why, then, is distributed shared memory generally a better option when there are more than eight processors?

Rfa.6.

Suppose two cores execute the following threads having access to the shared-memory variables a and b.

initially, a is 0, b is 10
Thread A            Thread B
a = 1;              b = 12;
b = 11;             a = 2;

Assuming our cache is coherent, which of the following are possible final values for a and b?

       a     b
a.     1     11
b.     1     12
c.     2     11
d.     2     12
Rfa.7.

Describe the write invalidate protocol for cache coherency in shared-memory systems.

Rfa.8.

There are two basic ways of implementing snooping to support cache coherency in shared memory systems: write broadcast and write invalidate. What is the primary advantage of write invalidate over write broadcast?

Rfa.9.

Explain what the directory system for cache coherence entails. Why is it particularly applicable for a DSM (distributed shared memory) system?

Rfa.10.

In class we studied four potential network topologies: fully connected, grid, torus, and hypercube. Another alternative is a ring topology, where processors are essentially lined up in a circle. Suppose we have a ring topology over n processors.

a. What is the diameter of this topology?

b. What is the average distance between any two nodes of this topology?

c. How many direct connections does this topology include?

Final Review A: Solutions

Rfa.1.

CUDA and OpenCL both exist to provide an interface for writing programs that will execute directly on the graphics processing unit (GPU). A GPU can provide additional processing power, particularly for computational problems that are amenable to parallelization.

Rfa.2.
  • A GPU has several computation units (“thread processors”) that are all driven with the same instructions, though each has its own register set.

  • A GPU distinguishes between three types of memory: “private” memory available to just one computation unit, “local” memory that is available to one group of computation units (all driven by the same instruction stream), and “global” memory that is available across the processor. By contrast, a CPU has only registers (analogous to a GPU's private memory) and memory (analogous to a GPU's global memory).

  • A GPU includes aggressive multithreading, with 96 threads per multiprocessor being typical. By contrast, a CPU handles fewer threads at any one time, but it works more aggressively to discover instruction-level parallelism so that each individual thread can be completed more quickly.

Rfa.3.
again:  ll $t1, ($s1)         # load-linked the current value at ($s1)
        move $t2, $s0         # sc overwrites its source register, so work on a copy
        sc $t2, ($s1)         # try to store; $t2 becomes 1 on success, 0 on failure
        beq $t2, $zero, again # an intervening write occurred, so retry
        move $s0, $t1         # complete the swap: the old memory value goes into $s0
Rfa.4.
again:  ll $t1, ($s1)         # ll establishes the link on ($s1); the loaded value is unused
        lw $t0, ($s0)         # fetch the value to copy
        sc $t0, ($s1)         # the store succeeds only if ($s1) was untouched since the ll
        beq $t0, $zero, again # sc left 0 in $t0: the store failed, so retry
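The ll/sc retry idiom used in both solutions can also be sketched in a high-level language. The following Python model is illustrative only — the LLSCWord class and its methods are invented here to mimic the semantics: a store-conditional succeeds only if no store to the word has occurred since the matching load-linked.

```python
class LLSCWord:
    """Toy model of one memory word with load-linked/store-conditional."""

    def __init__(self, value):
        self.value = value
        self.version = 0          # bumped on every successful store

    def load_linked(self):
        # Return the value together with a "link" (the current version).
        return self.value, self.version

    def store_conditional(self, new_value, link):
        if link != self.version:  # some store intervened since the ll,
            return False          # so the sc fails and the caller must retry
        self.value = new_value
        self.version += 1
        return True

def atomic_swap(word, reg):
    """Rfa.3's pattern: swap reg with the word, retrying until the sc succeeds."""
    while True:
        old, link = word.load_linked()
        if word.store_conditional(reg, link):
            return old            # the old memory value ends up "in the register"

word = LLSCWord(5)
print(atomic_swap(word, 42))      # old memory value: 5
print(word.value)                 # new memory value: 42
```

The retry loop mirrors the `beq $t2, $zero, again` in the MIPS code: on a failed store-conditional, the whole read-modify-write is simply repeated.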
Rfa.5.

A symmetric multiprocessor relies on a single memory bus, shared among all processors, for communication with the one shared memory. Beyond about eight processors, the traffic on this bus and the load on main memory become a bottleneck that limits the system's speed. A distributed shared memory system instead allows direct communication between processors over a switching network, and each processor has fast access to its own local portion of memory.

Rfa.6.

All four possibilities could occur. It is easy to come up with interleavings that lead to (a.), (c.), and (d.): we could execute all of B before A, execute the first instruction of each thread before the second instruction of each, or execute all of A before B, respectively. Choice (b.), however, is sequentially inconsistent: no interleaving of the two instruction streams produces it. Nevertheless, if A and B occur at virtually the same time, the definition of coherence — which constrains only the order of writes to each individual location — could permit a to eventually become 1 while b eventually becomes 12.
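The claim that (b.) admits no sequentially consistent execution can be checked by brute force: enumerate every merge of the two threads that preserves each thread's program order, and collect the final states. A small Python sketch (the helper names here are invented for this example):

```python
# Each instruction is (variable, value); order within a thread is fixed.
thread_a = [('a', 1), ('b', 11)]
thread_b = [('b', 12), ('a', 2)]

def interleavings(xs, ys):
    """Yield all merges of xs and ys that preserve each list's order."""
    if not xs:
        yield ys
    elif not ys:
        yield xs
    else:
        for rest in interleavings(xs[1:], ys):
            yield [xs[0]] + rest
        for rest in interleavings(xs, ys[1:]):
            yield [ys[0]] + rest

finals = set()
for seq in interleavings(thread_a, thread_b):
    state = {'a': 0, 'b': 10}
    for var, val in seq:
        state[var] = val
    finals.add((state['a'], state['b']))

print(sorted(finals))   # (1, 12) -- choice (b.) -- never appears
```

Running this yields exactly the (a, b) pairs for choices (a.), (c.), and (d.); the pair (1, 12) is absent, confirming that (b.) requires a sequentially inconsistent execution.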

Rfa.7.

In the write invalidate protocol, whenever a processor wishes to write to a memory location, the cache first broadcasts onto the bus that other caches should invalidate their copies. From then on, the cache considers itself to have an exclusive copy of that data, so that subsequent writes to that memory location can reside purely in the cache. When another processor requests the memory location, or when the processor decides that it is time to use the cache location for another purpose, then the processor sends the actual value written onto the bus.
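The bookkeeping described above can be sketched as a toy simulation. Everything here (the Bus and Cache classes, the single-line model) is invented for illustration; a real protocol tracks state per cache line and handles write-back to memory, which this sketch omits.

```python
class Bus:
    """Shared bus connecting all caches to one memory word."""
    def __init__(self, memory_value):
        self.caches = []
        self.memory = memory_value

class Cache:
    def __init__(self, name, bus):
        self.name = name
        self.valid = False       # does this cache hold a copy of the line?
        self.value = None
        self.bus = bus
        bus.caches.append(self)

def read(cache):
    if not cache.valid:
        # Miss: fetch from a cache holding the (possibly newer) line, else memory.
        holders = [c for c in cache.bus.caches if c.valid]
        cache.value = holders[0].value if holders else cache.bus.memory
        cache.valid = True
    return cache.value

def write(cache, value):
    # Broadcast an invalidate: every other copy of the line is discarded.
    for other in cache.bus.caches:
        if other is not cache:
            other.valid = False
    cache.valid = True           # this cache now holds the exclusive copy,
    cache.value = value          # so subsequent writes stay cache-local

bus = Bus(memory_value=0)
c1, c2 = Cache('P1', bus), Cache('P2', bus)
read(c1); read(c2)    # both caches come to share the line
write(c1, 7)          # c1 invalidates c2's copy
print(read(c2))       # c2 misses and re-fetches the current value from c1: 7
```

Note that after the `write`, repeated writes by P1 touch only its own cache; P2 pays for the invalidation only when it next reads the line.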

Rfa.8.

Write invalidate's primary advantage is that it requires less traffic for multiple writes to the same cache line: With write broadcast, every write to a cache line must be broadcast onto the bus, but with write invalidate, only the first write needs a broadcast on the bus. (Of course, the value must be broadcast later when the cache decides no longer to include that memory location, or when another cache needs to read from that memory location.)

Rfa.9.

In a directory system, there is, for each line of memory, a central location that tracks which processors' caches hold a copy of that line. Having a central location allows a processor to determine how a line is currently shared. This is particularly applicable to a DSM system: since there is no shared bus over which messages are broadcast to all other processors, a writer needs to know exactly which other processors to communicate with.
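That bookkeeping can be sketched minimally as follows (all names here are invented for illustration). The directory records the sharer set for one line, so a write generates point-to-point invalidations only to the caches that actually hold a copy, rather than a bus-wide broadcast:

```python
# Toy directory for a single memory line: the home node records which
# processors hold a copy, so invalidations become point-to-point messages.
directory = {'sharers': set(), 'value': 0}
messages = []                       # log of messages the home node sends

def dir_read(proc):
    directory['sharers'].add(proc)  # record proc as a sharer of the line
    return directory['value']

def dir_write(proc, value):
    # Invalidate only the current sharers (other than the writer itself).
    for other in directory['sharers'] - {proc}:
        messages.append(('invalidate', other))
    directory['sharers'] = {proc}   # the writer now holds the sole copy
    directory['value'] = value

dir_read('P1'); dir_read('P2'); dir_read('P3')
dir_write('P4', 99)
print(sorted(messages))   # invalidations go only to P1, P2, and P3
```

Processors that never cached the line receive no traffic at all — the key saving over snooping in a system with no common bus.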

Rfa.10.

a. n / 2, since for any node the farthest node is halfway around the ring. (For odd n, the diameter is ⌊n / 2⌋.)

b. n / 4, since for any node there is one node that is 0 away, one that is n / 2 away, and two nodes at each integer distance strictly between 0 and n / 2. These distances sum to (n / 2)² = n² / 4, and dividing by the n nodes gives an average of n / 4.

c. n, since each of the n nodes contributes exactly one link to its clockwise neighbor (equivalently, each node has two connections and each connection is shared by two nodes).
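All three answers can be checked by brute force on concrete ring sizes. This sketch assumes n is even, matching the counting argument in part b.:

```python
def ring_metrics(n):
    """Diameter, average pairwise distance, and link count for an n-node ring."""
    dist = lambda i, j: min((i - j) % n, (j - i) % n)   # shorter way around
    distances = [dist(i, j) for i in range(n) for j in range(n)]
    diameter = max(distances)
    average = sum(distances) / len(distances)   # includes the distance-0 pairs
    links = n                                   # one link per node to its neighbor
    return diameter, average, links

for n in (8, 16):
    print(n, ring_metrics(n))   # diameter n/2, average n/4, n links
```

For n = 8 this reports a diameter of 4, an average distance of 2.0, and 8 links, agreeing with n / 2, n / 4, and n respectively.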