Final: Questions

Xf.1.

[8 pts] Define the term instruction-level parallelism (ILP), and include at least one example of a processor design technique illustrating it.

Xf.2.

[10 pts] Suppose we have a stack architecture including LOAD (direct addressing), STORE (direct addressing), ADD, and MULT instructions. Given that x is stored at address 1000 and y is stored at address 1004, write a sequence of instructions corresponding to the C statement “x = x * x + y * y;”.

Xf.3.

[10 pts] Translate the following MIPS “program” into machine language (binary).

loop: lw $t0, 0($s0)
      addi $s0, $s0, -4
      bne $s0, $s1, loop
      nop
Xf.4.

[12 pts] The MIPS subroutine below accepts a subroutine f as its parameter and counts how many times f can be applied to its own result before reaching zero, starting with 1000. For example, if f is integer division by 10, it would return 4 (1000 / 10 / 10 / 10 / 10 = 0).

Though this is an admirable attempt, testing demonstrates that it does not work correctly. Explain what is wrong, and show how to repair the code.

repeat_to_zero:
        move $t0, $a0          # place f into $t0
        li $t1, 1000           # $t1 is number we're currently at
        li $t2, 0              # $t2 counts number of applications of f
cnext:  move $a0, $t1          # call f(current), result into $t1
        jalr $t0
        nop
        move $t1, $v0
        addi $t2, $t2, 1       # increment counter
        bne $t1, $zero, cnext  # repeat if current is still not zero
        move $v0, $t2          # return counter
        jr $ra
        nop
Xf.5.

[8 pts] Suppose we decide to add a new instruction pop to the MIPS instruction set to compute the number of 1 bits in the binary representation of a register's value, placing the result into a register. In an assembly program, it would appear as “pop $t0, $s1”. Define a system for encoding such an instruction while keeping our circuit implementation simple; explain your answer.

Xf.6.

[8 pts] Why do some instruction architectures choose to encode instructions into 16 bits, even though this severely restricts both the number of operations that the architecture can support and the number of registers?

Xf.7.

[8 pts] Assume that we can divide a pipeline for processing instructions into as many equal-sized pipeline stages as we want. Why might we decide it is better to have fewer pipeline stages than the maximum possible?

Xf.8.

[12 pts] Suppose we have a 6-stage pipeline that can complete one instruction each clock cycle in the absence of any jumps or branches. (We assume there are no data or structural hazards.) For jumps or branches, the destination of the jump is known at the end of the pipeline's third stage, whereas the decision of whether a branch is taken is determined at the end of the pipeline's fourth stage.

Assume that 5% of instructions in a typical program are jump instructions, while 15% of instructions are branch instructions, of which 2/3 are taken.

a. What is the average number of clock cycles per instruction if we always stall for every jump or branch until the next instruction's location is known? Show your work.

b. What is the average number of clock cycles per instruction if we speculatively execute assuming that branches are not taken? Show your work.

c. What is the average number of clock cycles per instruction if we speculatively execute assuming that branches are taken? Show your work.

Xf.9.

[10 pts] When executing the following program, which happens to count the number of e's in the word eve, the first six branches occur as listed below. (Each entry gives the branch, A or B, and whether it was taken, Y or N.)

       li $t0, 0     ; count
       li $t1, 101   ; ASCII 'e'
loop:  lb $t2, ($s0)
       addi $s0, $s0, 1
       bne $t1, $t2, noadd   ; branch A
       addi $t0, $t0, 1
       j loop
noadd: bne $t2, $zero, loop  ; branch B
1:AN   2:AY   3:BY   4:AN   5:AY   6:BN

a. Suppose we predict branches using a local prediction buffer, where each entry is a two-bit predictor initially configured to predict “weakly no” (that is, predict “no” though the last observation was “yes”). Which branches will be predicted incorrectly? Show your work.

b. Suppose we predict branches using a three-bit global branch history, where each entry is again a two-bit predictor initially configured to predict “weakly no.” The previous several branches have been “yes.” Which branches will be predicted incorrectly? Show your work.

Xf.10.

[10 pts] Give an example of a MIPS instruction sequence where out-of-order execution improves performance on a 5-stage pipeline. Explain how out-of-order execution would help.

Xf.11.

[8 pts] Define the term simultaneous multithreading in processor design.

Xf.12.

[10 pts] Assume we have a vector processor with vector registers of length n where n ≥ 16. The processor has one load/store unit (latency 7) and one addition unit (latency 3). Assuming the processor does not avoid latency between consecutive instructions, but it does support chaining, complete the following table showing in which clock cycles each indicated element is in an execution unit.

instruction           first elt   last elt
lv      V1, ($s0)     0…
lv      V2, ($s0)
addvv.d V3, V1, V2
sv      V3, ($s0)
Xf.13.

[8 pts] In a graphics processing unit (GPU), distinguish the terms “private memory,” “local memory,” and “global memory.”

Xf.14.

[10 pts] Using the MIPS ll and sc instructions, write a fragment of MIPS assembly code that atomically waits until the value at the memory address contained in $s0 is 0 before setting that memory to 1. (This is a basic “lock” operation.)

Xf.15.

[10 pts] Define the snoopy approach to cache coherence, and explain why it is usually applicable only to systems with fewer than eight cores.

Xf.16.

[8 pts] Distinguish the terms cache coherency and cache consistency in multicore processors.

Xf.17.

[10 pts] Suppose two cores execute the following threads having access to the shared-memory variables a and b.

initially, a is 1 and b is 3
Thread A           Thread B
Store 6 into b.    Load a into register $t0.
Store 2 into a.    Load b into register $t1.
                   Display $t0 + $t1.

Assuming our cache is sequentially consistent, what are the possible results displayed by Thread B? Explain each answer.

Xf.18.

[10 pts] Suppose we have a set of text files, each corresponding to a class offered at Hendrix, where each text file includes one student ID per line followed by that student's letter grade in the course. Describe how to use MapReduce/Hadoop to compute each student's GPA (grade point average): We average the grades where an A is worth 4, B 3, C 2, D 1, and F 0. (You can assume all grades are A/B/C/D/F, that all courses have the same credit value, and that students never re-take a course to replace a former grade.) For example, if a student earned an A in one course, a B in two other courses, and an F in another course, then that student's GPA is computed as (4 + 3 + 3 + 0) / 4 = 2.5; we divide by four since this student has taken four courses.

Final: Solutions

Xf.1.

Instruction-level parallelism refers to opportunities to execute simultaneously different instructions from the same instruction stream, even though the stream is written as if it is sequential. Pipelined and superscalar architectures are two techniques that use ILP to enhance processor performance.

Xf.2.
LOAD 1000     # push x                (stack: x)
LOAD 1000     # push x again          (stack: x, x)
MULT          # multiply top two      (stack: x*x)
LOAD 1004     # push y                (stack: x*x, y)
LOAD 1004     # push y again          (stack: x*x, y, y)
MULT          # multiply top two      (stack: x*x, y*y)
ADD           # add top two           (stack: x*x + y*y)
STORE 1000    # pop the result into x
Xf.3.
loop: lw $t0, 0($s0)      100011 10000 01000 00000 00000 000000
      addi $s0, $s0, -4   001000 10000 10000 11111 11111 111100
      bne $s0, $s1, loop  000101 10000 10001 11111 11111 111101
      nop                 000000 00000 00000 00000 00000 000000
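
As a quick check of these encodings, the following helper packs the I-type fields; this Python sketch (the function name and layout are ours, not part of the exam) prints each word in binary.

def itype(opcode, rs, rt, imm):
    # opcode occupies bits 26-31, rs bits 21-25, rt bits 16-20,
    # and the immediate fills the low 16 bits (two's complement).
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# lw $t0, 0($s0): opcode 35, rs = $s0 = 16, rt = $t0 = 8
print(f"{itype(35, 16, 8, 0):032b}")    # 10001110000010000000000000000000
# addi $s0, $s0, -4: opcode 8, rs = rt = $s0 = 16
print(f"{itype(8, 16, 16, -4):032b}")   # 00100010000100001111111111111100
# bne $s0, $s1, loop: opcode 5, rs = 16, rt = 17, offset -3 words
print(f"{itype(5, 16, 17, -3):032b}")   # 00010110000100011111111111111101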
Xf.4.

Calling f in the jalr instruction can potentially change all caller-save registers (and it will certainly change $ra). But the subroutine as written depends upon $t0, $t1, $t2, and $ra all retaining their original value, even though they are caller-save registers.

A partial solution is to use callee-save registers $s0, $s1, and $s2 wherever the original code uses $t0, $t1, and $t2. But using these requires that the subroutine restore these registers' original values (as well as $ra) before returning. So another required step would be to add code at the beginning to push the old values onto the stack, and to add code before “jr $ra” to restore the pushed values. The added code at the beginning would be:

repeat_to_zero:
        addi $sp, $sp, -16
        sw $ra, 12($sp)
        sw $s0, 8($sp)
        sw $s1, 4($sp)
        sw $s2, 0($sp)
        move $s0, $a0       # original first instruction, using $s0 not $t0

And at the end would be:

        move $v0, $s2       # original next-to-last instruction, using $s2 not $t2
        lw $ra, 12($sp)
        lw $s0, 8($sp)
        lw $s1, 4($sp)
        lw $s2, 0($sp)
        addi $sp, $sp, 16
        jr $ra
        nop

Xf.5.

Since this instruction has only register arguments, it should be an R-type instruction. The first six bits should therefore be 0's. Looking at the available R-type funct codes, the last six bits might be 0x28 (which places it next to other bit-twiddling instructions like nor). The R-type instructions all place the destination register ($t0 in the example) in bits 11–15, while the R-type instructions that use just one source register place that register into bits 21–25. Bits 16–20 and 6–10 should be 0's.

For pop $t0, $s1, then, the encoding would be 000000 10001 00000 01000 00000 101000.
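
The same check for this encoding, as a small Python sketch (again ours, not part of the exam): the fields are packed exactly as described above.

def pop_encoding(rd, rs):
    # R-type: opcode 0, rs in bits 21-25, rt unused (0), rd in bits 11-15,
    # shamt unused (0), and the proposed funct code 0x28 in the low six bits.
    return (rs << 21) | (rd << 11) | 0x28

# pop $t0, $s1: rd = $t0 = 8, rs = $s1 = 17
print(f"{pop_encoding(8, 17):032b}")    # 00000010001000000100000000101000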

Xf.6.

A 16-bit encoding allows programs to be encoded in less memory. A 32-bit encoding tends to lead to fewer instructions per program, and hence a greater opportunity for increased performance; but for low-cost systems, performance is not as high a priority as cost and power. (Less memory in the system also means a lower power requirement.)

Xf.7.

One reason is that a more finely separated pipeline requires running the clock faster to get any advantage, but running the clock beyond 4 GHz is basically impossible due to the amount of heat generated by the fast-switching transistors.

Another is that increasing the pipeline length does not increase the clock speed proportionately, due to the delay introduced by each pipeline register. In itself, this is not a problem, until one considers data and control hazards that lead to pipeline delays.

For instance, suppose that 10% of instructions turn out to be mispredicted branches, and branch predictions aren't determined until halfway through the pipeline. If an original instruction takes T time and a pipeline register introduces a delay of p, then a 10-stage pipeline would have a clock period of T/10 + p; but 10% of instructions would complete after 5 clock cycles, so the average clocks per instruction would be 0.9 ⋅ 1 + 0.1 ⋅ 5 = 1.4, for an average time per instruction of 1.4 (T/10 + p) = 0.14 T + 1.4 p. For a 20-stage pipeline, a clock cycle takes T/20 + p time, while the average clocks per instruction is 0.9 ⋅ 1 + 0.1 ⋅ 10 = 1.9, for an average time of 1.9 (T/20 + p) = 0.095 T + 1.9 p. If p is at least 9% of T, then the 10-stage pipeline will end up completing instructions more quickly.
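
To make this arithmetic concrete, here is the same computation as a short Python sketch (illustrative only; the function and its parameters are ours, under the paragraph's assumptions):

def time_per_instruction(T, p, stages, mispredict_rate=0.10):
    clock = T / stages + p      # cycle time, including the register delay p
    penalty = stages // 2       # mispredictions resolve halfway down the pipeline
    cpi = (1 - mispredict_rate) + mispredict_rate * penalty
    return cpi * clock

# At p = 0.09 T the two designs tie; any larger p favors the shorter pipeline.
T, p = 1.0, 0.09
print(time_per_instruction(T, p, 10))    # 0.14 T + 1.4 p, about 0.266
print(time_per_instruction(T, p, 20))    # 0.095 T + 1.9 p, about 0.266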

A final reason, though not a major concern, is that the forwarding logic becomes increasingly complex for longer pipelines, as logic must exist to forward data from essentially every stage to every earlier stage. Eventually this logic becomes too unwieldy to be a practical use of transistors.

Xf.8.

a. Every jump costs 3 cycles (its destination is known at the end of stage 3), and every branch costs 4 cycles (taken or not, we wait until the end of stage 4): 0.8 * 1 + 0.05 * 3 + 0.15 * 4 = 1.55.

b. Not-taken branches (5% of instructions) now cost just 1 cycle, while taken branches (10%) still cost 4: 0.8 * 1 + 0.05 * 3 + 0.10 * 4 + 0.05 * 1 = 1.4.

c. Taken branches now cost only 3 cycles (the destination is known at the end of stage 3), but not-taken branches cost 4, since we do not discover that the branch falls through until the end of stage 4: 0.8 * 1 + 0.05 * 3 + 0.10 * 3 + 0.05 * 4 = 1.45.
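
The three averages can be checked with a quick Python sketch (ours, not exam material); each call passes the cycle costs for jumps, taken branches, and not-taken branches.

def cpi(jump_cost, taken_cost, not_taken_cost):
    # 80% ordinary instructions, 5% jumps, 10% taken branches, 5% not-taken branches
    return 0.80 * 1 + 0.05 * jump_cost + 0.10 * taken_cost + 0.05 * not_taken_cost

print(cpi(3, 4, 4))   # a. 1.55: stall for every jump and branch
print(cpi(3, 4, 1))   # b. 1.40: speculate that branches are not taken
print(cpi(3, 3, 4))   # c. 1.45: speculate that branches are taken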

Xf.9.

a. 2, 3, 5, 6. The predictions would be N, N, N, N, N, Y. The table below shows how each branch's two-bit predictor changes after each prediction; an entry like “SN/1” means the predictor is in state “strongly no” after branch event 1.

A:  WN/0  SN/1  WN/2  SN/4
B:  WN/0  SY/3  WY/6

b. 2, 3, 6. The predictions would be N, N, N, N, Y, Y. The table below shows how the predictor for each history entry changes after each prediction; row “YNY”, for instance, represents the situation where the most recent branch outcome was “yes” (the rightmost letter), the one before it “no”, and the one before that “yes”.

NNN:  WN/0
NNY:  WN/0
NYN:  WN/0
NYY:  WN/0  SN/4
YNN:  WN/0
YNY:  WN/0  SY/3  WY/6
YYN:  WN/0  SY/2  SY/5
YYY:  WN/0  SN/1
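
Both parts use the same two-bit state machine, which can be replayed mechanically. The Python sketch below (the state encoding is ours, not from the exam) reproduces part a; part b works the same way, except the dictionary would be keyed by the three-outcome global history rather than by branch name.

# State is (prediction, last outcome); "WN" predicts "no" even though the
# last observation was "yes," matching the problem's initial state.
NEXT = {
    ("SN", "N"): "SN", ("SN", "Y"): "WN",
    ("WN", "N"): "SN", ("WN", "Y"): "SY",
    ("WY", "N"): "SN", ("WY", "Y"): "SY",
    ("SY", "N"): "WY", ("SY", "Y"): "SY",
}

def mispredicted(events):
    state = {}                              # one two-bit predictor per branch
    wrong = []
    for i, (branch, outcome) in enumerate(events, start=1):
        s = state.get(branch, "WN")         # every predictor starts at "weakly no"
        predicted = "Y" if s in ("SY", "WY") else "N"
        if predicted != outcome:
            wrong.append(i)
        state[branch] = NEXT[(s, outcome)]  # update on the actual outcome
    return wrong

events = [("A", "N"), ("A", "Y"), ("B", "Y"),
          ("A", "N"), ("A", "Y"), ("B", "N")]
print(mispredicted(events))                 # [2, 3, 5, 6], as in part a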
Xf.10.

Suppose we had the following instruction sequence.

lw   $t1, 0($a0)
ori  $t2, $t1, 1
addi $a0, $a0, 4

In a pipelined processor doing in-order execution, the lw instruction would reach the MEM stage at the same time that the ori instruction reaches the EX stage. The ori instruction would have to stall at this point, because the lw instruction will not have determined the value for $t1 (which the EX stage needs to execute the ori instruction) until the MEM stage completes. This stall adds one clock cycle to the time needed to complete this sequence.

This wasted clock cycle could be avoided, however, if the processor executes the addi instruction first. Then the addi instruction (which doesn't depend on the result of the lw instruction) could complete the EX stage at the time that the lw instruction completes the MEM stage. The ori instruction would complete the EX stage at the same time as it otherwise would have done.

Xf.11.

In a simultaneous multithreaded processor, the processor is aware of different threads of execution (instruction queues manipulating different sets of registers), and instructions from different threads can be dispatched into execution units in the same clock cycle.

Xf.12.
instruction           first elt        last elt
lv      V1, ($s0)     0…6              n − 1…6 + n − 1
lv      V2, ($s0)     n…6 + n          2 n − 1…6 + 2 n − 1
addvv.d V3, V1, V2    6 + n…8 + n      6 + 2 n − 1…8 + 2 n − 1
sv      V3, ($s0)     2 n…6 + 2 n      3 n − 1…6 + 3 n − 1
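
The entries follow a regular pattern, made explicit by the Python sketch below (ours, under the solution's conventions): elements enter a unit one per cycle, loads and stores have a 7-cycle latency, the add chains off the second load, and the store begins once the load/store unit frees up at cycle 2n (by which point V3's first element is ready, since n ≥ 16).

def print_timing(n, load_lat=7, add_lat=3):
    # (instruction, first-element entry cycle, last-element entry cycle, latency)
    rows = [
        ("lv      V1", 0,                n - 1,                load_lat),
        ("lv      V2", n,                2 * n - 1,            load_lat),
        ("addvv.d V3", n + load_lat - 1, 2 * n + load_lat - 2, add_lat),
        ("sv      V3", 2 * n,            3 * n - 1,            load_lat),
    ]
    for name, first, last, lat in rows:
        print(f"{name}  first {first}..{first + lat - 1}  last {last}..{last + lat - 1}")

print_timing(16)   # reproduces the table above with n = 16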
Xf.13.

“Private” memory is available to just one computation unit (a “stream processor”). “Local” memory is available to one group of computation units (a “multiprocessor,” which drives multiple stream processors with the same instruction stream). And “global” memory is available across the whole GPU (and to the CPU).

Xf.14.
lock: ll $t0, ($s0)         # load-linked: read the lock's current value
      bne $t0, $zero, lock  # nonzero means someone holds the lock; retry
      li $t1, 1
      sc $t1, ($s0)         # try to store 1; $t1 becomes 1 on success, 0 on failure
      beq $t1, $zero, lock  # the store-conditional failed, so start over
Xf.15.

The snoopy approach involves connecting each core to a common bus, so that all communication between a core and memory is visible to all cores. Individual cores can update their private caches based on the traffic they see on the bus.

This system tends to break down beyond roughly eight cores: each core needs a certain amount of memory bandwidth, but a bus is limited to handling only one message per clock cycle, so the amount of traffic the bus can carry becomes a bottleneck as cores are added.

Xf.16.

Cache coherency considers each memory address in isolation, looking for a guarantee that the sequence of accesses to that individual address is reasonable; it also allows for some fudging where individual processors' accesses are “near-simultaneous.” Cache consistency demands a more precise ordering of accesses between different threads, and it considers the ordering of accesses across all memory addresses together.

Xf.17.

4: We could execute all of B before executing any of A, so B would get the initial values 1 and 3 for a and b.

7: We could execute B's first instruction (getting 1), then A's first instruction (setting b to 6), then B's second instruction (getting 6).

8: We could execute all of A before executing any of B, so B would get the values for a and b as set by A: 2 + 6.
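
We can confirm these are the only possibilities by brute force. The Python sketch below (illustrative only) enumerates every sequentially consistent interleaving, that is, every ordering that preserves each thread's own program order.

import itertools

A = ["store b=6", "store a=2"]              # Thread A's program order
B = ["load a", "load b", "display"]         # Thread B's program order

results = set()
for order in itertools.permutations(A + B):
    # keep only interleavings that preserve each thread's own order
    if [op for op in order if op in A] != A or [op for op in order if op in B] != B:
        continue
    a, b = 1, 3                             # initial values
    t0 = t1 = None
    for op in order:
        if op == "store b=6": b = 6
        elif op == "store a=2": a = 2
        elif op == "load a": t0 = a
        elif op == "load b": t1 = b
        else: results.add(t0 + t1)          # display $t0 + $t1
print(sorted(results))                      # [4, 7, 8]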

Xf.18.

In the Map phase, we would parse the file and emit (studentID, gradeValue) for each student/grade pair found in the file. (We'd translate the letter grade into a number between 0 and 4 at this point.)

In the Reduce phase, we'd take the list of gradeValues received and compute both the sum of the list and the length of the list, reporting the quotient as our result for that student.
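
In outline, the two phases might look like the Python sketch below (the function names and the line format are illustrative, not prescribed by the problem).

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def map_class_file(lines):
    # Map: each line "studentID grade" becomes a (studentID, gradeValue) pair.
    for line in lines:
        student_id, letter = line.split()
        yield student_id, GRADE_POINTS[letter]

def reduce_student(student_id, grade_values):
    # Reduce: average all the grade values gathered for one student.
    grades = list(grade_values)
    return student_id, sum(grades) / len(grades)

print(reduce_student("s123", [4, 3, 3, 0]))   # ('s123', 2.5)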