CS25 - Lab 2
Part 6: Write-Up



Assignment:

Write a program to demonstrate the effect of a superscalar architecture. Your program should at least consider the following four cases.

  1. Baseline I: No dependencies at all, just add constants to operands, or add operands so that each operation is independent.
  2. Baseline II: Operation 4 is dependent upon operation 1.
  3. Dependency 3: Operation 3 is dependent upon operation 1.
  4. Dependency 2: Every operation is dependent upon the previous one.
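
As a rough illustration of the four cases, the loop bodies might look like the sketch below. The function names, the constants, and the choice of four unsigned adds per iteration are assumptions made for this example, not the code that was actually handed in.

```c
/* Illustrative loop bodies for the four cases; all names are hypothetical. */

/* Baseline I: each add touches its own variable, no operation reads another's result. */
unsigned baseline1(unsigned n)
{
    unsigned i, a = 0, b = 0, c = 0, d = 0;
    for (i = 0; i < n; i++) {
        a += 1;   /* operation 1 */
        b += 2;   /* operation 2 */
        c += 3;   /* operation 3 */
        d += 4;   /* operation 4 */
    }
    return a + b + c + d;   /* use every result so none is dead code */
}

/* Baseline II: operation 4 reads the result of operation 1. */
unsigned baseline2(unsigned n)
{
    unsigned i, a = 0, b = 0, c = 0, d = 0;
    for (i = 0; i < n; i++) {
        a += 1;
        b += 2;
        c += 3;
        d += a;   /* depends on operation 1 */
    }
    return a + b + c + d;
}

/* Dependency 3: operation 3 reads the result of operation 1. */
unsigned dependency3(unsigned n)
{
    unsigned i, a = 0, b = 0, c = 0, d = 0;
    for (i = 0; i < n; i++) {
        a += 1;
        b += 2;
        c += a;   /* depends on operation 1 */
        d += 4;
    }
    return a + b + c + d;
}

/* Dependency 2: every operation reads the result of the one before it. */
unsigned dependency2(unsigned n)
{
    unsigned i, a = 0, b = 0, c = 0, d = 0;
    for (i = 0; i < n; i++) {
        a += 1;
        b += a;   /* depends on operation 1 */
        c += b;   /* depends on operation 2 */
        d += c;   /* depends on operation 3 */
    }
    return a + b + c + d;
}
```

An optimizing compiler can collapse loops this simple, so a sketch like this only measures anything when built with optimization off or low (for example `gcc -O0 -S` to get the assembly) and after checking that the adds are still in the loop.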

You may also want to look at mixing floating point and integer operations to see the effect of having floating point units as well as integer ALUs. In other words, if you have N ALUs and M FPUs, what is the relative change in computation time for adding another ALU or FPU operation so long as the number of each is less than N and M, respectively? Compare that to the change in computation time for adding another ALU or FPU operation when the number of each is greater than N or M, respectively.
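
One way to set up that comparison is to time a loop of four integer adds against a loop where two of the adds are replaced by floating point adds. The sketch below is only an assumption about how such a pair could look; the function names and constants are made up for illustration.

```c
/* Hypothetical pair of loops for comparing integer-only and mixed work. */

/* Four independent integer adds: all of them compete for the integer ALUs. */
unsigned int_only(unsigned n)
{
    unsigned i, w = 0, x = 0, y = 0, z = 0;
    for (i = 0; i < n; i++) {
        w += 1;
        x += 2;
        y += 3;
        z += 4;
    }
    return w + x + y + z;
}

/* Two integer adds and two floating point adds: the work is split
   between the integer ALUs and the FPUs. */
double mixed_int_fp(unsigned n)
{
    unsigned i, x = 0, y = 0;
    double f = 0.0, g = 0.0;
    for (i = 0; i < n; i++) {
        x += 1;      /* integer add */
        y += 2;      /* integer add */
        f += 0.5;    /* floating point add */
        g += 0.25;   /* floating point add */
    }
    return (double)(x + y) + f + g;
}
```

Adding one more add of either kind to the loop body and re-timing would then show whether the extra operation still fits in a free unit.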

Note that, in order to create dependencies you have to fool the optimization that goes on in the compiler. The compiler will sometimes re-order operations to take advantage of the superscalar abilities of the processor. Your program should be a simple for loop that contains at least 4 operations, if not 8, using numerous variables (all local and not pointers). If you find that you are not getting different results for the above cases, then you are not fooling the compiler and it is figuring out a way to optimize the code for you. Students in the past have actually disassembled their code to get at the exact instructions the compiler created in this situation. You may want to do something like this just to prove that your results are due to the superscalar architecture and not something else.
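
A common trick for keeping the compiler from folding the whole loop into a constant is to start the variables from a value that is only known at run time and to use the final result afterwards. The sketch below assumes a hypothetical `seed` argument (which could come from `argc` or the command line) and feeds the last variable back into the first so the chain carries across iterations. Whether this is enough for a given compiler still has to be confirmed in the assembly.

```c
/* Hypothetical kernel: the starting values depend on a run-time seed,
   and the returned value depends on every variable in the loop. */
unsigned chained_adds(unsigned seed, unsigned n)
{
    unsigned i;
    unsigned a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
    for (i = 0; i < n; i++) {
        a += d;   /* reads d from the previous iteration */
        b += a;
        c += b;
        d += c;
    }
    return a ^ b ^ c ^ d;   /* combine all four so none can be dropped */
}
```

Printing the returned value at the end of the program keeps the loop from being treated as dead code.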


Explanation:

Since x86 assembly code won't run on a G4 or a SPARC, we wrote the code in plain C and checked the assembly output to make sure nothing was optimized away. The results for the test programs were what one would expect for a superscalar architecture. The Baseline I test, without dependencies, ran in the least time and the Dependency 2 test took the longest, while the two tests with one dependency fell somewhere in between. What was interesting was that the Dependency 3 test ran faster than Baseline II on the Athlon XP and the G4 (though not on the Pentium 4), even though each has only one dependency. Also, if you look at the assembly output for part 6, there are actually more instructions in the Dependency 2 loop because the variables have to be moved around between the registers so that they can be added.
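
The write-up does not say how the loops were timed, so the following is only a sketch of one portable way to do it, using the standard `clock()` function. `time_kernel`, `ns_per_loop`, and `sample_kernel` are hypothetical names introduced for this example.

```c
#include <time.h>

/* Convert a total run time in seconds to nanoseconds per loop iteration. */
double ns_per_loop(double total_seconds, unsigned long loops)
{
    return total_seconds * 1e9 / (double)loops;
}

/* Stand-in kernel so the sketch is self-contained: two independent adds. */
unsigned sample_kernel(unsigned n)
{
    unsigned i, a = 0, b = 0;
    for (i = 0; i < n; i++) {
        a += 1;
        b += 2;
    }
    return a + b;
}

/* Run a kernel once and return the CPU time it took, in seconds.
   The kernel's result is stored through *result so the call cannot
   be discarded as unused. */
double time_kernel(unsigned (*kernel)(unsigned), unsigned loops, unsigned *result)
{
    clock_t start, end;
    start = clock();
    *result = kernel(loops);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Passing each test loop to `time_kernel` with the same loop count and feeding the result to `ns_per_loop` gives numbers in the same form as the tables under Data.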


Data:

AMD Athlon XP
Cache Size 65536 bytes, Bytes/Line 64
Test Name       Total Time (s)   Time per Loop (ns)
Baseline 1      0.601            12.0
Baseline 2      0.671            13.4
Dependency 3    0.663            13.3
Dependency 2    0.891            17.8

Pentium 4
Cache Size 8192 bytes, Bytes/Line 64
Test Name       Total Time (s)   Time per Loop (ns)
Baseline 1      0.691            13.8
Baseline 2      0.731            14.6
Dependency 3    0.763            15.3
Dependency 2    1.264            25.3

PowerPC G4
Cache Size 32768 bytes, Bytes/Line 32
Test Name       Total Time (s)   Time per Loop (ns)
Baseline 1      2.868            57.4
Baseline 2      3.216            64.3
Dependency 3    3.050            61.0
Dependency 2    4.893            97.9
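
To compare the machines, it helps to look at the slowdown of the fully dependent loop relative to the independent one. The sketch below copies the total times from the tables above into a small array and divides Dependency 2 by Baseline 1; the struct and function names are made up for this example.

```c
/* Total times in seconds, copied from the tables above. */
struct timing {
    const char *machine;
    double baseline1, baseline2, dependency3, dependency2;
};

const struct timing results[3] = {
    { "AMD Athlon XP", 0.601, 0.671, 0.663, 0.891 },
    { "Pentium 4",     0.691, 0.731, 0.763, 1.264 },
    { "PowerPC G4",    2.868, 3.216, 3.050, 4.893 }
};

/* Ratio of the fully dependent loop time to the independent loop time. */
double dep2_slowdown(const struct timing *t)
{
    return t->dependency2 / t->baseline1;
}
```

This gives ratios of roughly 1.48 for the Athlon XP, 1.83 for the Pentium 4, and 1.71 for the G4, so the chain of dependencies costs the most on the Pentium 4 and the least on the Athlon XP.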