CS25 - Lab 2
Part 4: Write-Up


Assignment | Explanation | Data


Assignment:

A very common thing for a program to do is access all the elements of a 2-D matrix. Generally matrices are stored in row-major order, so if you make your outer loop over the rows and your inner loop over the columns you access memory in linear order.

For this task, start by creating a matrix of integers that is N rows by M columns, where N = 8 and M = size of cache. Fill the matrix with random integers between 0-3. Then write a for loop that sums the elements of the matrix in row-major order (probably within another for loop so you can time it). Do the same thing but access the matrix in column-major order. Compare the times.

Repeat the task for M = cache size/2, M = cache size/4 and, M = cache size/8. Compare the results.


Explanation:

We expected the row-major results to be faster than the column-major results throughout the different tests regardless of the size of the row. For two of the three systems this was true though the differences between the row-major and column-major vary more than expected.

We found Row-Major to be a lot faster than Column-Major for all of the matrix sizes. The difference between Column-Major and Row-Major increased as the matrices got smaller. This is because for the smaller Row-Major tests all or most of the matrix was able to be stored in cache, while for the Column-Major tests there were still significant amounts of the matrix stored off-cache and data was still being kicked out of the cache at high rates.

The results for part 4 are interesting because they vary greatly with the type of cache acrchitecture of the machines that are running the tests. With the Athlon machine row-major operations outperformed column-major operations more and more as the size of the matrix decresased. In the P4 the difference remained relativly constant while in the test of the G4 the difference steadily decreased to the point that the Column-major operation was faster than Row-Major on an 8 x N/8 matrix.

The PowerMac G4 has an 8-way set associative L1 cache so on the 8xN/8 test, each row of the matrix is one eighth the size of the cache and is divided across the 256 different sets of the L1 cache. Since there are eight rows, the columns are together in each set so the matrix is now stored in column major order. When the matrix, which can fit inside the cache, is added up in row-major order, the computer has to jump from set to set in order to add up a row, but in column major-order, the computer can simply iterate straight down the L1 cache which is why it's faster.

The type of cache structure on each system played a major role in the difference between the performance of the Column-major matrix addition. This is because since the L1 cache is set associative, it's possible to store a matrix in it of the right size so that it would be stored in column-major order.

The Athlon implements a 2-way set associative cache so an 8 row matrix wouldn't take advantage of it all that well. Even when the matrix is stored in the cache, each acess required the computer to jump to a different set in the cache so the Column-major operation doesn't get much faster as more of the matrix is in the cache. The P4 is 4-way set associative so thematrix was not optimized for it but there was still a consistent difference between the different performance times.


Data:

AMD Athlon XP
Cache Size 65536 bytes, Bytes/Line 64
8 x N=65536 Tests Total Time (seconds) Time per Loop (nseconds) 8xN/M / 8xN RM/CM
8 x N Row-Major 5.547s 5546933.2 ns/loop 1 .920
Column-Major 6.028s 6027740.0 ns/loop 1
8 x N/2 Row-Major 2.762s 2762118.1 ns/loop .498 .923
Column-Major 2.991s 2990942.0 ns/loop .496
8 x N/4 Row-Major 1.380s 1379765.0 ns/loop .249 .555
Column-Major 2.486s 2485744.0 ns/loop .412
8 x N/8 Row-Major 0.523s 523364.1 ns/loop .094 .436
Column-Major 1.201s 1201414.8 ns/loop .199

Pentium 4
Cache Size 8192 bytes, Bytes/Line 64
8 x N=8192 Tests Total Time (seconds) Time per Loop (nseconds) 8xN/M / 8xN RM/CM
8 x N Row-Major 6.550s 655017.7 ns/loop 1 .782
Column-Major 8.381s 838122.9 ns/loop 1
8 x N/2 Row-Major 3.245s 324527.3 ns/loop .495 .794
Column-Major 4.085s 408507.9 ns/loop .487
8 x N/4 Row-Major 1.621s 162073.0 ns/loop .247 .793
Column-Major 2.043s 204282.4 ns/loop .312
8 x N/8 Row-Major 0.810s 81002.2 ns/loop .124 .794
Column-Major 1.02s 102080.2 ns/loop .122

PowerPC G4
Cache Size 32768 bytes, Bytes/Line 32
8 x N=32768 Tests Total Time (seconds) Time per Loop (nseconds) 8xN/M / 8xN RM/CM
8 x N Row-Major 1.496s 14956460.0 ns/loop 1 .606
Column-Major 2.468s 24681489.5 ns/loop 1
8 x N/2 Row-Major 0.768s 7678899.8 ns/loop .513 .894
Column-Major 0.859s 8585250.4 ns/loop .348
8 x N/4 Row-Major 0.347s 3470559.1 ns/loop .232 .828
Column-Major 0.419s 4190890.8 ns/loop .170
8 x N/8 Row-Major 0.183s 1825289.7 ns/loop .122 1.075
Column-Major 0.170s 1697678.6 ns/loop .069