Explanation:
We expected the row-major results to be faster than the column-major results
in every test, regardless of the size of the row. This held for two of the
three systems, though the gap between row-major and column-major varied more
than expected.
We found row-major to be much faster than column-major for all of the matrix
sizes. The difference between column-major and row-major increased as the
matrices got smaller. This is because in the smaller row-major tests all or
most of the matrix fit in the cache, while in the column-major tests
significant amounts of the matrix still sat outside the cache and data was
still being evicted from the cache at high rates.
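The two traversal orders compared above can be sketched as follows. This is
only an illustration, not the test program itself: it assumes the matrix is a
contiguous C array of int in row-major layout and that each test sums every
element, and the function names sum_row_major and sum_col_major are
hypothetical.

```c
#include <stddef.h>

/* Sum a rows x cols matrix stored contiguously in C (row-major) layout,
 * visiting elements in the order they sit in memory (stride of 1 element). */
long sum_row_major(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Sum the same matrix column by column. Consecutive accesses are now
 * `cols` elements apart, so each one may touch a different cache line. */
long sum_col_major(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

Both functions return the same total; only the order of memory accesses
differs, which is what the timings measure.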
The results for part 4 are interesting because they vary greatly with the
cache architecture of the machine running the tests. On the Athlon machine,
row-major operations outperformed column-major operations more and more as the
size of the matrix decreased. On the P4 the difference remained relatively
constant, while on the G4 the difference steadily decreased to the point that
the column-major operation was faster than row-major on an 8 x N/8 matrix.
The PowerMac G4 has an 8-way set associative L1 cache, so on the 8 x N/8 test
each row of the matrix is one eighth the size of the cache and is spread
across the 256 different sets of the L1 cache. Since there are eight rows,
the elements of each column sit together in the same set, so the matrix is
effectively stored in column-major order. When the matrix, which fits inside
the cache, is added up in row-major order, the computer has to jump from set
to set to add up a row, but in column-major order it can simply iterate
straight down the L1 cache, which is why it is faster.
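The mapping described above can be checked with a small sketch. The 8 ways
and 256 sets are taken from the text; the 32-byte line size, the cache_geom
struct, and the function names are assumptions made only for illustration.
When each row is one eighth of the cache, rows start exactly
line size x number of sets bytes apart, so the elements of one column all
fall in the same set.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache geometry. The line size used with it is an assumed
 * value; the way and set counts come from the report text. */
typedef struct {
    size_t line_bytes;  /* bytes per cache line (assumed) */
    size_t num_sets;    /* number of sets */
    size_t ways;        /* lines per set (associativity) */
} cache_geom;

/* Total capacity of the cache in bytes. */
size_t cache_capacity(const cache_geom *g) {
    return g->line_bytes * g->num_sets * g->ways;
}

/* Set that a byte address maps to: line number modulo the set count. */
size_t cache_set_index(const cache_geom *g, uintptr_t addr) {
    return (size_t)((addr / g->line_bytes) % g->num_sets);
}
```

With line_bytes = 32, num_sets = 256 and ways = 8, a row of capacity / 8
bytes is 8192 bytes long, so an address and the address 8192 bytes after it
(the same column in the next row) give the same cache_set_index, while the
next line along a row lands in the next set.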
The cache structure of each system played a major role in the performance of
the column-major matrix addition. Because the L1 cache is set associative, a
matrix of the right size can end up stored in it in column-major order.
The Athlon implements a 2-way set associative cache, so an 8-row matrix
doesn't take advantage of it all that well. Even when the matrix is stored in
the cache, each access requires the computer to jump to a different set, so
the column-major operation doesn't get much faster as more of the matrix fits
in the cache. The P4 is 4-way set associative, so the matrix was not
optimized for it, but there was still a consistent difference between the
row-major and column-major times.
Data: