Explanation:
We expected the row-major results to be faster than the column-major results
in every test, regardless of the size of the row. This held for two of the
three systems, though the gap between row-major and column-major varied more
than expected.
We found row-major to be much faster than column-major for all of the matrix
sizes. The difference between column-major and row-major increased as the
matrices got smaller. This is because in the smaller row-major tests all or
most of the matrix fit in cache, while in the column-major tests significant
parts of the matrix were still stored off-cache and data was still being
evicted from the cache at high rates.
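
To make the two access patterns concrete, here is a minimal sketch of the
byte offsets each traversal touches. It assumes the matrix is laid out row by
row in memory (as a C two-dimensional array would be) and that elements are
4 bytes wide; ELEM_SIZE and the function names are illustrative, not taken
from our test program.

```python
# Hypothetical sketch: byte offsets touched when summing a matrix that is
# laid out row by row in memory. ELEM_SIZE is an assumption.
ELEM_SIZE = 4  # bytes per element (assumed 32-bit values)


def row_major_offsets(rows, cols):
    """Offsets visited when the inner loop walks along a row."""
    return [(r * cols + c) * ELEM_SIZE
            for r in range(rows) for c in range(cols)]


def column_major_offsets(rows, cols):
    """Offsets visited when the inner loop walks down a column."""
    return [(r * cols + c) * ELEM_SIZE
            for c in range(cols) for r in range(rows)]


def strides(offsets):
    """Distance in bytes between consecutive accesses."""
    return [b - a for a, b in zip(offsets, offsets[1:])]
```

For a 2 x 4 matrix, the row-major strides are all 4 bytes, so consecutive
accesses usually land in the same cache line. The column-major traversal
starts with a jump of a full row (16 bytes here), and that jump grows with
the row length, so on wide matrices almost every access touches a new line.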
The results for part 4 are interesting because they vary greatly with the
cache architecture of the machine running the tests. On the Athlon machine,
row-major operations outperformed column-major operations more and more as
the size of the matrix decreased. On the P4 the difference remained
relatively constant, while on the G4 the difference steadily decreased to the
point that the column-major operation was faster than row-major on an
8 x N/8 matrix.
The PowerMac G4 has an 8-way set associative L1 cache, so on the 8 x N/8 test
each row of the matrix is one eighth the size of the cache and is spread
across the 256 different sets of the L1 cache. Since there are eight rows,
the elements of each column end up together in the same set, so the matrix is
effectively stored in column-major order. When the matrix, which fits inside
the cache, is added up in row-major order, the computer has to jump from set
to set to add up a row, but in column-major order it can simply iterate
straight down the L1 cache, which is why it is faster.
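
The set mapping described above can be checked with a little arithmetic. The
sketch below takes the 256 sets and 8 ways from the text; the 32-byte line
size and a matrix that starts at address 0 are assumptions made only for
illustration, and the function names are hypothetical.

```python
# Assumed cache geometry: 256 sets and 8 ways come from the text above; the
# 32-byte line size is an assumption for illustration only.
LINE_SIZE = 32   # bytes per cache line (assumed)
NUM_SETS = 256   # number of sets (from the text)
WAYS = 8         # lines per set (8-way set associative)

CACHE_SIZE = LINE_SIZE * NUM_SETS * WAYS
ROW_BYTES = CACHE_SIZE // 8   # each of the 8 rows is one eighth of the cache


def set_index(address):
    """Set a byte address maps to: line number modulo the number of sets."""
    return (address // LINE_SIZE) % NUM_SETS


def sets_for_column(col_byte_offset, rows=8):
    """Set index of the same column position in every row (base address 0)."""
    return [set_index(r * ROW_BYTES + col_byte_offset) for r in range(rows)]
```

Under these assumptions sets_for_column(64) returns [2, 2, 2, 2, 2, 2, 2, 2]:
the same column position in all eight rows maps to one set, filling its eight
ways, while walking along a single row moves through sets 0, 1, 2, and so on.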
The cache structure of each system played a major role in the differences in
performance of the column-major matrix addition. Because the L1 cache is set
associative, a matrix of the right size can be placed in it so that it is
effectively stored in column-major order.
The Athlon implements a 2-way set associative cache, so an 8-row matrix
cannot take advantage of it very well. Even when the matrix is stored in the
cache, each access requires the computer to jump to a different set in the
cache, so the column-major operation doesn't get much faster as more of the
matrix is in the cache. The P4 is 4-way set associative, so the matrix was
not optimized for it either, but there was still a consistent difference
between the row-major and column-major times.
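
The effect of associativity on an 8-row matrix can be illustrated with a toy
simulation. This is only a sketch under stated assumptions: LRU replacement,
a fixed 256 sets with 32-byte lines, 4-byte elements, and a row length chosen
so that all eight rows collide in the same sets. It does not model the actual
Athlon, P4, or G4 caches, and every name in it is hypothetical.

```python
from collections import OrderedDict

LINE_SIZE = 32   # bytes per cache line (assumed)
NUM_SETS = 256   # sets per cache (assumed, kept fixed across runs)
ELEM_SIZE = 4    # bytes per matrix element (assumed)
ROWS = 8
ROW_BYTES = LINE_SIZE * NUM_SETS  # one row touches every set exactly once
COLS = ROW_BYTES // ELEM_SIZE


def count_misses(addresses, ways):
    """Count misses in a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        lines = sets[line % NUM_SETS]
        if line in lines:
            lines.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[line] = True
    return misses


def column_major_addresses():
    """Addresses visited when the inner loop walks down a column."""
    return [r * ROW_BYTES + c * ELEM_SIZE
            for c in range(COLS) for r in range(ROWS)]


def row_major_addresses():
    """Addresses visited when the inner loop walks along a row."""
    return [r * ROW_BYTES + c * ELEM_SIZE
            for r in range(ROWS) for c in range(COLS)]
```

In this toy model, count_misses(column_major_addresses(), 8) misses only once
per cache line, because each set can hold one line from every row. With 2 or
4 ways the eight rows keep evicting each other and every column-major access
misses, while the row-major traversal misses once per line in all three
cases.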
Data: