Lab #2:

The Memory Hierarchy and Computer Performance

Phil Katz, Connie Li, Katie Stockhammer

February 14, 2004


Abstract:

In this lab, we implemented a variety of C programs to determine information about the caches on different platforms. We tested for "endian-ness", accessed memory in both cache-friendly and cache-stressing patterns, and examined how the different platforms handle different data types. We ran all of these tests on the Pentium 4, the G5, and the Pentium M. We quantified our results by comparing the running times of baseline tests against the running times in the other scenarios, which let us see some pros and cons of each architecture.


Descriptions:

We followed the instructions given to us on the website: E25/CS25 Lab 2

More detailed descriptions, expectations, and analysis can be found in the Data section of this lab report.


Answers to lab questions:

Question 1: We implemented a cache stress program by testing cases in which we knew the machine's cache would perform well and cases in which it would perform poorly. After finding the size of each machine's cache, we created an array that was double the size of the cache in order to stress it. Since the array occupied a block of memory larger than the cache, accessing certain elements of the array caused thrashing: in order to reach the next element, another block of memory had to be brought into the cache, replacing an old line. We used the system time to quantify the effect of the cache. If the machine had no cache, each memory access would take roughly the same amount of time as the accesses did when the cache was stressed; in our tests, the cache-friendly accesses clearly took much less time than the stressed ones.
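The following is a simplified sketch of the kind of access pattern involved (it is not our exact test program); the cache size, line size, and pass count are illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative parameters, not our exact test values. */
#define CACHE_BYTES (32 * 1024)                      /* assumed L1 data cache size */
#define LINE_BYTES  64                               /* assumed bytes per line     */
#define N           (2 * CACHE_BYTES / sizeof(int))  /* array twice the cache size */
#define LINE_INTS   (LINE_BYTES / sizeof(int))
#define PASSES      1000

int main(void)
{
    int *a = calloc(N, sizeof(int));
    long sum = 0;
    size_t i, p;

    /* Cache-nice passes: sequential access, so most reads hit a line that
       was just brought in for a neighboring element. */
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < N; i++)
            sum += a[i];

    /* Cache-stress passes: touch one element per line across an array that
       is twice the cache size; by the time a line is revisited it has been
       evicted, so nearly every access brings in a new line (thrashing). */
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < N; i += LINE_INTS)
            sum += a[i];

    printf("%ld\n", sum);   /* use the result so the loops are not optimized away */
    free(a);
    return 0;
}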

Question 2: We implemented superscalar test programs as separate loops of eight operations each, with different levels of dependency between the operations. When the eight operations were completely (or very nearly) independent of each other, the machine could achieve processor-level parallelism and execute more than one instruction at a time, so the loop ran faster. When a loop contained strongly dependent instructions, the machine could not take advantage of its superscalar architecture and had to wait for each calculation to finish before beginning the next. We again used the system clock to measure the difference in performance. A machine without a superscalar design would perform the same whether or not the instructions depended on each other. As with the cache stress test, there was a clear difference between the time the loop took when the machine's superscalar design could be used and the time it took when the dependencies prevented its use.
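Below is a simplified sketch of the two extremes (not our exact test code), assuming eight integer additions per loop iteration; the constants and iteration count are illustrative:

#include <stdio.h>

#define ITERS 10000000L

int main(void)
{
    int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8;
    int x = 1;
    long i;

    /* Independent: the eight additions share no operands, so a superscalar
       CPU can issue several of them in the same cycle. */
    for (i = 0; i < ITERS; i++) {
        a += 1; b += 2; c += 3; d += 4;
        e += 5; f += 6; g += 7; h += 8;
    }

    /* Dependent: each addition needs the result of the one before it, so the
       operations must finish one at a time no matter how many ALUs exist. */
    for (i = 0; i < ITERS; i++) {
        x += 1; x += 2; x += 3; x += 4;
        x += 5; x += 6; x += 7; x += 8;
    }

    printf("%d %d %d %d %d %d %d %d %d\n", a, b, c, d, e, f, g, h, x);
    return 0;
}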

Question 3: This lab showed us that, when possible, it is beneficial to take advantage of certain aspects of a machine's architecture and to avoid practices that stress it. We learned that it is better to arrange data and instructions sequentially and close together, so that when we need something it has a high probability of already being in the cache; clustered data keeps the computer from continually replacing lines in its L1 data cache. For the same reason, it is faster to store and then access the elements of a two-dimensional array in row-major order, since the data are then sequential and clustered. And although it is not always possible, it is good programming practice to avoid excessive dependencies in our code, so that a program can take advantage of superscalar architecture and run multiple instructions at the same time.

Question 4: We extended the lab in part 3 by adding a test that tries to stress the P4's and the G5's L2 caches as well. When running our L1 data cache tests, we found that the stressed case was not performing much worse than the cache-nice case. We suspected this was because the L2 cache was helping the L1: the data we wanted was no longer only in main memory but already in the L2 cache. Stressing the L2 cache as well gave us results closer to what we expected, with memory accesses noticeably slower when the data caches were stressed.


Data: Graphs, Tables, and Interpretations:

We first wrote a small C program to test whether a processor is little endian or big endian. To do this, we declared an integer and used a character pointer to look at each individual byte of that number. We could tell which kind of endian the machine was by how these bytes were stored: little endian stores the least significant byte first, putting the integer's bytes in reverse order, while big endian stores the integer in its written order. As for whether we could discern differences in the bit ordering within the bytes, we found none; only the bytes were out of order, and each byte contained its original value. A sketch of the test is shown below, followed by what we found:
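The following is a simplified sketch of the test (not our exact program), assuming a 4-byte unsigned int initialized to the known value 0x04030201:

#include <stdio.h>

int main(void)
{
    unsigned int value = 0x04030201;               /* bytes have known values   */
    unsigned char *byte = (unsigned char *)&value; /* view the int byte by byte */
    int i;

    /* Print the bytes in the order they sit in memory. */
    for (i = 0; i < (int)sizeof(value); i++)
        printf("byte %d: %02x\n", i, byte[i]);

    /* Little endian stores the least significant byte first. */
    if (byte[0] == 0x01)
        printf("little endian\n");
    else
        printf("big endian\n");

    return 0;
}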

Table of Endian-ness
Platform       Endianness
Pentium 4      Little Endian
G5             Big Endian
Pentium M      Little Endian


In order to complete part 2 and test the cache structure, we had to research the size, line size, number of lines, and mapping of the cache on each platform we tested. We came up with the following results:

Types of Cache
Platform     Cache     Cache Size    Line Size        # of Lines     Mapping
Pentium 4    L1 Data   8 KBytes      64 bits/line     1024 lines     4-way set-associative
Pentium 4    L2        512 KBytes    128 bits/line    32768 lines    8-way set-associative
G5           L1 Data   32 KBytes     64 bits/line     4096 lines     2-way set-associative
G5           L2        512 KBytes    128 bits/line    32768 lines    8-way set-associative
Pentium M    L1 Data   32 KBytes     64 bits/line     4096 lines     4-way set-associative


For part 3, we implemented code to see how much time each platform required to perform certain memory-access tasks. To do this, we allocated arrays, some the size of the cache and some twice the size of the cache. We first accessed items already stored in the cache and compared the time each machine took to access them. We then deliberately stressed the cache by forcing it to retrieve a new line during each access. We believe that the design of the Pentium 4 kept us from thrashing its L1 cache, because the L2 cache, which is nearly as fast, was absorbing the misses. To get around this, we also thrashed the L2 cache on the Pentium 4, and we did the same on the G5. When we tried to thrash the L2 on the Pentium M, however, we ran into a program error: the machine would not let us use that much data in one program without crashing. A sketch of our timing approach is shown below, followed by the per-loop results:
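Below is a simplified sketch of how one of the per-loop times could be measured with gettimeofday(); the array size, pass count, and access pattern are placeholders rather than our exact test parameters:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N      (64 * 1024)   /* illustrative: an array larger than the L1 data cache */
#define PASSES 1000

int main(void)
{
    int *a = calloc(N, sizeof(int));
    struct timeval start, end;
    long sum = 0;
    size_t i, p;
    double elapsed_ns;

    gettimeofday(&start, NULL);
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < N; i++)
            sum += a[i];                       /* access pattern under test */
    gettimeofday(&end, NULL);

    /* Convert the elapsed time to nanoseconds and divide by the loop count. */
    elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
               + (end.tv_usec - start.tv_usec) * 1e3;
    printf("%.1f ns/loop (sum=%ld)\n", elapsed_ns / ((double)PASSES * N), sum);

    free(a);
    return 0;
}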

Comparison of cache retrieval times (ns/loop)
               Pentium 4           G5                  Pentium M
               L1       L2         L1       L2         L1
Baseline 1     15.9     38.6       51.3     53.8       29.6
Baseline 2     16.4     45.7       50.1     60.2       25.4
Cache Stress   18.5     50.4       66.2     79.7       33.0
2-way Nice     16.8     40.5       65.5     68.9       32.1


In part 4, we created a 2D matrix and compared the Pentium 4, G5, and Pentium M's processing time when accessing data from the matrix in row-major order and then in column-major order. For each test, the matrix had 8 rows and m columns, where m was the cache size, cache size/2, cache size/4, or cache size/8. It is clear that progressing through a two-dimensional matrix in row-major order is much faster than in column-major order. This is to be expected because of the way the array is laid out in memory: row-major order accesses memory addresses sequentially. We also see that the 4-way set-associative caches outperform the G5's 2-way set-associative cache, especially when the number of columns equals the cache size, since there is less thrashing. Again, it appears that the P4's L2 cache helps expedite access to the matrix elements. A sketch of the two access orders is shown below, followed by our results:
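The following is a simplified sketch of the two access orders (not our exact program); the row and column counts are illustrative:

#include <stdio.h>

#define ROWS 8
#define COLS 4096   /* illustrative column count, not one of our cache-derived values */

static int m[ROWS][COLS];

int main(void)
{
    long sum = 0;
    int r, c;

    /* Row-major order: the inner loop walks consecutive addresses, so each
       cache line that is brought in gets fully used before moving on. */
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            sum += m[r][c];

    /* Column-major order: consecutive accesses are a full row apart, so each
       access lands on a different line, and lines are evicted before their
       other elements are ever used. */
    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            sum += m[r][c];

    printf("%ld\n", sum);
    return 0;
}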

Comparison of cache retrieval times on a 2D Array: Pentium 4 (L1 ns/loop)
# Columns       Row Major Order    Column Major Order
Cache Size      228,604.0          362,183.8
Cache Size/2    114,359.8          176,931.4
Cache Size/4    56,722.0           87,169.3
Cache Size/8    27,569.1           42,212.0


Comparison of cache retrieval times on a 2D Array: G5 (ns/loop)
# Columns       Row Major Order    Column Major Order
Cache Size      2,248,562.9        5,766,072.7
Cache Size/2    1,118,965.2        2,214,013.4
Cache Size/4    554,930.4          903,092.3
Cache Size/8    285,831.8          452,926.6


Comparison of cache retrieval times on a 2D Array: Pentium M (ns/loop)
# Columns       Row Major Order    Column Major Order
Cache Size      1,622,332.8        2,899,168.8
Cache Size/2    787,131.9          935,344.9
Cache Size/4    390,561.6          466,671.1
Cache Size/8    214,308.1          225,324.0


For part 5, we looked at how each platform handles different data types: 32-bit integer, single-precision floating point, double-precision floating point, and 8-bit integer. We performed addition, subtraction, multiplication, and division on each of these types and compared the calculation times. Our results were mostly what we expected. Addition and subtraction took roughly the same amount of time and, being the least complex operations, were the fastest; multiplication came next on all three processors, and division was clearly the slowest. The G5 had a major advantage in performing single- and double-precision floating point calculations, as advertised. One somewhat surprising result was that single-precision calculations sometimes took longer than double-precision ones; we had expected that increasing the number of bits involved would increase the calculation time. A sketch of one of the timing loops is shown below, followed by the results:
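Below is a simplified sketch of one operation/type pair, double-precision division (not our exact test code); the other table entries would come from similar loops with the operand type and operator changed, and the constants and iteration count here are illustrative:

#include <stdio.h>

#define ITERS 10000000L

int main(void)
{
    double x = 1.0e6;
    long i;

    /* Repeated double-precision division; timing this loop (e.g. with the
       gettimeofday() harness sketched in part 3) and dividing by ITERS gives
       an ns-per-loop figure for one type/operation pair. */
    for (i = 0; i < ITERS; i++)
        x = x / 1.000001;

    printf("%f\n", x);   /* use the result so the loop is not optimized away */
    return 0;
}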

Pentium 4: Comparison of types (ns/loop)
                    Addition   Multiplication   Subtraction   Division
Double-precision    9.2        2698.7           9.3           3079
Float               9.2        2708.6           9.3           3080.7
32-bit Integer      7.1        5.8              7.1           12.5
8-bit Integer       7.5        6.3              7.1           45.7


G5: Comparison of types (ns/loop)
                    Addition   Multiplication   Subtraction   Division
Double-precision    16.1       13.2             13.8          18.8
Float               21.0       18.8             19.4          34.0
32-bit Integer      15.9       14.3             14.7          13.4
8-bit Integer       14.1       12.2             12.6          28.2


Pentium M: Comparison of types (ns/loop)
                    Addition   Multiplication   Subtraction   Division
Double-precision    14.7       801.2            13.4          1301.9
Float               14.9       801.2            14.5          801.2
32-bit Integer      11.5       8.0              10.6          200.3
8-bit Integer       13.0       11.8             11.4          103.7


For part 6, we investigated superscalar architecture. We were told that each system would use multiple ALUs and FPUs in parallel, when possible, to speed up arithmetic operations. When a series of independent computations is performed, as in our Baseline I, the CPU can run them in parallel. However, as we increase the dependencies, the CPU has to wait for each operation to finish before it can move on to the next dependent operation. This is clearly evident on all three systems, where the tests took significantly longer as we increased the dependencies. The results are as follows:

Effects of Superscalar Architecture: Pentium 4
               Int (ns/loop)   Float (ns/loop)
Baseline I     8.0             10.8
Baseline II    8.7             11.2
Dependency 3   8.1             12.0
Dependency 2   22.8            30.3


Effects of Superscalar Architecture: G5
               Int (ns/loop)   Float (ns/loop)
Baseline I     16.0            19.3
Baseline II    14.7            25.0
Dependency 3   27.2            66.6
Dependency 2   67.1            132.7


Effects of Superscalar Architecture: Pentium M
               Int (ns/loop)   Float (ns/loop)
Baseline I     12.2            13.8
Baseline II    11.2            13.5
Dependency 3   18.1            22.1
Dependency 2   36.4            47.5