Lab #2:

The Memory Hierarchy and Computer Performance

Phil Katz, Connie Li, Katie Stockhammer

February 14, 2004


Abstract:

In this lab, we implemented a variety of C programs to determine information about the caches on different platforms. We tested for "endian-ness", accessed memory in both cache-friendly and cache-stressing patterns, and examined how the different platforms handle different data types. We ran all of these tests on the Pentium 4, the G5, and the Pentium M. We quantified our results by comparing the running times of baseline tests against the running times in the other scenarios, which let us see some pros and cons of each architecture.


Descriptions:

We followed the instructions given to us on the website: E25/CS25 Lab 2

More detailed descriptions, expectations, and analysis can be found in the Data section of this lab report.


Answers to lab questions:

Question 1: We implemented a cache stress program by testing cases in which we knew the machine's cache would perform well and cases in which it would perform poorly. After finding the size of each machine's cache, we created an array that was double the size of the cache in order to stress it. Since the array occupied a block of memory larger than the cache, accessing certain elements of the array caused thrashing: in order to reach the next element, another block of memory had to be brought into the cache, replacing an old line. We used the system time to quantify the effect of the cache. If the machine had no cache, each memory access would take roughly the same amount of time as the accesses did when the cache was stressed; in our tests, the cache-friendly accesses clearly took much less time than the stressed ones.
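The following is a simplified sketch of the kind of access pattern involved (it is not our exact test program); the cache size, line size, and pass count are illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative parameters, not our exact test values. */
#define CACHE_BYTES (32 * 1024)                      /* assumed L1 data cache size */
#define LINE_BYTES  64                               /* assumed bytes per line     */
#define N           (2 * CACHE_BYTES / sizeof(int))  /* array twice the cache size */
#define LINE_INTS   (LINE_BYTES / sizeof(int))
#define PASSES      1000

int main(void)
{
    int *a = calloc(N, sizeof(int));
    long sum = 0;
    size_t i, p;

    /* Cache-nice passes: sequential access, so most reads hit a line that
       was just brought in for a neighboring element. */
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < N; i++)
            sum += a[i];

    /* Cache-stress passes: touch one element per line across an array that
       is twice the cache size; by the time a line is revisited it has been
       evicted, so nearly every access brings in a new line (thrashing). */
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < N; i += LINE_INTS)
            sum += a[i];

    printf("%ld\n", sum);   /* use the result so the loops are not optimized away */
    free(a);
    return 0;
}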

Question 2: We implemented superscalar test programs as separate loops of eight operations each, with different levels of dependency between the operations. When the eight operations were completely (or very nearly) independent of each other, the machine could achieve processor-level parallelism and execute more than one instruction at a time, so the loop ran faster. When a loop contained strongly dependent instructions, the machine could not take advantage of its superscalar architecture and had to wait for each calculation to finish before beginning the next. We again used the system clock to measure the difference in performance. A machine without a superscalar design would perform the same whether or not the instructions depended on each other. As with the cache stress test, there was a clear difference between the time the loop took when the machine's superscalar design could be used and the time it took when the dependencies prevented its use.
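Below is a simplified sketch of the two extremes (not our exact test code), assuming eight integer additions per loop iteration; the constants and iteration count are illustrative:

#include <stdio.h>

#define ITERS 10000000L

int main(void)
{
    int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8;
    int x = 1;
    long i;

    /* Independent: the eight additions share no operands, so a superscalar
       CPU can issue several of them in the same cycle. */
    for (i = 0; i < ITERS; i++) {
        a += 1; b += 2; c += 3; d += 4;
        e += 5; f += 6; g += 7; h += 8;
    }

    /* Dependent: each addition needs the result of the one before it, so the
       operations must finish one at a time no matter how many ALUs exist. */
    for (i = 0; i < ITERS; i++) {
        x += 1; x += 2; x += 3; x += 4;
        x += 5; x += 6; x += 7; x += 8;
    }

    printf("%d %d %d %d %d %d %d %d %d\n", a, b, c, d, e, f, g, h, x);
    return 0;
}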

Question 3: This lab showed us that, when possible, it is beneficial to take advantage of certain aspects of a machine's architecture and to avoid practices that stress it. We learned that it is better to arrange data and instructions sequentially and close together, so that when we need something it has a high probability of already being in the cache; clustered data keeps the computer from continually replacing lines in its L1 data cache. For the same reason, it is faster to store and then access the elements of a two-dimensional array in row-major order, since the data are then sequential and clustered. And although it is not always possible, it is good programming practice to avoid excessive dependencies in our code, so that a program can take advantage of superscalar architecture and run multiple instructions at the same time.

Question 4: We extended the lab in part 3 by adding a test that tries to stress the P4's and the G5's L2 caches as well. When running our L1 data cache tests, we found that the stressed case was not performing much worse than the cache-nice case. We suspected this was because the L2 cache was helping the L1: the data we wanted was no longer only in main memory but already in the L2 cache. Stressing the L2 cache as well gave us results closer to what we expected, with memory accesses noticeably slower when the data caches were stressed.


Data: Graphs, Tables, and Interpretations:

We first wrote a small C program to test whether a processor is little endian or big endian. To do this, we declared an integer and used a character pointer to look at each individual byte of that number. We could tell which kind of endian the machine was by how these bytes were stored: little endian stores the least significant byte first, putting the integer's bytes in reverse order, while big endian stores the integer in its written order. As for whether we could discern differences in the bit ordering within the bytes, we found none; only the bytes were out of order, and each byte contained its original value. A sketch of the test is shown below, followed by what we found:
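The following is a simplified sketch of the test (not our exact program), assuming a 4-byte unsigned int initialized to the known value 0x04030201:

#include <stdio.h>

int main(void)
{
    unsigned int value = 0x04030201;               /* bytes have known values   */
    unsigned char *byte = (unsigned char *)&value; /* view the int byte by byte */
    int i;

    /* Print the bytes in the order they sit in memory. */
    for (i = 0; i < (int)sizeof(value); i++)
        printf("byte %d: %02x\n", i, byte[i]);

    /* Little endian stores the least significant byte first. */
    if (byte[0] == 0x01)
        printf("little endian\n");
    else
        printf("big endian\n");

    return 0;
}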

Table of Endian-ness
Platform       Endianness
Pentium 4      Little Endian
G5             Big Endian
Pentium M      Little Endian


In order to complete part 2 and test the cache structure, we had to research the size, line size, number of lines, and mapping of the cache on each platform we tested. We came up with the following results:

Types of Cache
Platform     Cache     Cache Size    Line Size        # of Lines     Mapping
Pentium 4    L1 Data   8 KBytes      64 bits/line     1024 lines     4-way set-associative
Pentium 4    L2        512 KBytes    128 bits/line    32768 lines    8-way set-associative
G5           L1 Data   32 KBytes     64 bits/line     4096 lines     2-way set-associative
G5           L2        512 KBytes    128 bits/line    32768 lines    8-way set-associative
Pentium M    L1 Data   32 KBytes     64 bits/line     4096 lines     4-way set-associative


For part 3, we implemented code to see how much time each platform required to perform certain memory-access tasks. To do this, we allocated arrays, some the size of the cache and some twice the size of the cache. We first accessed items already stored in the cache and compared the time each machine took to access them. We then deliberately stressed the cache by forcing it to retrieve a new line during each access. We believe that the design of the Pentium 4 kept us from thrashing its L1 cache, because the L2 cache, which is nearly as fast, was absorbing the misses. To get around this, we also thrashed the L2 cache on the Pentium 4, and we did the same on the G5. When we tried to thrash the L2 on the Pentium M, however, we ran into a program error: the machine would not let us use that much data in one program without crashing. A sketch of our timing approach is shown below, followed by the per-loop results:
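Below is a simplified sketch of how one of the per-loop times could be measured with gettimeofday(); the array size, pass count, and access pattern are placeholders rather than our exact test parameters:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N      (64 * 1024)   /* illustrative: an array larger than the L1 data cache */
#define PASSES 1000

int main(void)
{
    int *a = calloc(N, sizeof(int));
    struct timeval start, end;
    long sum = 0;
    size_t i, p;
    double elapsed_ns;

    gettimeofday(&start, NULL);
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < N; i++)
            sum += a[i];                       /* access pattern under test */
    gettimeofday(&end, NULL);

    /* Convert the elapsed time to nanoseconds and divide by the loop count. */
    elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
               + (end.tv_usec - start.tv_usec) * 1e3;
    printf("%.1f ns/loop (sum=%ld)\n", elapsed_ns / ((double)PASSES * N), sum);

    free(a);
    return 0;
}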

Comparison of cache retrieval times (ns/loop)
               Pentium 4           G5                  Pentium M
               L1       L2         L1       L2         L1
Baseline 1     15.9     38.6       51.3     53.8       29.6
Baseline 2     16.4     45.7       50.1     60.2       25.4
Cache Stress   18.5     50.4       66.2     79.7       33.0
2-way Nice     16.8     40.5       65.5     68.9       32.1


In part 4, we created a 2D matrix and compared the Pentium 4, G5, and Pentium M's processing time when accessing data from the matrix in row-major order and then in column-major order. For each test, the matrix had 8 rows and m columns, where m was the cache size, cache size/2, cache size/4, or cache size/8. It is clear that progressing through a two-dimensional matrix in row-major order is much faster than in column-major order. This is to be expected because of the way the array is laid out in memory: row-major order accesses memory addresses sequentially. We also see that the 4-way set-associative caches outperform the G5's 2-way set-associative cache, especially when the number of columns equals the cache size, since there is less thrashing. Again, it appears that the P4's L2 cache helps expedite access to the matrix elements. A sketch of the two access orders is shown below, followed by our results:
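The following is a simplified sketch of the two access orders (not our exact program); the row and column counts are illustrative:

#include <stdio.h>

#define ROWS 8
#define COLS 4096   /* illustrative column count, not one of our cache-derived values */

static int m[ROWS][COLS];

int main(void)
{
    long sum = 0;
    int r, c;

    /* Row-major order: the inner loop walks consecutive addresses, so each
       cache line that is brought in gets fully used before moving on. */
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            sum += m[r][c];

    /* Column-major order: consecutive accesses are a full row apart, so each
       access lands on a different line, and lines are evicted before their
       other elements are ever used. */
    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            sum += m[r][c];

    printf("%ld\n", sum);
    return 0;
}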

Comparison of cache retrieval times on a 2D Array: Pentium 4 (L1 ns/loop)
# Columns       Row Major Order    Column Major Order
Cache Size      228,604.0          362,183.8
Cache Size/2    114,359.8          176,931.4
Cache Size/4    56,722.0           87,169.3
Cache Size/8    27,569.1           42,212.0


Comparison of cache retrieval times on a 2D Array: G5 (ns/loop)
# Columns       Row Major Order    Column Major Order
Cache Size      2,248,562.9        5,766,072.7
Cache Size/2    1,118,965.2        2,214,013.4
Cache Size/4    554,930.4          903,092.3
Cache Size/8    285,831.8          452,926.6


Comparison of cache retrieval times on a 2D Array: Pentium M (ns/loop)
# Columns       Row Major Order    Column Major Order
Cache Size      1,622,332.8        2,899,168.8
Cache Size/2    787,131.9          935,344.9
Cache Size/4    390,561.6          466,671.1
Cache Size/8    214,308.1          225,324.0


For part 5, we looked at how each platform handles different data types: 32-bit integer, single-precision floating point, double-precision floating point, and 8-bit integer. We performed addition, subtraction, multiplication, and division on each of these types and compared the calculation times. Our results were mostly what we expected. Addition and subtraction took roughly the same amount of time and, being the least complex operations, were the fastest; multiplication came next on all three processors, and division was clearly the slowest. The G5 had a major advantage in performing single- and double-precision floating point calculations, as advertised. One somewhat surprising result was that single-precision calculations sometimes took longer than double-precision ones; we had expected that increasing the number of bits involved would increase the calculation time. A sketch of one of the timing loops is shown below, followed by the results:
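Below is a simplified sketch of one operation/type pair, double-precision division (not our exact test code); the other table entries would come from similar loops with the operand type and operator changed, and the constants and iteration count here are illustrative:

#include <stdio.h>

#define ITERS 10000000L

int main(void)
{
    double x = 1.0e6;
    long i;

    /* Repeated double-precision division; timing this loop (e.g. with the
       gettimeofday() harness sketched in part 3) and dividing by ITERS gives
       an ns-per-loop figure for one type/operation pair. */
    for (i = 0; i < ITERS; i++)
        x = x / 1.000001;

    printf("%f\n", x);   /* use the result so the loop is not optimized away */
    return 0;
}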

Pentium 4: Comparison of types (ns/loop)
                    Addition   Multiplication   Subtraction   Division
Double-precision    9.2        2698.7           9.3           3079
Float               9.2        2708.6           9.3           3080.7
32-bit Integer      7.1        5.8              7.1           12.5
8-bit Integer       7.5        6.3              7.1           45.7


G5: Comparison of types (ns/loop)
                    Addition   Multiplication   Subtraction   Division
Double-precision    16.1       13.2             13.8          18.8
Float               21.0       18.8             19.4          34.0
32-bit Integer      15.9       14.3             14.7          13.4
8-bit Integer       14.1       12.2             12.6          28.2


Pentium M: Comparison of types (ns/loop)
                    Addition   Multiplication   Subtraction   Division
Double-precision    14.7       801.2            13.4          1301.9
Float               14.9       801.2            14.5          801.2
32-bit Integer      11.5       8.0              10.6          200.3
8-bit Integer       13.0       11.8             11.4          103.7


For part 6, we investigated superscalar architecture. We were told that each system would use multiple ALUs and FPUs in parallel, when possible, to speed up arithmetic operations. When a series of independent computations is performed, as in our Baseline I, the CPU can run them in parallel. However, as we increase the dependencies, the CPU has to wait for each operation to finish before it can move on to the next dependent operation. This is clearly evident on all three systems, where the tests took significantly longer as we increased the dependencies. The results are as follows:

Effects of Superscalar Architecture: Pentium 4
               Int (ns/loop)   Float (ns/loop)
Baseline I     8.0             10.8
Baseline II    8.7             11.2
Dependency 3   8.1             12.0
Dependency 2   22.8            30.3


Effects of Superscalar Architecture: G5
               Int (ns/loop)   Float (ns/loop)
Baseline I     16.0            19.3
Baseline II    14.7            25.0
Dependency 3   27.2            66.6
Dependency 2   67.1            132.7


Effects of Superscalar Architecture: Pentium M
               Int (ns/loop)   Float (ns/loop)
Baseline I     12.2            13.8
Baseline II    11.2            13.5
Dependency 3   18.1            22.1
Dependency 2   36.4            47.5