Chapter 4.: Ex. 1. Caches and Address Translation. Consider A 64 Byte Cache With 8 Byte Blocks, An Associativity of

Chapter 4.

Ex. 1. Caches and Address Translation. Consider a 64byte cache with 8 byte blocks, an associativity of
2 and LRU block replacement. Virtual addresses are 16 bits. The cache is physically tagged. The
processor has 16KB of physical memory.
a) What is the total number of tag bits?
b) For the following sequence of references, label the cache misses.Also, label each miss as being
either a compulsory miss, a capacity miss, or a conflict miss. The addresses are given in octal
(each digit represents 3 bits). Assume the cache initially contains block addresses: 000, 010, 020,
030, 040, 050, 060, and 070 which were accessed in that order.
c) Which of the following techniques are aimed at reducing the cost of a miss: dividing the current
block into subblocks, a larger block size, the addition of a second level cache, the addition of a
victim buffer, early restart with critical word first, a writeback buffer, skewed associativity,
software prefetching, the use of a TLB, and multiporting.
d) Why are the first level caches usually split (instructions and data are in different caches) while the
L2 is usually unified (instructions and data are both in the same cache)?
Ex. 2. Assume the following 10bit address sequence generated by the microprocessor:
The cache uses 4 bytes per block. Assume a 2way set assocative cache design that uses the LRU
algorithm (with a cache that can hold a total of 4 blocks). Assume that the cache is initially empty. First
determine the TAG, SET, BYTE OFFSET fields and fill in the table above. In the figure below, clearly
mark for each access the TAG, Least Recently Used (LRU), and HIT/MISS information for each access.
And then, derive the hit ratio for the access sequence.
Ex. 3.
a) Why is miss rate not a good metric for evaluating cache performance? What is the appropriate
metric? Give its definition. What is the reason for using a combination of first and second level
caches rather than using the same chip area for a larger firstlevel cache?
b) The original motivation for using virtual memory was “compatibility”. What does that mean in
this context? What are two other motivations for using virtual memory?
c) What are the two characteristics of program memory accesses that caches exploit?
d) What are three types of cache misses?
Ex. 4. Design a 128KB directmapped data cache that uses a 32bit address and 16 bytes per block.
Calculate the following:
a) How many bits are used for the byte offset?
b) How many bits are used for the set (index) field?
c) How many bits are used for the tag?
Ex. 5. Design a 8way set associative cache that has 16 blocks and 32 bytes per block. Assume a 32 bit
address. Calculate the following:
a) How many bits are used for the byte offset?
b) How many bits are used for the set (index) field?
c) How many bits are used for the tag?
Ex. 6. This question covers cache and pipeline performance analysis.
a) Write the formula for the average memory access time assuming one level of cache memory:
b) For a data cache with a 92% hit rate and a 2cycle hit latency, calculate the average memory
access latency. Assume that latency to memory and the cache miss penalty together is 124 cycles.
Note: The cache must be accessed after memory returns the data.
c) Calculate the performance of a processor taking into account stalls due to data cache and
instruction cache misses. The data cache (for loads and stores) is the same as described in Part B
and 30% of instructions are loads and stores. The instruction cache has a hit rate of 90% with a
miss penalty of 50 cycles. Assume the base CPI using a perfect memory system is 1.0. Calculate
the CPI of the pipeline, assuming everything else is working perfectly. Assume the load never
stalls a dependent instruction and assume the processor must wait for stores to finish when they
miss the cache. Finally, assume that instruction cache misses and data cache misses never occur
at the same time. Show your work.
· Calculate the additional CPI due to the icache stalls.
· Calculate the additional CPI due to the dcache stalls.
· Calculate the overall CPI for the machine.
Ex. 7. A processor has a 32 byte memory and an 8 byte directmapped cache. Table 0 shows the current
state of the cache. Write hit or miss under the each address in the memory reference sequence
below. Show the new state of the cache for each miss in a new table, label the table with the
address, and circle the change. Calculate Hit, Miss rates.
Ex. 8. A processor has a 32 byte memory and an 8 byte 4way set associative cache. Table 0 shows the
current state of the cache. Use the Least Recently Used replacement policy. Write hit or miss
under the each address in the memory reference sequence below. Show the new state of the
cache for each miss in a new table, label the table with the address, and circle the change.
Calculate Hit, Miss rates.
Ex. 9. How many total SRAM bits will be required to implement a 256KB fourway set associative
cache. The cache is physicallyindexed cache, and has 64byte blocks. Assume that there are 4
extra bits per entry: 1 valid bit, 1 dirty bit, and 2 LRU bits for the replacement policy. Assume
that the physical address is 50 bits wide.
Ex. 10. Caches: Misses and Hits
int i;
int a[1024*1024];
int x=0;
for(i=0;i<1024;i++)
{
x+=a[i]+a[1024*i];
}
Consider the code snippet in code above. Suppose that it is executed on a system with a 2way set
associative 16KB data cache with 32byte blocks, 32bit words, and an LRU replacement policy. Assume
that int is wordsized. Also assume that the address of a is 0x0, that i and x are in registers, and that the
cache is initially empty. How many data cache misses are there? How many hits are there?
Ex. 11. Describe the general characteristics of a program that would exhibit very little temporal and
spatial locality with regard to instruction fetches. Provide an example of such a program (pseudo
code is fine). Also, describe the cache effects of excessive unrolling. Use the terms static
instructions and dynamic instructions in your description.
Ex. 12. You are given an empty 16K 2way setassociative LRUreplacement cache with 32 byte blocks
on a machine with 4 byte words and 32bit addresses. Describe in mathematical terms a memory
read address sequence which yields the following Hit/Miss patterns. If such a sequence is
impossible, state why. Some sample sequences are given:
Hit/Miss pattern Address sequence being accessed
Miss, Hit, Hit, Miss : 0,0,0,32
Miss, (Hit)* 0, 0*
(Hit)*
(Miss)*
(Miss, Hit)*
Ex. 13. Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a
machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all
misses, determine how much faster a machine would run with a perfect cache that never missed.
Assume 36% of instructions are loads/stores.
Ex. 14. Consider the following piece of code:
int x = 0, y = 0; // The compiler puts x in r1 and y in r2.
int i; // The compiler put i in r3.
int A[4096]; // A is in memory at address 0x10000
...
for (i=0;i<1024;i++) {
x += A[i];
}
for (i=0;i<1024;i++) {
y += A[i+2048];
}
a) Assume that the system has a 8192byte, directmapped data cache with 16byte blocks.
Assuming that the cache starts out empty, what is the series of data cache hits and misses for this
snippet of code. Assume that ints are 32bits.
b) Assume that an iteration of a loop in which the load hits takes 10 cycles but that an iteration of a
loop in which the load misses takes 100 cycles. What is the execution time of this snippet with
the aforementioned cache?
Ex. 15. Consider the following piece of code:
int x = 0, y = 0; // The compiler puts x in r1 and y in r2.
int i; // The compiler put i in r3.
int A[4096]; // A is in memory at address 0x10000
...
for (i=0;i<1024;i++) {
x += A[i];
}
for (i=0;i<1024;i++) {
y += A[i+2048];
}
a) Assume that the system has a 8192byte, the cache is 2way set associative with an LRU
replacement policy and 16byte sets (8byte blocks). Assuming that the cache starts out empty,
what is the series of data cache hits and misses for this snippet of code. Assume that ints are 32
bits.
b) Assume that an iteration of a loop in which the load hits takes 10 cycles but that an iteration of a
loop in which the load misses takes 100 cycles. What is the execution time of this snippet with
the aforementioned cache?
Ex. 16. Xét hệ thống máy tính bao gồm các thành phần sau:
1. Bộ xử lý có tốc độ đồng hồ 2.4Ghz
2. 2 bộ đệm ánh xạ trực tiếp L1, kích thước 32KB
· Bộ đệm lệnh (Icache), mỗi đường (khối) kích thước 32bytes
· Bộ đệm dữ liệu, kích thước đường (khối) 16 bytes. Bộ đệm này cho phép ghi xuyên (write
through) với phần đệm ghi (writebuffer)
· 2 bộ đệm L1 được truy cập với thời gian bằng chu kỳ đồng hồ của bộ xử lý (no penalty).
· 2 bộ đệm L1 được nối với bộ đệm L2 bằng đường bus 128 bit, tốc độ 800Mhz
3. Bộ đệm L2 dùng chung, là bộ đệm ghi sau có các thông số sau:
· Thời gian lấy địa chỉ và cung cấp 1 đường nhớ là 5ns (access time)
· Kích thước đường là 64bytes
· Kết nối với bộ nhớ chính bằng đường bus 128bit tốc độ 400Mhz
4. Bộ nhớ chính có thời gian truy cập là 20ns
Khi thực hiện một chương trình benchmark, ta đo được các thông số sau: Tỉ lệ miss: ở bộ đệm Icache là
2%, Dcache là 5% và ở L2 cache là 20%. Phần đệm ghi (writebuffer) cho phép loại bỏ tạm dừng trong
95% trường hợp ghi. 50% tổng số các đường bị thay thế cần phải ghi lại vào bộ nhớ (bit dirty = 1).
Xác định
a. Thời gian truy cập lệnh trung bình
b. Thời gian đọc dữ liệu trung bình
c. Thời gian ghi dữ liệu trung bình
Ex. 17. Xét bộ đệm kết hợp 4 đường, kích thước 32KB của một hệ thống đơn xử lý 32 bit. Kích thước
mỗi khối (đường) là 4 từ. Bộ đệm sử dụng chiến lược ghi sau. Xác định tổng số bít cần để triển
khai bộ đệm nói trên. Số lượng bít này bào gồm số bit để lưu dữ liệu, trường tags, và bít trạng
thái.
Ex. 18. Xét 01 bộ đệm ánh xạ trực tiếp, 01 bộ đệm kết hợp 2 đường và 01 bộ đệm kết hợp toàn phần. Mỗi
bộ đệm có kích thước 4 đường (khối); Kích thước 1 khối là 1 word. Giả sử bộ xử lý truy cập vào
các từ nhớ ở địa chỉ 0, 8, 0, 6, 8 theo thứ tự tương ứng. Giả sử tất cả các khối trong bộ đệm lúc
bắt đầu được đánh dấu không hợp lệ. Bộ đệm kết hợp và bộ đệm kết hợp toàn phần sử dụng chiến
lược thay thế LRU.
a) Xác định bảng cache cho từng loại bộ đệm chỉ ra nội dung bộ đệm sau mỗi lần truy cập bộ nhớ.
b) Xác định cho từng loại bộ đệm số lần cache miss.
Ex. 19. Cho 3 bộ xử lý giống nhau được trang bị 3 bộ đệm khác nhau. Sử dụng 1 chương trình kiểm tra,
các thông số của bộ đệm được xác định như bảng dưới đây:
Cache Tỉ lệ miss lệnh Tỉ lệ miss dữ liệu
Cache 1 : Ánh xạ trực tiếp, khối 4% 8%
(đường) kích thước 1 word.
Cache 2 : Ánh xạ trực tiếp, khối 2% 5%
kích thước 4 words.
Cache 3 : Kết hợp 2 đường; khối 2% 4%
kích thước 4 words
Với chương trình kiểm tra nói trên, một nửa số lệnh cần truy cập bộ nhớ 1 lần. Mỗi lần xảy ra miss, ta cân
(6+kích thước khối) chu kỳ đồng hồ để truy cập bộ nhớ. Bộ xử lý thứ nhất có CPI là 2.0. Chu kỳ đồng hồ
của bộ xử lý 1, 2 và 3 tương ứng là 2ns, 2ns và 2.4ns
Bộ xử lý nào cần nhiều thời gian nhất theo số chu kỳ đồng hồ khi xảy ra cache miss.
Bộ xử lý nào nhanh nhất, bộ xử lý nào chậm nhất.

Chapter 4.: Ex. 1. Caches and Address Translation. Consider A 64 Byte Cache With 8 Byte Blocks, An Associativity of

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chapter 4.: Ex. 1. Caches and Address Translation. Consider A 64 Byte Cache With 8 Byte Blocks, An Associativity of

Uploaded by

Copyright:

Available Formats

Chapter 4.

You might also like