COMPUTER SYSTEMS An Integrated Approach to Architecture and Operating Systems Chapter 9 Memory Hierarchy ©Copyright 2008 Umakishore Ramachandran and William D. Leahy Jr. 9 Memory Hierarchy • Up to now we have treated memory as a black box • Reality… – Processors have cycle times of ~1 ns – Fast DRAM has a cycle time of ~100 ns – We have to bridge this gap for pipelining to be effective! 9 Memory Hierarchy • Clearly fast memory is possible – Register files made of flip-flops operate at processor speeds – Such memory is Static RAM (SRAM) • Tradeoff – SRAM is fast – Economically unfeasible for large memories 9 Memory Hierarchy • SRAM – High power consumption – Large area on die – Long delays if used for large memories – Costly per bit • DRAM – Low power consumption – Suitable for Large Scale Integration (LSI) – Small size – Ideal for large memories – Circa 2007, a single DRAM chip may contain up to 256 Mbits with an access time of 70 ns 9 Memory Hierarchy [Figure: memory technology trends. Source: http://www.storagesearch.com/semico-art1.html] 9.1 The Concept of a Cache • Feasible to have a small amount of fast memory and/or a large amount of slow memory • Want increasing speed as we get closer to the processor and increasing size as we get farther from it – Size advantage of DRAM – Speed advantage of SRAM • CPU looks in cache for data it seeks from main memory • If the data is not there, it retrieves it from main memory • If the cache is able to service "most" CPU requests, then effectively we get the speed advantage of the cache • All addresses in the cache are also in memory [Figure: CPU connected to Cache connected to Main memory] 9.2 Principle of Locality • A program tends to access a relatively small region of memory irrespective of its actual memory footprint in any given interval of time. 
While the region of activity may change over time, such changes are gradual. 9.2 Principle of Locality • Spatial Locality: Tendency for locations close to a location that has been accessed to also be accessed • Temporal Locality: Tendency for a location that has been accessed to be accessed again • Example: for (i = 0; i < 100000; i++) a[i] = b[i]; 9.3 Basic terminologies • Hit: CPU finding contents of memory address in cache • Hit rate (h): probability of a successful lookup in the cache by the CPU • Miss: CPU failing to find what it wants in cache (incurs a trip to deeper levels of the memory hierarchy) • Miss rate (m): probability of missing in the cache, equal to 1-h • Miss penalty: time penalty associated with servicing a miss at any particular level of the memory hierarchy • Effective Memory Access Time (EMAT): effective access time experienced by the CPU when accessing memory – Time to look up the cache to see if the memory location is already there – Upon a cache miss, time to go to deeper levels of the memory hierarchy EMAT = Tc + m * Tm where m is the cache miss rate, Tc the cache access time, and Tm the miss penalty 9.4 Multilevel Memory Hierarchy • Modern processors use multiple levels of caches • As we move away from the processor, caches get larger and slower • EMATi = Ti + mi * EMATi+1 where Ti is the access time for level i and mi is the miss rate for level i [Figure: multilevel memory hierarchy] 9.5 Cache organization • There are three facets to the organization of the cache: 1. Placement: Where do we place in the cache the data read from the memory? 2. Algorithm for lookup: How do we find something that we have placed in the cache? 3. Validity: How do we know if the data in the cache is valid? 
9.6 Direct-mapped cache organization [Figure: a 16-location memory mapped onto an 8-line cache; memory locations 0 and 8 map to cache line 0, locations 1 and 9 to line 1, and so on] 9.6 Direct-mapped cache organization [Figure: a 32-location memory mapped onto an 8-line cache; locations 0, 8, 16, 24 map to line 0, locations 1, 9, 17, 25 to line 1, and so on] 9.6 Direct-mapped cache organization [Figure: the same mapping in binary; 5-bit addresses 00000, 01000, 10000, 11000 all map to 3-bit cache index 000] 9.6.1 Cache Lookup Cache_Index = Memory_Address mod Cache_Size [Figure: each cache line stores a Tag alongside its Contents] 9.6.1 Cache Lookup Cache_Index = Memory_Address mod Cache_Size Cache_Tag = Memory_Address / Cache_Size [Figure: the tag stored with each line identifies which of the memory locations that map to this line is currently cached] 9.6.1 Cache Lookup • Keeping it real! 
• Assume – 4 GB memory: 32-bit address – 256 KB cache – Cache is organized by words • 1 Gword memory • 64 Kword cache, so a 16-bit cache index Memory Address Breakdown: Tag (14 bits) | Index (16 bits) | Byte Offset (2 bits) 00000000000000 0000000000000000 00 Sequence of Operation • Processor emits a 32-bit address to the cache, e.g. Tag 10101010101010 | Index 0000000000000010 | Byte Offset 00 • The index selects one cache line; the tag stored on that line is compared against the tag bits of the address [Figure: cache lines at indices 0000000000000000 through 1111111111111111, each holding a Tag and Contents; the line at index 0000000000000010 holds tag 10101010101010] Thought Question • Assume the computer is just turned on and every location in cache is zero. What can go wrong? • A lookup of an address whose tag happens to be all zeros will "hit" on a line that was never actually loaded from memory Add a Bit! • Each cache entry contains a bit indicating whether the line is valid or not. Initialized to invalid [Figure: each cache line now holds a valid bit V (initialized to 0), a Tag, and Contents] 9.6.2 Fields of a Cache Entry • Is the sequence of fields significant? 
Tag | Index | Byte Offset 00000000000000 0000000000000000 00 • Would this work? Index | Tag | Byte Offset 0000000000000000 00000000000000 00 9.6.3 Hardware for direct mapped cache [Figure: the memory address is split into cache tag and cache index; the index selects a cache line, whose valid bit and stored tag feed a comparator; on a match, the comparator asserts hit and the line's data is sent to the CPU] Question • Does the caching concept described so far exploit 1. Temporal locality 2. Spatial locality 3. Working set 9.7 Repercussion on pipelined processor design [Figure: five-stage pipeline IF, ID/RR, EXEC, MEM, WB with pipeline buffers between stages; the I-Cache sits in IF and the D-Cache in MEM] • Miss on I-Cache: insert bubbles until contents supplied • Miss on D-Cache: insert bubbles into WB; stall IF, ID/RR, EXEC 9.8 Cache read/write algorithms [Figures: data paths for Read Hit, Read Miss, Write-Back, and Write-Through] 9.8.1 Read Access to Cache from CPU • CPU sends index to cache. Cache looks it up and, if a hit, sends data to CPU. If the cache says miss, the CPU sends the request to main memory. All in the same cycle (IF or MEM in pipeline) • Upon sending the address to memory, the CPU sends NOPs down to subsequent stages until the data is read. When the data arrives it goes to the CPU and the cache 9.8.2 Write Access to Cache from CPU • Two choices – Write through policy • Write allocate • No-write allocate – Write back policy 9.8.2.1 Write Through Policy • Each write goes to cache. Tag is set and valid bit is set • Each write also goes to write buffer (see next slide) 9.8.2.1 Write Through Policy [Figure: CPU writes go both to the cache and to a write buffer holding address/data pairs; the write buffer drains into main memory. Caption: Write-Buffer for Write-Through Efficiency] 9.8.2.1 Write Through Policy • Each write goes to cache. 
Tag is set and valid bit is set – This is write allocate – There is also a no-write allocate policy, where the cache is not written to on a write miss • Each write also goes to the write buffer • The write buffer writes the data into main memory – Will stall if the write buffer is full 9.8.2.2 Write back policy • CPU writes data to cache, setting a dirty bit – Note: cache and memory are now inconsistent, but the dirty bit tells us that 9.8.2.2 Write back policy • We write to the cache • We don't bother to update main memory • Is the cache consistent with main memory? • Is this a problem? • Will we ever have to write to main memory? 9.8.2.2 Write back policy [Figure: data path for the write-back policy] 9.8.2.3 Comparison of the Write Policies • Write through – Cache logic simpler and faster – Creates more bus traffic • Write back – Requires dirty bit and extra logic • Multilevel cache processors may use both – L1 write through – L2/L3 write back 9.9 Dealing with cache misses in the processor pipeline • Read miss in the MEM stage: I1: ld r1, a ; r1 <- MEM[a] I2: add r3, r4, r5 ; r3 <- r4 + r5 I3: and r6, r7, r8 ; r6 <- r7 AND r8 I4: add r2, r4, r5 ; r2 <- r4 + r5 I5: add r2, r1, r2 ; r2 <- r1 + r2 • Write miss in the MEM stage: the write buffer alleviates the ill effects of write misses in the MEM stage (write through) 9.9.1 Effect of Memory Stalls Due to Cache Misses on Pipeline Performance • ExecutionTime = NumberInstructionsExecuted * CPIAvg * clock cycle time • ExecutionTime = (NumberInstructionsExecuted * (CPIAvg + MemoryStallsAvg)) * clock cycle time • EffectiveCPI = CPIAvg + MemoryStallsAvg • TotalMemoryStalls = NumberInstructions * MemoryStallsAvg • MemoryStallsAvg = MissesPerInstructionAvg * MissPenaltyAvg 9.9.1 Improving cache performance • Consider a pipelined processor that has an average CPI of 1.8 without accounting for memory stalls. The I-Cache has a hit rate of 95% and the D-Cache has a hit rate of 98%. Assume that memory reference instructions account for 30% of all the instructions executed. 
Out of these, 80% are loads and 20% are stores. On average, the read-miss penalty is 20 cycles and the write-miss penalty is 5 cycles. Compute the effective CPI of the processor accounting for the memory stalls. 9.9.1 Improving cache performance • Cost of instruction misses = I-Cache miss rate * read-miss penalty = (1 - 0.95) * 20 = 1 cycle per instruction • Cost of data read misses = fraction of memory reference instructions in program * fraction of memory reference instructions that are loads * D-Cache miss rate * read-miss penalty = 0.3 * 0.8 * (1 - 0.98) * 20 = 0.096 cycles per instruction • Cost of data write misses = fraction of memory reference instructions in program * fraction of memory reference instructions that are stores * D-Cache miss rate * write-miss penalty = 0.3 * 0.2 * (1 - 0.98) * 5 = 0.006 cycles per instruction • Effective CPI = base CPI + effect of I-Cache on CPI + effect of D-Cache on CPI = 1.8 + 1 + 0.096 + 0.006 = 2.902 9.9.1 Improving cache performance • Bottom line… improving the miss rate and reducing the miss penalty are the keys to improving performance 9.10 Exploiting spatial locality to improve cache performance • So far our cache designs have been operating on data items of the size typically handled by the instruction set, e.g. 32-bit words. This is known as the unit of memory access • But the unit of memory transfer moved by the memory subsystem does not have to be the same size • Typically we make the unit of memory transfer bigger, a multiple of the unit of memory access 9.10 Exploiting spatial locality to improve cache performance • For example [Figure: a cache block of 16 bytes, viewed as 4 words of 4 bytes each] • Our cache blocks are 16 bytes long • How would this affect our earlier example? 
– 4 GB memory: 32-bit address – 256 KB cache 9.10 Exploiting spatial locality to improve cache performance • Block size 16 bytes • 4 GB memory: 32-bit address • 256 KB cache • Total blocks = 256 KB / 16 B = 16K blocks • Need 14 bits to index a block • How many bits for the block offset? 16 bytes (4 words), so 4 bits (2 bits if selecting words) Tag (14 bits) | Block Index (14 bits) | Block Offset (4 bits) 00000000000000 00000000000000 0000 (the 4-bit block offset comprises a 2-bit word offset and a 2-bit byte offset) 9.10 Exploiting spatial locality to improve cache performance Tag (14 bits) | Block Index (14 bits) | Word Offset (2 bits) | Byte Offset (2 bits) 00000000000000 00000000000000 00 00 9.10 Exploiting spatial locality to improve cache performance [Figure: CPU, cache, and memory interactions for handling a write-miss] • Dirty bits may or may not be the same story! 9.10.1 Performance implications of increased blocksize • We would expect that increasing the block size will lower the miss rate • Should we keep on increasing the block size up to the limit of one block per cache? 9.10.1 Performance implications of increased blocksize • No; as the working set changes over time, a bigger block size will cause a loss of efficiency Question • Does the multiword block concept just described exploit 1. Temporal locality 2. Spatial locality 3. Working set 9.11 Flexible placement • Imagine two areas of your current working set map to the same area in the cache • There is plenty of room in the cache… you just got unlucky • Imagine you have a working set which is less than a third of your cache. You switch to a different working set which is also less than a third but maps to the same area in the cache. It happens a third time • The cache is big enough… you just got unlucky! 
9.11 Flexible placement [Figure: three working sets WS 1, WS 2, WS 3 in the memory footprint of a program all map to the same region of the cache, leaving the rest of the cache unused] 9.11 Flexible placement • What is causing the problem is not your luck • It's the direct-mapped design, which allows only one place in the cache for a given address • What we need are some more choices! • Can we imagine designs that would do just that? 9.11.1 Fully associative cache • As before, the cache is broken up into blocks • But now a memory reference may appear in any block • How many bits for the index? • How many for the tag? [Figure: fully associative cache] 9.11.2 Set associative caches [Figure: eight cache lines organized as direct mapped (1-way), two-way set associative, four-way set associative, and fully associative (8-way)] 9.11.2 Set associative caches Assume we have a computer with 16-bit addresses and 64 KB of memory. Further assume cache blocks are 16 bytes long and we have 128 bytes available for cache data. Cache Type | Sets | Ways | Tag (bits) | Index (bits) | Block Offset (bits) Direct Mapped | 8 | 1 | 9 | 3 | 4 Two-way Set Associative | 4 | 2 | 10 | 2 | 4 Four-way Set Associative | 2 | 4 | 11 | 1 | 4 Fully Associative | 1 | 8 | 12 | 0 | 4 9.11.2 Set associative caches [Figure: hardware for a four-way set-associative cache with 1024 sets; a 20-bit tag, 10-bit index, and 2-bit byte offset drive four parallel tag comparators whose outputs form the hit signal and select the data through a 4-to-1 multiplexor] 9.11.3 Extremes of set associativity [Figure: the same eight lines viewed as 8 sets of 1 way (direct mapped), 4 sets of 2 ways, 2 sets of 4 ways, and 1 set of 8 ways (fully associative)] 
9.12 Instruction and Data caches • Would it be better to have two separate caches or just one larger cache with a lower miss rate? • Roughly 30% of instructions are loads/stores, requiring two simultaneous memory accesses • The contention caused by combining the caches would cause more problems than it would solve by lowering the miss rate 9.13 Reducing miss penalty • Reducing the miss penalty is desirable • It cannot be reduced enough just by making the block size larger, due to diminishing returns • Bus cycle time: time for each data transfer between memory and processor • Memory bandwidth: amount of data transferred in each cycle between memory and processor 9.14 Cache replacement policy • An LRU policy is best when deciding which of the multiple "ways" to evict upon a cache miss Cache Type | Bits to record LRU Direct Mapped | N/A 2-Way | 1 bit/line 4-Way | ? bits/line 9.15 Recapping Types of Misses • Compulsory: occur when the program accesses a memory location for the first time. Sometimes called cold misses • Capacity: the cache is full and satisfying the request will require some other line to be evicted • Conflict: the cache is not full but the algorithm sends us to a line that is full • A fully associative cache can only have compulsory and capacity misses • Compulsory > Capacity > Conflict 9.16 Integrating TLB and Caches [Figure: CPU sends a virtual address (VA) to the TLB, which produces a physical address (PA) for the cache, which returns an instruction or data] 9.17 Cache controller • Upon request from processor, looks up cache to determine hit or miss, serving data up to processor in case of hit • Upon miss, initiates bus transaction to read missing block from deeper levels of memory hierarchy • Depending on details of the memory bus, the requested data block may arrive asynchronously with respect to the request. In this case, the cache controller receives the block and places it in the appropriate spot in the cache. 
• Provides the ability for the processor to specify certain regions of memory as "uncachable" 9.18 Virtually Indexed Physically Tagged Cache [Figure: the page offset of the virtual address indexes the cache while the TLB translates the VPN to a PFN in parallel; the PFN is compared with the cache tag to determine hit, alongside data readout] 9.19 Recap of Cache Design Considerations • Principles of spatial and temporal locality • Hit, miss, hit rate, miss rate, cycle time, hit time, miss penalty • Multilevel caches and design considerations thereof • Direct-mapped caches • Cache read/write algorithms • Spatial locality and block size • Fully- and set-associative caches • Considerations for I- and D-caches • Cache replacement policy • Types of misses • TLB and caches • Cache controller • Virtually indexed physically tagged caches 9.20 Main memory design considerations • A detailed analysis of a modern processor's memory system is beyond the scope of this book • However, we present some concepts to illustrate the types of designs one might find in practice 9.20.1 Simple main memory [Figure: CPU and cache connected to a 32-bit-wide main memory via a 32-bit address bus and a 32-bit data bus] 9.20.2 Main memory and bus to match cache block size [Figure: CPU and cache connected to a 128-bit-wide main memory via a 32-bit address bus and a 128-bit data bus] 9.20.3 Interleaved memory [Figure: CPU and cache send a block address to four 32-bit-wide memory banks M0 through M3, each supplying 32 bits of data] 9.21 Elements of a modern main memory systems [Figures] 9.21.1 Page Mode DRAM [Figure] 9.22 Performance implications of memory hierarchy Type of Memory | Typical Size | Approximate latency in CPU clock cycles to read one word of 4 bytes CPU registers | 8 to 32 | Usually immediate access (0-1 clock cycles) L1 Cache | 32 KB to 128 KB | 3 clock cycles L2 Cache | 128 KB to 4 MB | 10 clock cycles Main (Physical) Memory | 256 MB to 4 GB | 100 clock cycles Virtual Memory (on disk) | 1 GB to 1 TB | 1000 to 10,000 clock cycles (not accounting 
for the software overhead of handling page faults) 9.23 Summary • Principle of locality (Section 9.2) – Spatial: access to contiguous memory locations – Temporal: reuse of memory locations already accessed • Cache organization – Direct-mapped: one-to-one mapping (Section 9.6) – Fully associative: one-to-any mapping (Section 9.11.1) – Set associative: one-to-many mapping (Section 9.11.2) • Cache reading/writing (Section 9.8) – Read hit/write hit: memory location being accessed by the CPU is present in the cache – Read miss/write miss: memory location being accessed by the CPU is not present in the cache • Cache write policy (Section 9.8) – Write through: CPU writes to cache and memory – Write back: CPU only writes to cache; memory updated on replacement 9.23 Summary • Cache parameters – Total cache size (S): total data size of cache in bytes – Block size (B): size of contiguous data in one data block – Degree of associativity (p): number of homes a given memory block can reside in within the cache – Number of cache lines (L): S/pB – Cache access time: time in CPU clock cycles to check hit/miss in cache – Unit of CPU access: size of data exchange between CPU and cache – Unit of memory transfer: size of data exchange between cache and memory – Miss penalty: time in CPU clock cycles to handle a cache miss • Memory address interpretation – Index (n): log2 L bits, used to look up a particular cache line – Block offset (b): log2 B bits, used to select a specific byte within a block – Tag (t): a - (n+b) bits, where a is the number of bits in the memory address; used for matching with the tag stored in the cache 9.23 Summary • Cache entry/cache block/cache line/set – Valid bit: signifies the data block is valid – Dirty bit: for write-back, signifies the data block is more up to date than memory – Tag: used for tag matching with the memory address for hit/miss – Data: actual data block • Performance metrics – Hit rate (h): percentage of CPU 
accesses served from the cache – Miss rate (m): 1 - h – Avg. memory stall: misses-per-instruction (avg) * miss-penalty (avg) – Effective memory access time (EMATi) at level i: EMATi = Ti + mi * EMATi+1 – Effective CPI: CPIAvg + memory-stalls (avg) • Types of misses – Compulsory miss: memory location accessed for the first time by the CPU – Conflict miss: miss incurred due to limited associativity even though the cache is not full – Capacity miss: miss incurred when the cache is full • Replacement policy – FIFO: first in, first out – LRU: least recently used • Memory technologies – SRAM: static RAM, with each bit realized using a flip-flop – DRAM: dynamic RAM, with each bit realized using a capacitive charge – DRAM access time: DRAM read access time – DRAM cycle time: DRAM read and refresh time – Bus cycle time: data transfer time between CPU and memory • Main memory – Simulated interleaving using DRAM: using the page mode bits of DRAM 9.24 Memory hierarchy of modern processors – An example • AMD Barcelona chip (circa 2006). Quad-core • Per-core L1 (split I and D) – 2-way set-associative (64 KB for instructions and 64 KB for data) • L2 cache – 16-way set-associative (512 KB combined for instructions and data) • L3 cache shared by all the cores – 32-way set-associative (2 MB shared among all the cores) 9.24 Memory hierarchy of modern processors – An example Questions?