Memory Hierarchy

Morton Order Improves Performance

Kerry Evans, in High Performance Parallelism Pearls, 2015

Improving cache locality by data ordering

Modern memory hierarchies do their best to keep your processors busy by providing the required data at the right time. But what if your data are not structured in a cache-friendly way? That is when we usually resort to some kind of data reordering, blocking, or tiling, and often end up with code that is very difficult to understand (and maintain) and strongly tied to our target hardware. Developers are used to the idea of storing data in multidimensional matrices in either row- or column-major order. But what if your code needs to access data in a different order? What if you are really interested in the four or six nearest neighbors rather than 16 elements in a single row? What if you need to access a row-major matrix in column-major order? Worse yet, what if your data must be traversed in many different orders depending on previous calculations? Assuming all your data does not fit in cache, performance will suffer!

A large body of work exists on efficient mappings of multidimensional data along with algorithms to efficiently utilize these mappings. Many of the index calculations are complex and require special hardware or instructions to justify their usage.

In this chapter, we will examine a method of mapping multidimensional data into a single dimension while maintaining locality using Morton, or Z-curve, ordering, and look at the effect it has on the performance of two common linear algebra problems: matrix transpose and matrix multiply. Next, we will tune our transpose and multiply code to take advantage of the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor cache and vector hardware, add threading, and finally summarize the results.
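To make the mapping concrete, here is a minimal sketch (not the chapter's code; the function names are illustrative) of computing a Morton, or Z-curve, index for a 2D element by interleaving the bits of its row and column coordinates, so that elements that are close in 2D remain close in the 1D ordering:

#include <cstdint>
#include <cstdio>

// Spread the lower 16 bits of v so they occupy the even bit positions
// (classic bit-twiddling step used to build a Morton index).
static inline uint32_t spreadBits16(uint32_t v) {
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Morton index for element (row, col): the bits of row and col are interleaved.
static inline uint32_t mortonEncode2D(uint32_t row, uint32_t col) {
    return (spreadBits16(row) << 1) | spreadBits16(col);
}

int main() {
    // A 4x4 block shows the Z-shaped traversal: (0,0)=0, (0,1)=1, (1,0)=2, (1,1)=3, ...
    for (uint32_t r = 0; r < 4; ++r)
        for (uint32_t c = 0; c < 4; ++c)
            printf("(%u,%u) -> %u\n", r, c, mortonEncode2D(r, c));
    return 0;
}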

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128021187000285

GPU sorting algorithms

M. Gopi, in Advances in GPU Research and Practice, 2017

1.3 CUDA Memory Model

The memory hierarchy used by CUDA is supported by the SM and GPU architectures. Within each processor inside the chip, we noted that there are registers that are accessible per thread and that this space is valid as long as that thread is alive. If a thread uses more registers than are available, the system automatically uses "local memory," which is actually the off-chip memory on the GPU card (device). So, although the data can be transparently fetched from the local memory as though it were in a register, the latency of this data fetch is as high as that of data fetched from the global memory, for the simple reason that "local" memory is just a part of allocated global memory. The "shared" memory is an on-chip memory like registers, but it is allocated per block, and the data in shared memory is valid while the block is being executed by the processor. Global memory, as mentioned earlier, is off-chip, but on the GPU card. This memory is accessible to all threads of all kernels, as well as the host (CPU). Data sharing between threads in different blocks of the same kernel, or even different kernels, can be done using the global memory. The host (CPU) memory, which is the slowest from the GPU perspective, is not directly accessible by CUDA threads; the data has to be explicitly transferred from the host memory to the device memory (global memory). However, CUDA 6 introduces unified memory, by which the data in the host memory can be directly indexed from the GPU side without explicitly transferring data between the host and the device. Finally, communication between different GPUs has to go through the PCI Express bus and through the host memory. This is clearly the most expensive mode of communication. However, the latest NVLink, a power-efficient, high-speed bus between the CPU and the GPU and between multiple GPUs, allows much higher transfer speeds than those achievable by using PCI Express.
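As a minimal, hedged illustration of these memory spaces (not taken from this chapter; the kernel and variable names are invented), the sketch below keeps a per-thread value in a register, stages data in per-block shared memory, reads a __constant__ scale factor, and uses CUDA 6 unified memory (cudaMallocManaged) so the host can touch the same allocation without an explicit copy:

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float kScale = 2.0f;          // constant memory: cached, read-only on the device

__global__ void scaleAndSum(const float* in, float* out, int n) {
    __shared__ float tile[256];            // shared memory: on-chip, allocated per block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (gid < n) ? in[gid] : 0.0f;  // 'v' lives in a register (per thread)
    tile[threadIdx.x] = v * kScale;        // stage the scaled value in shared memory
    __syncthreads();

    // Thread 0 of each block reduces its tile and writes one result to global memory.
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) sum += tile[i];
        out[blockIdx.x] = sum;
    }
}

int main() {
    const int n = 1024, threads = 256, blocks = n / threads;
    float *in, *out;
    // Unified memory: host and device share the same pointers (CUDA 6+).
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    scaleAndSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();               // wait before the host touches managed data

    printf("block 0 partial sum = %f\n", out[0]);   // expect 256 * 2 = 512
    cudaFree(in); cudaFree(out);
    return 0;
}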

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128037386000124

Biological sequence analysis on GPU

N. Moreano, A.C.M.A. de Melo, in Advances in GPU Research and Practice, 2017

3.3 Use of GPU Memory Hierarchy

The GPU memory hierarchy includes several memories with very different features, such as latency, bandwidth, read-only or read-write access, and so on. For instance, in the CUDA architecture, there are registers, shared, global, constant, and texture memories [51].

Memory accesses usually have a great impact on the performance of GPU programs; therefore, most proposals try to optimize the memory layout and usage patterns of the data structures used by the implemented algorithm. The SW, BLAST, Viterbi, and MSV algorithms include several large data structures, such as the dynamic programming score matrices and the sequences. The last two algorithms also include the transition and emission probabilities of the profile HMM.

The memory where each data structure is allocated is chosen based on the size of that structure and the types of access performed on it (few or many accesses, only reads or reads and writes, several threads reading from the same position, etc.). A particular data structure may be reorganized in order to provide a better usage pattern for the memory in which it is allocated.

Several solutions for pairwise or sequence-profile comparison [33–37, 43–45, 49, 50] allocate their data structures in GPU global memory, and some of them rearrange the memory accesses of threads into favorable patterns in order to achieve memory coalescing. The idea is that when the threads in a warp access consecutive global memory locations, the hardware is able to combine all accesses into a single memory request. Such coalesced accesses allow GPU global memory to deliver data at a rate close to its peak bandwidth [51].
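The following sketch (illustrative kernels, not code from the cited works) contrasts a coalesced copy, where consecutive threads of a warp read consecutive global memory addresses, with a strided copy whose warp accesses scatter across many memory transactions:

#include <cuda_runtime.h>

// Coalesced: consecutive threads in a warp read consecutive addresses, so the
// hardware merges the warp's loads into a few wide memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighboring threads touch addresses 'stride' elements apart, so one
// warp's loads scatter over many transactions and effective bandwidth drops.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    copyCoalesced<<<n / 256, 256>>>(in, out, n);
    copyStrided<<<n / 256, 256>>>(in, out, n, 32);  // compare the two with a profiler
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}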

Registers, shared, constant, and texture memories can be highly effective in reducing the number of accesses to global memory. However, these memories have limited capacity, which may also restrict the number of threads executing simultaneously on a GPU streaming multiprocessor.

Accessing GPU shared memory is very fast and highly parallel; therefore, several proposals [29, 31, 33, 34, 36, 38, 39, 43, 45, 48, 50, 52] use this memory to hold the portion of GPU global memory data that is heavily used in an execution phase of the algorithm. In a further step, the algorithm may be reorganized in order to create execution phases that focus heavily on small portions of global memory data.
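A minimal sketch of this staging pattern, assuming a simple 1D three-point stencil rather than any of the cited sequence-comparison kernels: each block copies the slice of global memory it will reuse into shared memory, synchronizes, and then performs all reuse from on-chip storage.

#include <cuda_runtime.h>

#define TILE 256

// Each block stages TILE elements (plus a halo) of heavily reused global data
// in shared memory; after the barrier, all reuse comes from on-chip storage.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];                 // +2 for left/right halo
    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + 1;                         // local index inside the tile

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)              tile[0]        = (g > 0)     ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    if (g < n) out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;
    stencil3<<<n / TILE, TILE>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}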

With appropriate access patterns, accessing constant memory is very fast and parallel. This memory provides short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the same location, and broadcasts the data accessed to all threads in a warp [51]. Constant memory is used by the works in Refs. [28, 29, 31, 33, 36, 45].

Texture memory can also be used to avoid global memory bandwidth limitations and handle memory accesses with certain access patterns. Although texture memory was originally designed for traditional graphics applications, it can also be used quite effectively in some general-purpose GPU computing applications. The solutions in Refs. [27, 29, 32–35, 37, 39, 50] allocate data structures in this memory.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128037386000069

Background

Michael McCool, ... James Reinders, in Structured Parallel Programming, 2012

Memory Hierarchy

Processors also have a memory hierarchy. Closest to the functional units are small, very fast memories known as registers. Functional units operate directly on values stored in registers. Next there are instruction and data caches. Instructions are cached separately from data at this level since their usage patterns are different. These caches are slightly slower than registers but have more space. Additional levels of cache follow, each cache level being slower but more capacious than the one above it, typically by an order of magnitude in both respects. Access to main memory is typically two orders of magnitude slower than access to the last level of cache but is much more capacious, currently up to hundreds of gigabytes on large servers. Currently, large on-chip cache memories are on the order of 10 MB, which is nevertheless a tiny sliver of the total physical memory typically available in a modern machine.

Caches are organized into blocks of storage called cache lines. A cache line is typically much larger than a single word and often (but not always) bigger than a vector. Some currently common sizes for cache lines are 64 bytes and 128 bytes. Compared with a 128-bit SSE register, which is 16 bytes wide, we see that these cache lines are 4 to 8 SSE vector registers wide. When data is read from memory, the cache is populated with an entire cache line. This allows subsequent rapid access to nearby data in the same cache line. Transferring the entire line from external memory makes it possible to amortize the overhead of setting up the transfer. On-chip, wide buses can be used to increase bandwidth between other levels of the memory hierarchy. However, if memory accesses jump around indiscriminately in memory, the extra data read into the cache goes unused. Peak memory access performance is therefore only obtained for coherent memory accesses, since that makes full use of the line transfers. Writes are usually more expensive than reads. This is because writes actually require reading the line in, modifying the written part, and (eventually) writing the line back out.
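A small host-side sketch (illustrative, with an arbitrary 4096 × 4096 matrix) of why coherent accesses matter: summing a row-major matrix in row order uses every byte of each fetched cache line, while summing it in column order touches a new line on almost every access and is typically several times slower.

#include <chrono>
#include <cstdio>
#include <vector>

// Sum a row-major N x N matrix in row order (walks each cache line fully)
// and in column order (touches a new cache line on almost every access).
int main() {
    const int N = 4096;
    std::vector<float> a(static_cast<size_t>(N) * N, 1.0f);

    auto time_sum = [&](bool rowOrder) {
        auto t0 = std::chrono::steady_clock::now();
        double sum = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += rowOrder ? a[static_cast<size_t>(i) * N + j]
                                : a[static_cast<size_t>(j) * N + i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s order: sum=%.0f, %lld ms\n", rowOrder ? "row" : "column", sum,
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    };

    time_sum(true);   // coherent accesses: each cache line is fully used
    time_sum(false);  // strided accesses: most of each fetched line is wasted
    return 0;
}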

There are also two timing-related parameters to consider when discussing memory access: latency and bandwidth. Bandwidth is the amount of data that can be transferred per unit time. Latency is the amount of time that it takes to satisfy a transfer request. Latency can frequently be a crucial factor in performance. Random reads, for example due to "pointer chasing," can leave the processor spending most of its time waiting for data to be returned from off-chip memory. This is a good case where hardware multithreading on a single core can be beneficial, since while one thread is waiting for a memory read another can be doing computation.

Caches maintain copies of data stored elsewhere, typically in main memory. Since caches are smaller than main memory, only a subset of the data in memory (or in the next larger cache) can be stored, and bookkeeping data needs to be maintained to keep track of where the data came from. This is the other reason for using cache lines: to amortize the cost of the bookkeeping. When an address is accessed, the caches need to be searched quickly to determine if that address' data is in cache. A fully associative cache allows any address' data to be stored anywhere in the cache. It is the most flexible kind of cache but expensive in hardware because the entire cache must be searched. To do this rapidly, a large number of parallel hardware comparators are required.

At the other extreme are direct-mapped caches. In a direct-mapped cache, data can be placed in only one location in the cache, typically using a modular function of the address. This is very simple. However, if the program happens to access two different main memory locations that map to the same location in the cache, data will get swapped into that same location repeatedly, defeating the cache. This is called a cache conflict. In a direct-mapped cache, main memory locations with conflicts are located far apart, so a conflict is theoretically rare. However, these locations are typically located at a power-of-two separation, so certain operations (like accessing neighboring rows in a large image whose dimensions are a power of two) can be pathological.

A set-associative cache is a common compromise between full associativity and direct mapping. Each memory address maps to a set of locations in the cache; hence, searching the cache for an address involves searching only the set it maps to, not the entire cache. Pathological cases where many accesses hit the same set can occur, but they are less frequent than for direct-mapped caches. Interestingly, a k-way set-associative cache (one with k elements in each set) can be implemented using k direct-mapped caches plus a small amount of additional external hardware. Ordinarily k is a small number, such as 4 or 8, although it is as large as 16 on some recent Intel processors.
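As a rough sketch of how an address is decomposed under set-associative mapping, assume a hypothetical 32 KiB, 8-way cache with 64-byte lines (the geometry is illustrative, not that of any particular processor); addresses separated by line size × number of sets land in the same set and conflict once more than k of them are live at once:

#include <cstdint>
#include <cstdio>

// Hypothetical cache geometry: 32 KiB, 8-way set associative, 64-byte lines.
// Sets = 32 KiB / (64 B * 8 ways) = 64, so 6 set-index bits follow the 6 offset bits.
constexpr uint64_t LINE_BYTES  = 64;
constexpr uint64_t WAYS        = 8;
constexpr uint64_t CACHE_BYTES = 32 * 1024;
constexpr uint64_t SETS        = CACHE_BYTES / (LINE_BYTES * WAYS);

void decompose(uint64_t addr) {
    uint64_t offset = addr % LINE_BYTES;
    uint64_t set    = (addr / LINE_BYTES) % SETS;
    uint64_t tag    = addr / (LINE_BYTES * SETS);
    std::printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
                (unsigned long long)addr, (unsigned long long)tag,
                (unsigned long long)set, (unsigned long long)offset);
}

int main() {
    // Addresses 4096 bytes apart (LINE_BYTES * SETS) map to the same set;
    // more than WAYS of them in active use causes conflict misses.
    for (int i = 0; i < 4; ++i) decompose(0x10000 + i * LINE_BYTES * SETS);
    return 0;
}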

Caches farther down in the hierarchy are typically also shared among an increasing number of cores. Special hardware keeps the contents of the caches consistent with one another. When cores communicate using "shared memory," they are often actually just communicating through the cache coherence mechanisms. Another pathological case can occur when two cores access data that happens to lie in the same cache line. Normally, cache coherency protocols assign one core, the one that last modifies a cache line, to be the "owner" of that cache line. If two cores write to the same cache line repeatedly, they fight over ownership. Importantly, note that this can happen even if the cores are not writing to the same part of the cache line. This problem is called false sharing and can significantly decrease performance. In particular, as noted in Section 2.3, this leads to a significant difference in the benefit of memory coherence in thread and vector mechanisms for parallelism.
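A minimal host-side sketch of false sharing (the counts and structure names are arbitrary): two threads update two different counters, first packed into one cache line and then padded onto separate 64-byte lines; only the timing differs, not the result.

#include <atomic>
#include <cstdio>
#include <thread>

// Two counters in the same cache line (false sharing) versus two counters padded
// onto separate 64-byte lines. Both are correct; the padded version avoids the
// two cores fighting over ownership of a single line.
struct SharedLine  { std::atomic<long> a{0}, b{0}; };
struct PaddedLines { alignas(64) std::atomic<long> a{0}; alignas(64) std::atomic<long> b{0}; };

template <typename Counters>
void bump(Counters& c) {
    std::thread t1([&] { for (long i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
}

int main() {
    SharedLine  s;  bump(s);   // time this: ownership of one line ping-pongs between cores
    PaddedLines p;  bump(p);   // and this: each core keeps its own line, typically much faster
    std::printf("%ld %ld %ld %ld\n", s.a.load(), s.b.load(), p.a.load(), p.b.load());
    return 0;
}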

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124159938000025

Architecting the last-level cache for GPUs using STT-MRAM nonvolatile memory

M.H. Samavatian, ... H. Sarbazi-Azad, in Advances in GPU Research and Practice, 2017

GPU memory hierarchy

Many researchers have considered the memory hierarchy of GPUs to reduce the access latency of the memory [31], improve memory access coalescing [23], and increase the parallelism among memory controllers [32]. DRAM scheduling [33] and prefetching [34] techniques have also been introduced for GPUs. In addition, a number of recent works concentrate on the GPU cache from various perspectives.

Jia et al. [35] characterized the memory access patterns in GPUs. Based on such information, the suggested algorithm estimates how much bandwidth is required when the cache is on/off and then decides which instruction's data is placed in the cache. In Ref. [36], Tarjan et al. tried to avoid the costly crossing of the interconnection network to access L2 by handling L1 misses using the copies kept in the L1 caches of other SMs. To do so, they used a sharing tracker to identify whether a data block missed in the L1 cache is present in other L1 caches. Rogers et al. [2] presented a method called Cache-Conscious Wavefront scheduling that employs a hardware unit to examine intrawarp locality loss when a warp is selected for fetching. Hence, their mechanism prevents the issuing of new warps that might result in thrashing previously scheduled warps' data in the cache. Similarly, in Refs. [37–39], the same problem is tackled by modifying the cache indexing function, warp throttling, and detecting critical warps in order to maintain useful data in the cache. Gebhart et al. [40] designed a unified memory that can be configured as register file, cache, or shared memory according to the requirements of the running application. Moreover, other works, such as [41, 42], tried to reduce the power consumption of the GPU by observing and considering the GPU memory hierarchy from the main memory to the register file.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128037386000203

Hardware and Application Profiling Tools

Tomislav Janjusic, Krishna Kavi, in Advances in Computers, 2014

3.5.6 Hornet

Hornet is a parallel, highly configurable, cycle-level multicore simulator [13] intended to simulate many-core NoC systems.

Hornet supports a variety of memory hierarchies interconnected through routing and VC allocation algorithms. It also has the ability to model a system's power and thermal aspects.

Hornet is capable of running in two modes: a network-only mode and a full-multicore mode running a built-in MIPS core simulator. The goal of Hornet is the capability to simulate 1000-core systems.

Hornet's basic router is modeled using ingress and egress buffers. Packets arrive in flits and compete for the crossbar. When a port has been assigned, they pass through and exit using the egress buffer. Interconnect geometry can be configured with pairwise connections to form rings, multilayer meshes, and tori. Hornet supports oblivious, static, and adaptive routing. Oblivious and adaptive routing are configurable using routing tables. Similarly, VC allocation (VCA) is handled using tables. Hornet also allows for internode bidirectional linking. Links can be changed at every cycle depending on dynamic traffic needs.

To accurately represent power dissipation and thermal properties, Hornet uses dynamic power models based on Orion [5], combined with a leakage power model and a thermal model using HotSpot [70].

To speed up simulation, Hornet implements a multithreaded strategy to offload simulation work to other cores. This is achieved by assigning each simulated processor core to a single thread. Cycle accuracy is maintained globally for simulation correctness.

Hornet simulates tiles where each tile is an NoC router connected to other tiles via point-to-point links.

The traffic generators are either trace-driven injectors or cycle-level MIPS simulators.

A simple trace injector reads a trace file that contains traces annotated with timestamps, packet sizes, and other information describing each packet. The MIPS simulator can accept cross-compiled MIPS binaries. The MIPS core can be configured with varying memory levels backed by a coherency protocol.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124202320000039

HSA Simulators

Y.-C. Chung, ... R. Ubal, in Heterogeneous System Architecture, 2016

9.1.5.2 Memory systems

Multi2Sim-HSA strictly follows the memory hierarchy defined in the HSA specification [6], but also creates its own memory organization, as shown in Figure 9.3. There is a single memory object that requests memory from the host environment and manages the guest memory space. The memory object is managed as a flat memory address space. Memory allocation and deallocation are delegated to a memory manager, which runs a best-fit algorithm. The memory manager is also responsible for allocating the memory for global variables.

Figure 9.3. The segment management system.

The segmented memory space requires an address translation between the inner-segment address and the flat address. Therefore, the segment memory manager is created to handle operations involving non-global segments. When a memory segment is requested, the corresponding memory segment manager is invoked and the requested amount of memory is allocated via the global memory manager. For instance, when a kernel is launched, the AQL packet explicitly requests an amount of memory required to hold the group segment and the private segment. Whenever a work-group or a work-item is initiated, the memory is allocated from the global segment and is marked to denote the specific type of segment. After completing these steps for each variable, all variables that are declared in the segment have an address defined relative to the beginning of the segment. When accessing those variables, the segment manager first translates the inner-segment address to a flat address, such that a work-item can then issue loads or stores to memory directly.
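The sketch below is not Multi2Sim-HSA code; it only illustrates, under simplified assumptions (a bump allocator and a fixed flat base), the idea of a segment manager that hands out inner-segment addresses and translates them to flat guest addresses by adding the segment base:

#include <cstdint>
#include <cstdio>
#include <stdexcept>

// Illustrative only: a segment carved out of a flat guest address space.
// Inner-segment addresses are offsets; translation just adds the segment base.
class SegmentManager {
public:
    SegmentManager(uint64_t base, uint64_t size) : base_(base), size_(size) {}

    // Bump-allocate 'bytes' inside the segment and return an inner-segment address.
    uint64_t alloc(uint64_t bytes) {
        if (next_ + bytes > size_) throw std::runtime_error("segment exhausted");
        uint64_t inner = next_;
        next_ += bytes;
        return inner;
    }

    // Translate an inner-segment address to a flat guest address.
    uint64_t toFlat(uint64_t inner) const { return base_ + inner; }

private:
    uint64_t base_, size_, next_ = 0;
};

int main() {
    SegmentManager group(0x200000, 64 * 1024);   // group segment placed at a flat base
    uint64_t v = group.alloc(256);               // variable declared in the segment
    std::printf("inner 0x%llx -> flat 0x%llx\n",
                (unsigned long long)v, (unsigned long long)group.toFlat(v));
    return 0;
}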

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128003862000080

Computer Data Processing Hardware Architecture

Paul J. Fortier, Howard E. Michel, in Computer Systems Performance Evaluation and Prediction, 2003

Memory management

An operating system's storage manager manages the memory hierarchy of the computer. The operating system in particular must coordinate the movement of data into and out of the computer's main memory, as well as the maintenance of the memory's free space. To perform these functions an operating system typically uses a scheme where the main memory is broken up into fixed-size pieces called pages or variable-sized pieces called segments. The operating system then manages the movement of pages or segments in memory based on the policies in use. The memory manager must allocate space for processes upon initiation, deallocate space when a process completes, and periodically clean up the memory space when the memory becomes fragmented due to allocation and deallocation of uneven partitions. The memory allocation problem is directly tied to the memory map. (See Figure 2.23.)

Figure 2.23. Memory map.

The memory map indicates which areas in memory are allocated to a process and which areas are free to be allocated to a new process. This memory map can be managed in a variety of ways to help the allocation manager. The list of free areas can be organized into a free list, where the blocks are structured as a tree of increasing block size, or as a heap, with the largest block always toward the top of the heap. Memory allocation then becomes a matter of selecting a block of appropriate size based on the selection policy in place. One policy is first fit, where the first block encountered that fits this process is selected. Another policy is best fit, where the blocks are scanned until one is found that best fits the size of the process to be loaded into memory. There are numerous other schemes, but they are beyond the scope of this chapter.
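A small illustrative sketch of these two selection policies over a free list (the block sizes and addresses are made up; real allocators track much more state):

#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative free-list search only: pick a free block for a request using
// either the first-fit or the best-fit policy.
struct FreeBlock { uint64_t addr; uint64_t size; };

int firstFit(const std::vector<FreeBlock>& freeList, uint64_t request) {
    for (size_t i = 0; i < freeList.size(); ++i)
        if (freeList[i].size >= request) return (int)i;   // first block that fits
    return -1;
}

int bestFit(const std::vector<FreeBlock>& freeList, uint64_t request) {
    int best = -1;
    for (size_t i = 0; i < freeList.size(); ++i)
        if (freeList[i].size >= request &&
            (best < 0 || freeList[i].size < freeList[best].size))
            best = (int)i;                                  // smallest block that still fits
    return best;
}

int main() {
    std::vector<FreeBlock> freeList = {{0x1000, 96}, {0x4000, 32}, {0x9000, 48}};
    std::printf("first fit for 40 bytes: block %d\n", firstFit(freeList, 40));  // 0 (96-byte block)
    std::printf("best fit  for 40 bytes: block %d\n", bestFit(freeList, 40));   // 2 (48-byte block)
    return 0;
}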

Hand in hand with allocation is deallocation of memory. As pages or segments are released by processes leaving the running state, they must be removed from the allocated list and returned to the free list of free pages or segments. The deallocated segments are restored to the list in a block equal to the size of the allocated process that held them. These free segments are then placed into the free list in a location appropriate to the size of the free segments being restored.

However, not all replacements are done in such a tidy manner on process execution boundaries. Most are performed on a full or nearly full main memory. In order to still allow processes to move forward in their execution, we must reorder the active pages by some policy that will allow us to remove some active pages and let them be reallocated to other more demanding or starved-out processes. The most common page replacement algorithm and deallocation policy is based on the least recently used (LRU) principle. This principle indicates that the least recently used page is most likely to stay that way for the foreseeable future and, therefore, is a prime candidate to be removed and replaced by a waiting process. Other schemes used for page replacement include most recently used, least frequently used, and random removal. All of these policies have been examined in detail in the past and have merits for certain process activities, although for database systems some of these are downright disastrous. The database process acts in a way that is not typical of most applications and, therefore, will not react the same to a given policy.

Another job for memory management is to maintain a map of free memory areas and to periodically clean up memory to free up larger contiguous chunks to make allocation easier. This process is called garbage collection and reallocation. The allocation and deallocation policies discussed previously result in memory becoming periodically fragmented. When memory is fragmented into very fine fragments, it may become impossible to find contiguous blocks of free memory to allocate to incoming processes (Figure 2.24). To rectify this problem, memory management services periodically check the map of memory to determine if coalescing the loose fragmented free blocks into larger segments will result in significant increases in free contiguous blocks of sufficient size.

Figure 2.24. Fragmented memory.

One technique scans all marked free blocks and coalesces adjacent holes into marked, larger free segments. These are then added to the free list, with the coalesced disjoint holes removed from the free list (Figure 2.25).

Figure 2.25. Marked free blocks in memory.
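A brief sketch of such a coalescing pass (illustrative only): sort the free list by address and merge blocks whose ranges are adjacent, so disjoint holes become larger free segments.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative coalescing pass: sort the free list by address and merge blocks
// whose address ranges are adjacent, producing larger contiguous free segments.
struct FreeBlock { uint64_t addr; uint64_t size; };

std::vector<FreeBlock> coalesce(std::vector<FreeBlock> freeList) {
    std::sort(freeList.begin(), freeList.end(),
              [](const FreeBlock& a, const FreeBlock& b) { return a.addr < b.addr; });
    std::vector<FreeBlock> merged;
    for (const FreeBlock& b : freeList) {
        if (!merged.empty() && merged.back().addr + merged.back().size == b.addr)
            merged.back().size += b.size;    // adjacent hole: grow the previous block
        else
            merged.push_back(b);             // allocated space in between: keep separate
    }
    return merged;
}

int main() {
    std::vector<FreeBlock> holes = {{0x3000, 0x1000}, {0x1000, 0x1000}, {0x2000, 0x1000}, {0x8000, 0x800}};
    for (const FreeBlock& b : coalesce(holes))
        std::printf("free block at 0x%llx, %llu bytes\n",
                    (unsigned long long)b.addr, (unsigned long long)b.size);
    // Result: one 0x3000-byte block at 0x1000 and one 0x800-byte block at 0x8000.
    return 0;
}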

This in itself may not result in sufficient free space of acceptable size. To get larger free blocks it may be necessary to periodically scan the entire memory and reallocate where processes are stored to clean up the memory allocation map into two areas: one a contiguous area consisting of all allocated memory blocks, and the other all free memory blocks. The process by which all allocated blocks are moved and reallocated to one end of memory is called compaction, and the process of reallocating all of the newly freed space into the free list is referred to as garbage collection (Figure 2.26). As with a garbage truck, compaction strives to compress the contents into one end of the container, freeing up the rest of the space for more garbage. The process requires the reallocation and movement of processes and their addresses (all references must be changed in the PCB and physical load segments).

Figure 2.26. Memory after garbage collection.

Beyond these basic memory management schemes, some operating systems, along with supporting hardware and software, support both paging and segmentation. In this scheme the memory is decomposed into segments. A segment has some number of pages, and a page is of a fixed size. The segments are mapped into and out of memory as pages were in the first scheme (see Figure 2.27).

Figure 2.27. Memory with both paging and segmentation.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781555582609500023

Pitfalls and Issues of Manycore Programming

Ami Marowka, in Advances in Computers, 2010

2.2 The Cache Coherence Problem

Maintaining the coherence property of a multilevel cache-memory hierarchy (Figs. 5 and 6) incurs another serious performance problem known as the cache coherence problem. An inconsistent memory view of a shared piece of data might occur when multiple caches are storing copies of that data item. In such a case, the memory subsystem must act to ensure a coherent memory view for all cores. Figure 7 illustrates the coherency problem. Two different cores, A and B, read the value of a shared variable V (Fig. 7A and B). Then, core B writes a new value to variable V (Fig. 7C). At this point, core A has an obsolete value of variable V (without knowing that the value is not valid anymore). Therefore, immediate action is required to invalidate the copy of variable V stored in the cache of core A. Otherwise, the inconsistent view of the main memory will lead to unexpected results.

Fig. 7. Illustration of the cache-coherence problem.

Multicore processors use coherency control, called Snoopy Coherency Protocols, for the writing of shared data. Two classes of protocols are most common: write-invalidate and write-update. The write-invalidate protocol invalidates all cached copies of a variable before writing its new value. The write-update protocol broadcasts the newly cached copy to update all other cached copies with the same value. The underlying mechanisms that implement these protocols constantly monitor the caching events across the bus between the cores and the memory subsystem and act accordingly, hence the term snoopy protocols.

The memory-subsystem architectures of multicore processors are designed as bus-based systems that support Uniform Memory Access (UMA). However, it is likely that the memory-subsystem architectures of manycore processors will be network-based systems that support Nonuniform Memory Access (NUMA). Since snoopy coherency protocols are bus-based protocols, they are not appropriate for scalable NUMA-based architectures. For applying coherency control in NUMA systems, Directory-Based Coherency Protocols are usually used. The directory approach is to use a single directory to record sharing information about every cached data item. The protocol constantly keeps track of which cores have copies of any cached data item, so that when a write operation is invoked on a particular data item, its copies can be invalidated.

Coherency protocols incur substantial overhead that inhibits scaling up of multicore systems. A deeper on-chip multilevel cache-memory hierarchy makes the cache coherence problem harder. Table II lists the latency costs that an application can suffer in loading data on an Intel Core Duo system [48]. As can be observed, a deeper cache-memory hierarchy leads to higher latencies. Now it is clearer what the obstacle is, and where it lies, that inhibits better speedup in the example of Section 1.4 (Fig. 3).

Table II. Latency Costs of Cache Memories on an Intel Core Duo System

Case                  Data location   Latency (cycles/ns)
L1 to L1              L1 cache        14 core cycles + 5.5 bus cycles
Through L2 cache      L2 cache        14 core cycles
Through main memory   Memory          14 core cycles + 5.5 bus cycles + 40–80 ns

Cache-memory coherency management is a programmer-transparent obstacle that can be resolved by eliminating data sharing. Unfortunately, eliminating data sharing might lead to poor performance and nonscalable applications. Therefore, any solution must be carefully checked from all angles before final implementation. The example in Section 2.4 demonstrates the cost of cache-memory coherency management.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S0065245810790021

Memory Systems

Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022

8.4 Virtual Memory

Most modern computer systems use a hard drive made of magnetic or solid-state storage as the lowest level in the memory hierarchy (see Figure 8.4). Compared with the ideal large, fast, cheap memory, a hard drive is large and cheap but terribly slow. It provides a much larger capacity than is possible with a cost-effective main memory (DRAM). However, if a significant fraction of memory accesses involve the hard drive, performance is dismal. You may have encountered this on a PC when running too many programs at once.

Figure 8.19 shows a hard drive made of magnetic storage, also called a hard disk, with the lid of its case removed. As the name implies, the hard disk contains one or more rigid disks or platters, each of which has a read/write head on the end of a long triangular arm. The head moves to the correct location on the disk and reads or writes data magnetically as the disk rotates beneath it. The head takes several milliseconds to seek the correct location on the disk, which is fast from a human perspective but millions of times slower than the processor. Hard disk drives are increasingly being replaced by solid-state drives because reading is orders of magnitude faster (see Figure 8.4) and they are not as susceptible to mechanical failures.

Figure 8.19. Hard drive

The objective of adding a hard drive to the memory hierarchy is to inexpensively give the illusion of a very large memory while still providing the speed of faster memory for most accesses. A computer with only 16 GiB of DRAM, for example, could effectively provide 128 GiB of memory using the hard drive. This larger 128 GiB memory is called virtual memory, and the smaller 16 GiB main memory is called physical memory. We will use the term physical memory to refer to main memory throughout this section.

A computer with 32-bit addresses can access a maximum of 2^32 bytes = 4 GiB of memory. This is one of the motivations for moving to 64-bit computers, which can access far more memory.

Programs can access data anywhere in virtual memory, so they must use virtual addresses that specify the location in virtual memory. The physical memory holds a subset of the most recently accessed virtual memory. In this way, physical memory acts as a cache for virtual memory. Thus, most accesses hit in physical memory at the speed of DRAM, yet the program enjoys the capacity of the larger virtual memory.

Virtual memory systems use different terminologies for the same caching principles discussed in Section 8.3. Table 8.3 summarizes the analogous terms. Virtual memory is divided into virtual pages, typically 4 KiB in size. Physical memory is likewise divided into physical pages of the same size. A virtual page may be located in physical memory (DRAM) or on the hard drive. For example, Figure 8.20 shows a virtual memory that is larger than physical memory. The rectangles indicate pages. Some virtual pages are present in physical memory, and some are located on the hard drive. The process of determining the physical address from the virtual address is called address translation. If the processor attempts to access a virtual address that is not in physical memory, a page fault occurs, and the operating system (OS) loads the page from the hard drive into physical memory.

Table 8.3. Analogous cache and virtual memory terms

Cache          Virtual Memory
Block          Page
Block size     Page size
Block offset   Page offset
Miss           Page fault
Tag            Virtual page number

Figure 8.20. Virtual and physical pages

To avoid page faults caused by conflicts, any virtual page can map to any physical page. In other words, physical memory behaves as a fully associative cache for virtual memory. In a conventional fully associative cache, every cache block has a comparator that checks the most significant address bits against a tag to determine whether the request hits in the block. In an analogous virtual memory system, each physical page would need a comparator to check the most significant virtual address bits against a tag to determine whether the virtual page maps to that physical page.

A realistic virtual memory system has so many physical pages that providing a comparator for each page would be excessively expensive. Instead, the virtual memory system uses a page table to perform address translation. A page table contains an entry for each virtual page, indicating its location in physical memory or that it is on the hard drive. Each load or store instruction requires a page table access followed by a physical memory access. The page table access translates the virtual address used by the program to a physical address. The physical address is then used to actually read or write the data.

The page table is usually so large that it is located in physical memory. Hence, each load or store involves two physical memory accesses: a page table access and a data access. To speed up address translation, a translation lookaside buffer (TLB) caches the most commonly used page table entries.

The remainder of this section elaborates on address translation, page tables, and TLBs.

8.4.1 Address Translation

In a system with virtual memory, programs use virtual addresses so that they can access a large memory. The computer must translate these virtual addresses to either find the address in physical memory or take a page fault and fetch the data from the hard drive.

Recall that virtual memory and physical memory are divided into pages. The most significant bits of the virtual or physical address specify the virtual or physical page number. The least significant bits specify the word within the page and are called the page offset.

Figure 8.21 illustrates the page organization of a virtual memory system with 2 GiB of virtual memory and 128 MiB of physical memory divided into 4 KiB pages. MIPS accommodates 32-bit addresses. With a 2 GiB = 2^31-byte virtual memory, only the least significant 31 virtual address bits are used; the 32nd bit is always 0. Similarly, with a 128 MiB = 2^27-byte physical memory, only the least significant 27 physical address bits are used; the upper 5 bits are always 0.

Figure 8.21. Physical and virtual pages

Because the page size is 4 KiB = 2^12 bytes, there are 2^31/2^12 = 2^19 virtual pages and 2^27/2^12 = 2^15 physical pages. Thus, the virtual and physical page numbers are 19 and 15 bits, respectively. Physical memory can hold up to only 1/16th of the virtual pages at any given time. The rest of the virtual pages are kept on the hard drive.

Figure 8.21 shows virtual page 5 mapping to physical page 1, virtual page 0x7FFFC mapping to physical page 0x7FFE, and so on. For example, virtual address 0x53F8 (an offset of 0x3F8 within virtual page 5) maps to physical address 0x13F8 (an offset of 0x3F8 within physical page 1). The least significant 12 bits of the virtual and physical addresses are the same (0x3F8) and specify the page offset within the virtual and physical pages. Only the page number needs to be translated to obtain the physical address from the virtual address.
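A tiny sketch of this split, using the geometry of Figure 8.21 (4 KiB pages, so a 12-bit offset and a 19-bit VPN); the single hard-coded mapping stands in for the page table lookup described in the next sections:

#include <cstdint>
#include <cstdio>

// Geometry from the example: 4 KiB pages, 31-bit virtual and 27-bit physical addresses.
constexpr uint32_t PAGE_BITS = 12;                       // 4 KiB = 2^12 bytes
constexpr uint32_t OFFSET_MASK = (1u << PAGE_BITS) - 1;

int main() {
    uint32_t va = 0x53F8;                                // virtual address from the text
    uint32_t vpn    = va >> PAGE_BITS;                   // 19-bit virtual page number -> 0x5
    uint32_t offset = va & OFFSET_MASK;                  // 12-bit page offset         -> 0x3F8

    uint32_t ppn = (vpn == 0x5) ? 0x1 : 0;               // stand-in for the real translation
    uint32_t pa  = (ppn << PAGE_BITS) | offset;          // -> 0x13F8

    std::printf("VA 0x%X: VPN 0x%X, offset 0x%X -> PA 0x%X\n", va, vpn, offset, pa);
    return 0;
}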

Figure 8.22 illustrates the translation of a virtual address to a physical address. The least significant 12 bits indicate the page offset and require no translation. The upper 19 bits of the virtual address specify the virtual page number (VPN) and are translated to a 15-bit physical page number (PPN). The next two sections describe how page tables and TLBs are used to perform this address translation.

Figure 8.22. Translation from virtual address to physical address

Example 8.13

Virtual Address to Physical Address Translation

Find the physical address of virtual address 0x247C using the virtual memory system shown in Figure 8.21.

Solution

The 12-bit page offset (0x47C) requires no translation. The remaining 19 bits of the virtual address give the virtual page number, so virtual address 0x247C is found in virtual page 0x2. In Figure 8.21, virtual page 0x2 maps to physical page 0x7FFF. Thus, virtual address 0x247C maps to physical address 0x7FFF47C.

8.4.2 The Page Table

The processor uses a page table to translate virtual addresses to physical addresses. The page table contains an entry for each virtual page. This entry contains a physical page number and a valid bit. If the valid bit is 1, the virtual page maps to the physical page specified in the entry. Otherwise, the virtual page is found on the hard drive.

Because the page table is so large, it is stored in physical memory. Let us assume for now that it is stored as a contiguous array, as shown in Figure 8.23. This page table contains the mapping of the memory system of Figure 8.21. The page table is indexed with the virtual page number (VPN). For example, entry 5 specifies that virtual page 5 maps to physical page 1. Entry 6 is invalid (V = 0), so virtual page 6 is located on the hard drive.

Figure 8.23. The page table for Figure 8.21

Example 8.14

Using the Page Table to Perform Address Translation

Find the physical address of virtual address 0x247C using the page table shown in Figure 8.23.

Solution

Figure 8.24 shows the virtual address to physical address translation for virtual address 0x247C. The 12-bit page offset requires no translation. The remaining 19 bits of the virtual address are the virtual page number, 0x2, and give the index into the page table. The page table maps virtual page 0x2 to physical page 0x7FFF. So, virtual address 0x247C maps to physical address 0x7FFF47C. The least significant 12 bits are the same in both the physical and the virtual address.

Figure 8.24. Address translation using the page table

The page table can be stored anywhere in physical memory, at the discretion of the OS. The processor typically uses a dedicated register, called the page table register, to store the base address of the page table in physical memory.

To perform a load or store, the processor must first translate the virtual address to a physical address and then access the data at that physical address. The processor extracts the virtual page number from the virtual address and adds it to the page table register to find the physical address of the page table entry. The processor then reads this page table entry from physical memory to obtain the physical page number. If the entry is valid, it merges this physical page number with the page offset to create the physical address. Finally, it reads or writes data at this physical address. Because the page table is stored in physical memory, each load or store involves two physical memory accesses.
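A software sketch of this walk (a simplified model, not the hardware): the VPN indexes a flat page table, an invalid entry signals a page fault, and a valid entry's PPN is merged with the page offset:

#include <cstdint>
#include <cstdio>

// One entry of a flat page table, as described above: a valid bit and a PPN.
struct PageTableEntry { bool valid; uint32_t ppn; };

constexpr uint32_t PAGE_BITS = 12;

// Software model of the walk: index the table with the VPN, check V,
// and merge the PPN with the page offset. Returns false on a page fault.
bool translate(const PageTableEntry* pageTable, uint32_t va, uint32_t* pa) {
    uint32_t vpn    = va >> PAGE_BITS;
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);
    PageTableEntry pte = pageTable[vpn];        // one physical memory access
    if (!pte.valid) return false;               // page fault: OS fetches the page from disk
    *pa = (pte.ppn << PAGE_BITS) | offset;
    return true;
}

int main() {
    static PageTableEntry pageTable[1u << 19] = {};   // 2^19 virtual pages (Figure 8.21)
    pageTable[0x2] = {true, 0x7FFF};                  // mirrors the worked example
    uint32_t pa;
    if (translate(pageTable, 0x247C, &pa))
        std::printf("VA 0x247C -> PA 0x%X\n", pa);    // expect 0x7FFF47C
    return 0;
}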

8.4.3 The Translation Lookaside Buffer

Virtual memory would have a severe performance impact if it required a page table read on every load or store, doubling the delay of loads and stores. Fortunately, page table accesses have great temporal locality. The temporal and spatial locality of data accesses and the large page size mean that many consecutive loads or stores are likely to reference the same page. Therefore, if the processor remembers the last page table entry that it read, it can probably reuse this translation without rereading the page table. In general, the processor can keep the last several page table entries in a small cache called a translation lookaside buffer (TLB). The processor "looks aside" to find the translation in the TLB before having to access the page table in physical memory. In real programs, the vast majority of accesses hit in the TLB, avoiding the time-consuming page table reads from physical memory.

A TLB is organized as a fully associative cache and typically holds 16 to 512 entries. Each TLB entry holds a virtual page number and its corresponding physical page number. The TLB is accessed using the virtual page number. If the TLB hits, it returns the corresponding physical page number. Otherwise, the processor must read the page table in physical memory. The TLB is designed to be small enough that it can be accessed in less than one cycle. Even so, TLBs typically have a hit rate of greater than 99%. The TLB decreases the number of memory accesses required for most load or store instructions from two to one.
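A minimal model of a fully associative TLB lookup (the sequential loop stands in for the parallel comparators in hardware; the second entry's PPN is a placeholder):

#include <cstdint>
#include <cstdio>

constexpr uint32_t PAGE_BITS = 12;

// A tiny fully associative TLB: every entry is compared against the incoming VPN.
struct TlbEntry { bool valid; uint32_t vpn; uint32_t ppn; };

bool tlbLookup(const TlbEntry* tlb, int entries, uint32_t vpn, uint32_t* ppn) {
    for (int i = 0; i < entries; ++i)           // in hardware these compares happen in parallel
        if (tlb[i].valid && tlb[i].vpn == vpn) { *ppn = tlb[i].ppn; return true; }
    return false;                               // TLB miss: fall back to the page table walk
}

int main() {
    TlbEntry tlb[2] = {{true, 0x2, 0x7FFF}, {true, 0x7FFFD, 0x0}};  // as in Example 8.15 (second PPN is a placeholder)
    uint32_t va = 0x247C, ppn;
    if (tlbLookup(tlb, 2, va >> PAGE_BITS, &ppn))
        std::printf("TLB hit: PA 0x%X\n", (ppn << PAGE_BITS) | (va & 0xFFF));  // 0x7FFF47C
    else
        std::printf("TLB miss: walk the page table\n");
    return 0;
}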

Example 8.15

Using the TLB to Perform Address Translation

Consider the virtual memory system of Figure 8.21. Use a two-entry TLB, or explain why a page table access is necessary, to translate virtual addresses 0x247C and 0x5FB0 to physical addresses. Suppose that the TLB currently holds valid translations of virtual pages 0x2 and 0x7FFFD.

Solution

Figure 8.25 shows the two-entry TLB with the request for virtual address 0x247C. The TLB receives the virtual page number of the incoming address, 0x2, and compares it to the virtual page number of each entry. Entry 0 matches and is valid, so the request hits. The translated physical address is the physical page number of the matching entry, 0x7FFF, concatenated with the page offset of the virtual address. As always, the page offset requires no translation.

Figure 8.25. Address translation using a two-entry TLB

The request for virtual address 0x5FB0 misses in the TLB. So, the request is forwarded to the page table for translation.

8.4.4 Memory Protection

So far, this section has focused on using virtual memory to provide a fast, inexpensive, large memory. An equally important reason to use virtual memory is to provide protection between concurrently running programs.

As you probably know, modern computers typically run several programs or processes at the same time. All of the programs are simultaneously present in physical memory. In a well-designed computer system, the programs should be protected from each other so that no program can crash or hijack another program. Specifically, no program should be able to access another program's memory without permission. This is called memory protection.

Virtual memory systems provide memory protection by giving each program its own virtual address space. Each program can use as much memory as it wants in that virtual address space, but only a portion of the virtual address space is in physical memory at any given time. Each program can use its entire virtual address space without having to worry about where other programs are physically located. However, a program can access only those physical pages that are mapped in its page table. In this way, a program cannot accidentally or maliciously access another program's physical pages, because they are not mapped in its page table. In some cases, multiple programs access common instructions or data. The OS adds control bits to each page table entry to determine which programs, if any, can write to the shared physical pages.

8.4.5 Replacement Policies*

Virtual memory systems use write-back and an approximate least recently used (LRU) replacement policy. A write-through policy, where each write to physical memory initiates a write to the hard drive, would be impractical. Store instructions would operate at the speed of the hard drive instead of the speed of the processor (milliseconds instead of nanoseconds). Under the write-back policy, the physical page is written back to the hard drive only when it is evicted from physical memory. Writing the physical page back to the hard drive and reloading it with a different virtual page is called paging, and the hard drive in a virtual memory system is sometimes called swap space. The processor pages out one of the least recently used physical pages when a page fault occurs, then replaces that page with the missing virtual page. To support these replacement policies, each page table entry contains two additional status bits: a dirty bit D and a use bit U.

The dirty bit is 1 if any store instructions have changed the physical page since it was read from the hard drive. When a physical page is paged out, it needs to be written back to the hard drive only if its dirty bit is 1; otherwise, the hard drive already holds an exact copy of the page.

The use bit is 1 if the physical page has been accessed recently. As in a cache system, exact LRU replacement would be impractically complicated. Instead, the OS approximates LRU replacement by periodically resetting all of the use bits in the page table. When a page is accessed, its use bit is set to 1. Upon a page fault, the OS finds a page with U = 0 to page out of physical memory. Thus, it does not necessarily replace the least recently used page, but rather one of the least recently used pages.

8.4.6 Multilevel Page Tables*

Page tables can occupy a large amount of physical memory. For example, the page table from the previous sections for a 2 GiB virtual memory with 4 KiB pages would need 2^19 entries. If each entry is 4 bytes, the page table is 2^19 × 2^2 bytes = 2^21 bytes = 2 MiB.

To conserve physical memory, page tables can be broken up into multiple (usually two) levels. The first-level page table is always kept in physical memory. It indicates where small second-level page tables are stored in virtual memory. The second-level page tables each contain the actual translations for a range of virtual pages. If a particular range of translations is not actively used, the corresponding second-level page table can be paged out to the hard drive so it does not waste physical memory.

In a two-level page table, the virtual page number is split into two parts: the page table number and the page table offset, as shown in Figure 8.26. The page table number indexes the first-level page table, which must reside in physical memory. The first-level page table entry gives the base address of the second-level page table or indicates that it must be fetched from the hard drive when V is 0. The page table offset indexes the second-level page table. The remaining 12 bits of the virtual address are the page offset, as before, for a page size of 2^12 = 4 KiB.

Figure 8.26. Hierarchical page tables

In Figure 8.26, the 19-bit virtual page number is broken into 9 and 10 bits to indicate the page table number and the page table offset, respectively. Thus, the first-level page table has 2^9 = 512 entries. Each of these 512 second-level page tables has 2^10 = 1 Ki entries. If each of the first- and second-level page table entries is 32 bits (4 bytes) and only two second-level page tables are present in physical memory at once, the hierarchical page table uses only (512 × 4 bytes) + 2 × (1 Ki × 4 bytes) = 10 KiB of physical memory. The two-level page table requires a fraction of the physical memory needed to store the entire page table (2 MiB). The drawback of a two-level page table is that it adds yet another memory access for translation when the TLB misses.
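A short sketch of the bit slicing in Figure 8.26, reusing the numbers from Example 8.16 below (the second-level lookup result is hard-coded rather than read from a real table):

#include <cstdint>
#include <cstdio>

// Bit fields from Figure 8.26: 9-bit page table number, 10-bit page table offset,
// and 12-bit page offset, for a 31-bit virtual address.
int main() {
    uint32_t va = 0x003FEFB0;                            // address from Example 8.16

    uint32_t pageOffset   = va & 0xFFF;                  // low 12 bits   -> 0xFB0
    uint32_t pageTableOff = (va >> 12) & 0x3FF;          // next 10 bits  -> 0x3FE
    uint32_t pageTableNum = (va >> 22) & 0x1FF;          // top 9 bits    -> 0x0

    // Mirroring the example: first-level entry 0x0 points at a resident second-level
    // table, whose entry 0x3FE maps the page to physical page number 0x23F1.
    uint32_t ppn = 0x23F1;
    uint32_t pa  = (ppn << 12) | pageOffset;             // -> 0x23F1FB0

    std::printf("PT number 0x%X, PT offset 0x%X, page offset 0x%X -> PA 0x%X\n",
                pageTableNum, pageTableOff, pageOffset, pa);
    return 0;
}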

Example 8.16

Using a Multilevel Page Table for Address Translation

Figure 8.27 shows the possible contents of the two-level page table from Figure 8.26. The contents of only one second-level page table are shown. Using this two-level page table, describe what happens on an access to virtual address 0x003FEFB0.

Figure 8.27. Address translation using a two-level page table

Solution

As always, only the virtual page number requires translation. The most significant nine bits of the virtual address, 0x0, give the page table number, the index into the first-level page table. The first-level page table at entry 0x0 indicates that the second-level page table is resident in memory (V = 1) and that its physical address is 0x2375000.

The next ten bits of the virtual address, 0x3FE, are the page table offset, which gives the index into the second-level page table. Entry 0 is at the bottom of the second-level page table, and entry 0x3FF is at the top. Entry 0x3FE in the second-level page table indicates that the virtual page is resident in physical memory (V = 1) and that the physical page number is 0x23F1. The physical page number is concatenated with the page offset to form the physical address, 0x23F1FB0.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128200643000088