Session 21: Virtual memory

Textbook: Sections 4.3 and 4.4

Paging
  basic idea
  page table format
Implementation issues
  large tables: table directories
  large tables: inverted page tables
  speed: the TLB
Replacement algorithms
  FIFO (First in, first out)
  LRU (Least recently used)
  NRU (Not recently used)
  Clock
  LFU (Least frequently used)
  MFU (Most frequently used)

Paging

Basic idea

Paging is like swapping, but it incorporates one basic observation: Over a short period of time, a process rarely needs all its memory.

So it makes sense to divide a process's memory up into pieces, and only load a piece into memory when it's needed. We'll call each piece a page - a typical page size is 1KB.

It's easiest to understand if you think of memory as being a much larger space than it actually is - that is, if you think of there being a virtual address space. Even if our computer has only 64MB of memory, we might have a virtual address space of 4GB. For illustration purposes, though, we'll think of having 8KB of memory with a 16KB virtual address space.

Memory will be divided into page frames: In our 8KB system, with 1KB pages, there are 8 page frames where pages might reside. At any time, each page frame will contain an actual page of virtual memory.

frame  page
  7      5
  6      0
  5     15
  4     14
  3     11
  2      9
  1      3
  0      6
The system will maintain a page table - an array that maps pages to page frames.
page frame
 15    5
 14    4
 13    X
 12    X
 11    3
 10    X
  9    2
  8    X
  7    X
  6    0
  5    7
  4    X
  3    1
  2    X
  1    X
  0    6

Between the CPU and the memory lies a memory management unit (MMU), which actually handles all memory requests. When the MMU receives a memory request, it breaks it up into two pieces - the page and the offset. Our computer has a virtual memory of 2^14 bytes, so memory addresses are 14 bits long. The first four bits give the page number, and the last 10 bits give where in the page the request lies.

  PAGE     OFFSET
  4 bits   10 bits
Let's say the CPU requests memory address 10010001000010. The first four bits request page 9, and the last ten bits request offset 66 within the page. The MMU looks up page 9 in the page table and finds that it's located in page frame 2. It places the frame number - 010, since 8 frames take 3 bits - before the offset to get the actual 13-bit memory address, 0100001000010. This is what it forwards on to the memory to handle.
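
Here's a minimal sketch of this translation in C, using the session's made-up parameters (14-bit virtual addresses, 1KB pages, 8 frames) and the example page table above, with -1 standing in for the X entries:

    #include <stdio.h>

    #define OFFSET_BITS 10          /* 1KB pages                   */
    #define NOT_LOADED  -1          /* stands in for the X entries */

    /* The example page table above: page -> frame, or NOT_LOADED. */
    static int page_table[16] = {
        6, NOT_LOADED, NOT_LOADED, 1, NOT_LOADED, 7, 0, NOT_LOADED,
        NOT_LOADED, 2, NOT_LOADED, 3, NOT_LOADED, NOT_LOADED, 4, 5
    };

    /* Translate a 14-bit virtual address to a 13-bit physical one.
       Returns -1 to signal a page fault. */
    int translate(unsigned vaddr) {
        unsigned page   = vaddr >> OFFSET_BITS;              /* top 4 bits     */
        unsigned offset = vaddr & ((1u << OFFSET_BITS) - 1); /* bottom 10 bits */
        int frame = page_table[page];
        if (frame == NOT_LOADED)
            return -1;                               /* page fault         */
        return (frame << OFFSET_BITS) | offset;      /* frame, then offset */
    }

    int main(void) {
        unsigned vaddr = 0x2442;    /* 10010001000010 in binary */
        printf("virtual %u -> physical %d\n", vaddr, translate(vaddr));
        return 0;
    }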

If the MMU finds an X in that slot of the page table, it causes a page fault - it sends an interrupt to the OS, telling it that it needs to load in the page. The OS selects a page frame, saves the page at that location onto disk if necessary, loads the page from disk into memory, and updates the page table to reflect the new data.

Page table format

A typical page table entry will include the following:

  the page frame number where the page resides
  a present/absent bit (the X above) - is the page loaded at all?
  a dirty bit - has the page been written since it was loaded?
  a referenced bit - has the page been accessed recently?
  protection bits - what kinds of access (read/write/execute) are allowed?
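
As a sketch, an entry for our toy machine might be laid out in C like this - the field widths fit the example (8 frames need 3 bits), but real MMUs fix their own formats in hardware:

    /* Illustrative layout only; real hardware dictates the format. */
    struct pte {
        unsigned frame      : 3;  /* which of the 8 page frames      */
        unsigned present    : 1;  /* 0 means the X entries above     */
        unsigned dirty      : 1;  /* written since it was loaded?    */
        unsigned referenced : 1;  /* accessed recently?              */
        unsigned protection : 2;  /* e.g. read/write/execute allowed */
    };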

Implementation issues

Implementing paging brings up three problems.

  Where does the MMU live? It sits on the path of every single memory reference.
  The page table can be huge. With a 4GB virtual address space and 1KB pages, the table needs four million entries.
  It's slow. Every memory reference now requires a second memory reference to fetch the page table entry first.

Luckily, these can all be circumvented with some effort. The first isn't all that big a deal - the MMU is simply incorporated into the CPU chip, end of story. The other two are more interesting.

Large tables: Table directories

To get around the hugeness of the page tables, a common solution is to add another level of indirection - the page directory. Now each memory address has three pieces.

  TABLE     PAGE      OFFSET
  12 bits   10 bits   10 bits
Say the memory address's three pieces are T, P, and O. First the MMU looks at element T of the page directory, which specifies where the page table is. Then it looks at element P of this page table to determine the memory address where the page starts. Finally, it looks at offset O in the page to get the actual information.
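
Here's a sketch of the two-level walk in C, with illustrative types; a real MMU does this walk in hardware:

    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_BITS  10
    #define OFFSET_BITS 10

    /* Illustrative structure: each directory slot points at a page
       table, or is NULL if that whole 1MB region of the virtual
       address space is unused. */
    typedef struct { unsigned frame; int present; } pte_t;

    /* Returns the physical address, or -1 for a fault (missing
       table or missing page). */
    long translate2(pte_t **dir, uint32_t vaddr) {
        uint32_t t = vaddr >> (TABLE_BITS + OFFSET_BITS);   /* top 12 bits */
        uint32_t p = (vaddr >> OFFSET_BITS) & ((1u << TABLE_BITS) - 1);
        uint32_t o = vaddr & ((1u << OFFSET_BITS) - 1);
        pte_t *table = dir[t];
        if (table == NULL || !table[p].present)
            return -1;                                      /* page fault  */
        return ((long)table[p].frame << OFFSET_BITS) | o;
    }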

The page directory has two virtues. First, it means that unused portions of the virtual address space do not need to be listed in a page table, since that page table can simply be omitted from the page directory.

Second, it means that the individual page tables might actually be paged themselves, so that page tables for portions of the virtual address space that are used infrequently can be stored on disk.

The downside is that it adds even more to the speed problem: We've added yet another memory access, cutting effective memory speed to a third of what it would be without paging.

Large tables: Inverted page tables

An alternative is to invert the page table. The idea here is that we only care about table entries for pages actually in memory; pages that aren't loaded don't need entries at all.

So the page table can actually just be a hash table, mapping page indices to page table entries. If the page isn't currently loaded, it can simply be omitted from the hash table. Suddenly, the page table only has to have an entry for each page frame, not an entry for each page.
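
Here's a minimal sketch of the lookup, using a chained hash table sized by page frames; the structure and hash function are illustrative, not any particular architecture's format:

    #include <stddef.h>

    #define NFRAMES 8   /* table is sized by frames, not pages */

    /* One entry per loaded page, chained on hash collisions. */
    struct ipte {
        unsigned page;        /* virtual page number (the key)   */
        unsigned frame;       /* where that page currently lives */
        struct ipte *next;    /* next entry with the same hash   */
    };

    static struct ipte *buckets[NFRAMES];

    /* Returns the frame holding this page, or -1 if the page isn't
       loaded - that is, a page fault. */
    int lookup(unsigned page) {
        for (struct ipte *e = buckets[page % NFRAMES]; e != NULL; e = e->next)
            if (e->page == page)
                return (int)e->frame;
        return -1;
    }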

The biggest downside to hash tables is that accessing one can involve several memory references, not just one. So it slows down the memory reference even more.

There's a theme here: The bigger your address space, the more memory references you're going to need to map it into the relatively small actual memory space.

Speed: The TLB

The only way to resolve the speed issue is to move the page table onto the MMU. But the large size of the page table makes this problematic.

The solution is to store only a portion of the page table on the MMU - in an area called the translation lookaside buffer (TLB). It's basically a cache for the page table. It can only hold a small portion of the table - typically somewhere between 16 and 64 entries.

Each TLB entry has two pieces: the virtual page number, and the page table entry. When the MMU wants a page table entry, it checks whether any TLB entry matches the requested page. If one does, the MMU already has the page table entry; if not, it loads the entry from the page table in memory (and caches it in the TLB for next time).
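
In hardware, the comparison against every TLB entry happens simultaneously; in C we can only sketch it as a loop (the entry count and layout are illustrative):

    #define TLB_SIZE 16   /* size is illustrative */

    struct tlb_entry {
        int valid;         /* is this slot in use?    */
        unsigned page;     /* virtual page number     */
        unsigned frame;    /* cached page table entry */
    };

    static struct tlb_entry tlb[TLB_SIZE];

    /* Hardware compares all entries at once; this loop stands in for
       that parallel check.  Returns the frame, or -1 on a TLB miss,
       in which case the MMU falls back to the page table in memory. */
    int tlb_lookup(unsigned page) {
        for (int i = 0; i < TLB_SIZE; i++)
            if (tlb[i].valid && tlb[i].page == page)
                return (int)tlb[i].frame;
        return -1;
    }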

Say a memory access takes 100 ns. Without a TLB, each memory access effectively takes 200 ns - one access to read the page table entry, and one to read the actual memory location. (We're neglecting page faults for the purposes of this example.) But with a TLB, the hardware can check all the TLB entries simultaneously, producing the page table entry essentially immediately. So when the TLB contains the page table entry, the memory access takes just 100 ns.

A typical TLB might hold 64 entries at any time and get a 90% hit rate. With this hit rate, the average time per memory access is 0.9*100 + 0.1*200 = 110 ns. So the program experiences just a 10% slowdown relative to raw memory - a much more manageable price to pay for the convenience of virtual memory.

Replacement algorithms

In implementing paging, it's quite important that the OS choose a good strategy for deciding which pages to keep in memory. (The same issue arises in deciding what to keep in the TLB.) A good algorithm must both avoid page faults and be very fast - you don't want an algorithm that frequently searches the entire page table. People have investigated a lot of algorithms for this.

FIFO (First in, first out)

The simplest is FIFO: The algorithm always ejects the page that was least recently loaded.

This algorithm has two things going for it: First, each page gets its fair share of time in memory. Second, it's quite easy to implement: The OS just has to maintain a counter recording the last page frame ejected. When the OS gets a page fault, it simply increments the counter (wrapping around if necessary) and ejects the page in the frame the counter now specifies. It doesn't get any more efficient than that.
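
Here's a sketch of that counter scheme, assuming 8 page frames:

    #define NFRAMES 8

    static int last_ejected = 0;   /* the one counter FIFO needs */

    /* On a page fault: increment the counter, wrapping around, and
       eject whatever page is in the frame it names. */
    int fifo_choose_victim(void) {
        last_ejected = (last_ejected + 1) % NFRAMES;
        return last_ejected;
    }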

FIFO does a credible job, but it's not terrific. One of the weird things about FIFO is that its effectiveness sometimes degrades with increased memory - this is known as Belady's anomaly. Consider the following sequence of page references: 1,2,3,4,1,2,5,1,2,3,4,5. With three page frames, FIFO experiences nine page faults. With four page frames, it experiences ten.
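
You can verify those fault counts with a small simulation - a sketch, using the reference string above:

    #include <stdio.h>

    /* Count FIFO page faults for a reference string, given a number
       of page frames. */
    int fifo_faults(const int *refs, int n, int nframes) {
        int frames[16];   /* frame contents; -1 means empty */
        int next = 0, faults = 0;
        for (int i = 0; i < nframes; i++)
            frames[i] = -1;
        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < nframes; j++)
                if (frames[j] == refs[i]) { hit = 1; break; }
            if (!hit) {
                frames[next] = refs[i];          /* eject the oldest page */
                next = (next + 1) % nframes;
                faults++;
            }
        }
        return faults;
    }

    int main(void) {
        int refs[] = {1,2,3,4,1,2,5,1,2,3,4,5};
        int n = sizeof refs / sizeof refs[0];
        printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));   /* prints 9  */
        printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));   /* prints 10 */
        return 0;
    }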

LRU (Least recently used)

LRU says that the algorithm should always eject the page that was least recently used. The idea is that a page that has been used recently is likely to be used again in the near future.

LRU has about the best performance anybody has found in a simple algorithm. The only reason it's not used universally is its implementation. People have two ways of doing it. In the counter implementation, the hardware keeps a global counter that is bumped on every memory reference, and a page's table entry is stamped with the counter's current value each time the page is accessed; on a page fault, the OS ejects the page with the smallest stamp. In the linked list implementation, the loaded pages are kept in a list ordered by recency: every access moves the accessed page to the front, and a page fault ejects the page at the back.

The counter implementation is the simpler one. It's also more efficient in handling individual page accesses. The linked list handles page faults more efficiently - but page faults are supposed to be extremely rare, and anyway the search through the stamps is cheap compared to the time spent loading the page from disk.
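
Here's a sketch of the counter version (in practice the per-access stamping has to happen in hardware to be fast enough):

    #define NFRAMES 8

    static unsigned long ticks = 0;        /* global reference counter     */
    static unsigned long stamp[NFRAMES];   /* counter value at last access */

    /* Run on every reference to frame f - realistically, in hardware. */
    void lru_touch(int f) {
        stamp[f] = ++ticks;
    }

    /* On a page fault, search for the frame with the oldest stamp. */
    int lru_choose_victim(void) {
        int victim = 0;
        for (int f = 1; f < NFRAMES; f++)
            if (stamp[f] < stamp[victim])
                victim = f;
        return victim;
    }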

But neither of these approaches is very efficient. That's why people look for alternatives.

NRU (Not recently used)

This goes back toward the simplicity of FIFO: It requires no data beyond what the basic page table entry already keeps - the dirty bit and the referenced bit for each loaded page.

When a page fault occurs, NRU works as follows: It picks randomly from the pages that have neither the dirty bit nor the referenced bit set. If there aren't any, it selects from the pages for which the referenced bit is zero. If there still aren't any, it selects from the pages for which the dirty bit is zero, and if there still aren't any, it just chooses randomly from all the loaded pages (all of which have both the dirty bit and the referenced bit set).
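
The four cases amount to ranking each page by the two-bit class 2*referenced + dirty and picking randomly from the lowest nonempty class. Here's a sketch, with illustrative data structures:

    #include <stdlib.h>

    #define NFRAMES 8

    struct frame { int referenced, dirty; };
    static struct frame frames[NFRAMES];

    /* Pick at random from the lowest nonempty class, where
       class = 2*referenced + dirty. */
    int nru_choose_victim(void) {
        for (int cls = 0; cls < 4; cls++) {
            int candidates[NFRAMES], n = 0;
            for (int f = 0; f < NFRAMES; f++)
                if (2 * frames[f].referenced + frames[f].dirty == cls)
                    candidates[n++] = f;
            if (n > 0)
                return candidates[rand() % n];
        }
        return 0;   /* unreachable: every frame is in some class */
    }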

The idea of this algorithm is to prefer to select pages that haven't been accessed recently. And, if we can't distinguish pages based on the recency of their access, we should prefer to eject clean pages, since ejecting them doesn't entail copying them onto disk first.

While NRU isn't all that great as far as avoiding page faults, it's simple and requires minimal support for each page access.

Clock

The clock algorithm also works only with the existing bits. But it's not random, and it doesn't use the dirty bit. It only looks at the referenced bit.

Actually, it's just like FIFO, except that the algorithm never ejects a page whose referenced bit is set. Instead, it clears the bit and goes on to check the next page frame.
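
Here's a sketch, assuming 8 frames arranged in a circle with a single hand:

    #define NFRAMES 8

    static int referenced[NFRAMES];   /* the referenced bit per frame */
    static int hand = 0;              /* the clock hand               */

    /* Sweep forward from the hand.  A set referenced bit earns the
       page a reprieve: clear the bit and move on.  The first page
       found with a clear bit is the victim. */
    int clock_choose_victim(void) {
        for (;;) {
            if (!referenced[hand]) {
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            referenced[hand] = 0;     /* second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }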

Clock is only slightly more complex than FIFO, but it gives some approximation to LRU with, again, only minimal support for each page access.

LFU (Least frequently used)

With LFU, each page table entry has a counter, and for each memory reference the MMU increments that counter. When a page fault occurs, the OS should choose the page frame whose counter is smallest.

LFU is quite bad in this simplest form. The problem is that a page can be accessed very heavily for one second, and never at all for the next ten seconds. But the count gets so high that the page isn't ever ejected.

So what people do instead is to right-shift the counters periodically (say, every 100 ms), so that a counter eventually decays back to 0 if its page isn't being used. This works much better.
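
Here's a sketch of the decaying counter; the periodic shift would be driven by a timer interrupt:

    #define NFRAMES 8

    static unsigned counter[NFRAMES];   /* per-frame frequency counter */

    /* Bumped by the MMU on each reference to frame f. */
    void lfu_touch(int f) {
        counter[f]++;
    }

    /* Run periodically (say, every 100 ms from a timer interrupt):
       halving every counter lets old activity decay away. */
    void lfu_decay(void) {
        for (int f = 0; f < NFRAMES; f++)
            counter[f] >>= 1;
    }

    /* On a page fault, eject the frame with the smallest counter. */
    int lfu_choose_victim(void) {
        int victim = 0;
        for (int f = 1; f < NFRAMES; f++)
            if (counter[f] < counter[victim])
                victim = f;
        return victim;
    }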

MFU (Most frequently used)

The MFU algorithm comes from the same sort of reasoning that brought up the worst-fit memory allocation scheme: Why not do the exact opposite? In this case, ejecting the page frame whose frequency counter is largest actually makes some sense: If a page's frequency counter is large, then that page has had its fair share of the memory, and it's time for another page to get a turn.

Unfortunately, it doesn't work.