MM
---

The definition of the struct mm_struct is in linux/sched.h. An mm structure defines an address space. Each process or clone sharing an address space has a pointer to an mm from its task structure. The number of tasks sharing an address space is maintained in "count", which is atomically manipulated by mmget/mmput.

The "mmap" field is a list of vma's, each of which tracks an allocated address range in the address space. The vma's are kept in increasing address order, chained via vm_next. When the number of vma's gets to be more than AVL_MIN_MAP_COUNT, a parallel avl tree is created out of them for faster searching. This tree is rooted at "mmap_avl". The "mmap_cache" keeps a pointer to the last vma found, and cuts down on tree/list searches due to locality of reference. The number of vma's is in "map_count". The "mmap_sem" guards addition/deletion of vma's and changes to any fields in any vma in the list. Note that the code guards against having more than MAX_MAP_COUNT vma's in a space.

The "pgd" is a kernel pointer to the page directory. On some processors this is just a software structure (as on MIPS, where the kernel updates the tlb and the hardware does not really look at the page directory), and on others it is walked by the hardware (as on ia32). The "context" field is a processor mmu field, used on some processors (MIPS, sparc) but not on others. Similarly, the "segments" field is also processor specific, and is used just on ia32 to track the LDT. The "cpu_vm_mask" is also just used on the sparcs.

Currently, "def_flags" has only one possible value, ie VM_LOCKED, and indicates whether future-locking is enabled on the address space. The "locked_vm" field tracks the total number of locked pages (the sum of pages in locked vma's, somewhat different for the stack) in the address space for accounting/reservation reasons. The "total_vm" field tracks the total address space summed over all the vma's. The "rss" field tracks the number of incore pages belonging to the address space.

The "swap_address" and "swap_cnt" fields are used to determine how many pages to steal, and from which address range, when memory runs low. The "swap_address" is maintained as a rotor. The "swap_cnt" field is used to search for the best process to steal pages from, depending on the process' address space "rss".

The "arg_start", "arg_end", "env_start" and "env_end" fields are no misnomers: they maintain the user virtual addresses of the argv[] and env[] arrays that the executing image has on its stack, and can be queried via procfs. The fields "start_code", "end_code", "start_data" and "end_data" are similar. These are set up at exec time, when the kernel creates the stack for the program and populates it with the argv/env arrays. The "end_code" field is used in the brk() system call to make sure that the break area does not run into the code, and that the data stays within the rlimit boundary. The "start_stack" field is also set at exec time, and is only used by shared memory to make sure some growing place is left for the stack (WHY NOT FOR OTHER MMAPS?). "start_brk" is also set at exec time, but the "brk" value changes as the program issues the brk system call to get more data space.
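Here, for orientation, is a condensed sketch of the mm_struct fields discussed above. This is not the verbatim declaration from linux/sched.h: only the fields mentioned in this section are shown, and the ordering and some of the types are approximations.

    struct mm_struct {
            atomic_t count;                    /* tasks sharing this address space (mmget/mmput) */
            struct vm_area_struct *mmap;       /* vma list, sorted by increasing address */
            struct vm_area_struct *mmap_avl;   /* avl tree root, built past AVL_MIN_MAP_COUNT vma's */
            struct vm_area_struct *mmap_cache; /* last vma found, exploits locality of reference */
            int map_count;                     /* number of vma's (bounded by MAX_MAP_COUNT) */
            struct semaphore mmap_sem;         /* guards vma addition/deletion/changes */
            pgd_t *pgd;                        /* kernel pointer to the page directory */
            unsigned long context;             /* mmu context (MIPS, sparc) */
            unsigned long start_code, end_code, start_data, end_data;
            unsigned long start_brk, brk, start_stack;
            unsigned long arg_start, arg_end, env_start, env_end;
            unsigned long rss;                 /* incore pages */
            unsigned long total_vm;            /* address space size summed over all vma's */
            unsigned long locked_vm;           /* locked pages, for accounting/reservation */
            unsigned long def_flags;           /* 0 or VM_LOCKED: future-locking */
            unsigned long swap_address;        /* rotor for swap_out */
            unsigned long swap_cnt;            /* pages to steal on the next pass */
            unsigned long cpu_vm_mask;         /* sparc only */
            void *segments;                    /* ia32 LDT (type is an approximation here) */
    };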
VMA
----

The definition of a vma is in linux/mm.h. Each vma tracks an address range allocated to the address space. The address range is maintained in "vm_start" and "vm_end". The "vm_mm" points back into the owning memory manager (ie, address space). It is mostly used to update address space specific fields, like "rss", "locked_vm" and "total_vm", as well as to locate the page directory.

The "vm_next" field is used to track the vma's in increasing order of address ranges in a singly linked list. This list is used for searching the vma corresponding to any given range ... to decrease search time, when the list grows to have more than AVL_MIN_MAP_COUNT vma's, a parallel avl tree is also maintained via the "vm_avl_left", "vm_avl_right" and "vm_avl_height" fields. The "vm_next_share" and "vm_pprev_share" fields help in managing a doubly linked list of vma's. The shm code uses this to track all the vma's that map a single shared object (although this is only recreational, not productive). If the underlying object is a file/device, the mmap code tracks all the vma's mapping that object via these fields - this is used if the file is being truncated, to visit all the vma's and shoot down the affected pages in them.

When the vma represents a file/device, the "vm_file" is a pointer to the file data structure. The "vm_offset" is the offset into the file where the mapping starts. To preserve the underlying file structure, do_mmap grabs a reference on the file, so that vm_file points to a sane structure. The "vm_pte" is only used when the vma has an underlying shm object, and the shm code uses it to track the shm id that the vma is mapping. This is used when a no-page fault happens, so that the shm code can go look at the proper data structures for the shm id and update the pte accordingly.

The "vm_ops" field is a pointer to a set of operations defined by the manager of the underlying object. For example, the vm_ops for a shm area, for a shared file mapping and for a private file mapping are all different. Drivers are also free to define their own operations when a vma maps the corresponding device or driver memory.

"vm_page_prot" is always protection_map[vm_flags & 0xf]. The lower 4 bits of vm_flags maintain the VM_READ/WRITE/EXEC (ie PROT_READ/WRITE/EXEC) permissions of the mapping and the VM_SHARED flag, which indicates whether the address range shares its changes with the underlying object. "vm_page_prot" is used to set up the pte protections while updating the pte's. The protection_map[] defines what bits are put in the pte corresponding to shared and private mappings with the specified rwx privileges.
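As a small illustration of that lookup, the sketch below shows how the low four bits of vm_flags index protection_map[]. The VM_* values spelled out here are assumptions consistent with the rwx + shared layout just described, not copied from linux/mm.h, and the helper name is made up.

    /* Sketch only: the low four bits of vm_flags index protection_map[].
     * These flag values are assumptions for illustration; the real
     * definitions live in linux/mm.h. */
    #define VM_READ    0x0001
    #define VM_WRITE   0x0002
    #define VM_EXEC    0x0004
    #define VM_SHARED  0x0008

    extern pgprot_t protection_map[16];   /* 8 private + 8 shared entries */

    static inline pgprot_t sketch_page_prot(unsigned long vm_flags)
    {
            /* e.g. a MAP_PRIVATE PROT_READ mapping indexes entry 1,
             * a MAP_SHARED PROT_READ|PROT_WRITE mapping indexes entry 11 */
            return protection_map[vm_flags & 0x0f];
    }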
The flags VM_MAYREAD and VM_MAYEXEC track what maximum protections are available via mprotect - they are set on all mappings except for MAP_SHARED write-only file mappings (note that MAP_PRIVATE mappings are not allowed on non-readable files). The VM_MAYWRITE flag is turned on for all non-file mappings, for MAP_PRIVATE file mappings, as well as for MAP_SHARED mappings on files opened for write, but not on MAP_SHARED (readonly) mappings on files opened readonly. It is a scheme to demote MAP_SHARED readonly mappings to MAP_PRIVATE readonly mappings. VM_SHARED mappings share any changes made in the mm context with the underlying object; the flag is set by MAP_SHARED mappings (except readonly MAP_SHARED mappings, which will never be promoted to have write perms). The VM_MAYSHARE flag indicates that the mapping was created by a MAP_SHARED mapping, although the mm might never be able to change any of the pages of the underlying object. VM_DENYWRITE mappings are normally mappings set up by the kernel internally, as when mapping an a.out (or libraries and interpreters), and indicate that the underlying file can not be modified while the mapping is around. VM_LOCKED indicates that the pages are locked and can not be stolen.

VM_EXECUTABLE indicates that the mapping is for the executable part of an a.out. The VM_LOCKED flag indicates that the pages of the vma have been faulted and locked into memory, and none of these can be stolen when memory runs low on the system. The VM_GROWSDOWN flag is only for use by the vma managing the stack, and indicates that growing the vma decreases "vm_start" (instead of increasing "vm_end"). The counterflag, VM_GROWSUP, does not seem to be used. The VM_IO flag denotes that the vma is mapping an io device, which should not be dumped if the program core dumps. At least on the m68k, it also seems that these vma's can not incur page faults (meaning the drivers must validate the pte's at mapping time). The VM_SHM flag is turned on by shm code and by drivers. It indicates that a vma can not be merged with vma's mapping neighbouring address ranges, unless the offsets line up properly (as when merging vma's of file mappings).

PAGES
-----

The definition of a page is in linux/mm.h.

Page usage in the system: include/linux/mm.h has a bunch of comments about how pages are maintained and the meanings of the various bits in the "flags". Essentially, each page in the system is maintained via a "struct page". Absent (or inaccessible) pages are marked PG_reserved (for example, on the ia32, 0x9F000 ... really 0xA0000, to 1Mb is reserved for the video memory). These pages are then not allocated by the os for any data. All other pages in the system are usable. Drivers/slab code/filesystems can grab a page, and then maintain/alter the different fields in the page structure. Another user of pages is the buffer system, which divvies up a physical page into multiple buffer-data areas - these pages have their "buffers" field pointing to a chain of the headers of the buffers whose data areas are within the page. Another significant user of pages is the shared memory subsystem. At low memory conditions, it is possible to reclaim some/all of these pages (shm_swap, kmem_cache_reap etc). The PG_Slab bit is used to indicate the case when a page is allocated to the slab cache, and is used mostly for sanity purposes in the slab cache allocation/free code. By far, the most complicated uses of a page are by the filesystem code that caches file data in pages, and by the swap code that caches disk copies of data that it has swapped into/out of core. For information on how the shm code, file data code and swap code manage these pages and the fields in the structure, look into the corresponding sections.

Page free list management: All allocatable, free pages in the system are put in the free list. The free list is managed as an array of free-lists, each of which keeps track of 1, 2, 4, 8 ... contiguous free pages. Whenever a page is freed, a buddy algorithm is invoked to coalesce it into the biggest possible free large page. Page allocation and freeing always add the (large) page at the head of the list. Note that free_area[i] will track (via the page "next"/"prev" pointers) any pfn that is a multiple of 2^i, if the 2^i consecutive pages starting from that pfn are free. Also, this page will not appear in the lower order freelists, ie, even though pfns 8 .. 15 might be free, pfn 8 will appear only in free_area[3], not in the lower order ones. To implement the buddy algorithm, free_area[i] has a bitmap that indicates whether each of the possible pfns that can be present in this list are actually present in this list or not.
In the previous example for free pfns 8 .. 15, only pfn 8 will be present in free_area[3] with defined "next"/"prev" pointers; all the other pages will not be in any free list, and will have undefined "next"/"prev" pointers. Page allocation and freeing are protected by a single spinlock, page_alloc_lock.

Page fields: When a page is on the free list, the "next"/"prev" pointers are used to link free pages together. When a page belongs to a file, the "inode" field is set correspondingly. Note that for anonymous pages that are being stolen by kswapd, the "inode" is set to "swapper_inode", which is a pseudo file that maintains pages that have unmodified disk copies on swap. This association might also be temporary (as used by shm). In either case, the "next"/"prev" pointers form a list of pages belonging to the inode (except for the temporary shm case). In the case of files, the "offset" is the offset into the file whose data is present in the page; for the swap case, the "offset" is a swap handle which identifies where on the swap device the page has been copied to. The "next_hash"/"pprev_hash" pointers form a hash list on (inode, offset) in either case (except temporary shm). The "wait" field is used to track tasks waiting for a page to become unlocked, ie, for io to complete on the page. If the page is allocated to the buffer subsystem, the "buffers" field points to a list of buffer headers, each of whose data area is in the page. At least, shrink_mmap uses this to steal buffer pages and to call into the buffer code to clean up the association. The "count" field manages the number of references to the page. These consist of user references (since there are no shared page tables, that would be the number of pte's that point to the page), as well as kernel references (as in the hold obtained on a page to put it in the file/swap cache).

Page flags: PG_locked indicates that io has been scheduled on a page. After the io (sync or async) is completed (or can not be done due to buffer header scarcity), the buffer routines clear the bit. Places which set the bit are:

1. when a page is being stolen from a process, and it is being put in the swapcache and rw_swap_page is being invoked on it.
2. the shm code sets the bit when it is swapping a page in/out via rw_swap_page_nocache.
3. async swapping via read_swap_cache_async sets the bit.
4. generic_readpage, which is the readpage operation to read in a page of a file.
5. generic_file_write sets the bit on the page it is going to be writing to.

When a page is locked, it can not be stolen. For inode pages, locked pages can not be removed from the cache (invalidate_inode_pages), and truncation has to wait for the pages to become unlocked. Locked pages have to be waited for, before they can be returned via a page/swap cache search.

PG_skip is a sparc processor specific flag. The PG_Slab bit is used to indicate the case when a page is allocated to the slab cache, and is used mostly for sanity purposes in the slab cache allocation/free code. PG_error is cleared before the buffer routines start io on a page, might be set at the end of async io (EXACTLY WHEN?), and is cleared whenever a page is put in the file cache (not the swapcache though). This bit is checked in the file read case to make sure an io error did not occur. The PG_swap_cache bit indicates that a page is in the swapcache, ie, it is unmodified with respect to its copy on the swap device. Whenever an anonymous page is being stolen by kswapd, and whenever such a page is swapped in, the page is put in the swapcache (temporary shm case).
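As an aside, stepping back to the free list management described earlier in this section, the sketch below shows the coalescing step that the buddy algorithm performs when a page is freed. The helper names and NR_ORDERS are hypothetical; the point is only the pfn arithmetic that decides which free_area[] order a block ends up on.

    /* Hypothetical helpers standing in for the bitmap and list work that
     * free_area[] really does. */
    extern int pfn_is_free(unsigned long pfn, unsigned long order);
    extern void remove_from_free_area(unsigned long pfn, unsigned long order);
    extern void add_to_free_area(unsigned long pfn, unsigned long order);

    #define NR_ORDERS 10    /* free_area[0..9]: blocks of 1, 2, 4, ... 512 pages */

    static void sketch_free_page(unsigned long pfn)
    {
            unsigned long order = 0;

            while (order < NR_ORDERS - 1) {
                    unsigned long buddy = pfn ^ (1UL << order); /* only possible merge partner */

                    if (!pfn_is_free(buddy, order))
                            break;                      /* buddy busy: stop coalescing */
                    remove_from_free_area(buddy, order);
                    pfn &= ~(1UL << order);             /* merged block starts at the lower pfn */
                    order++;
            }
            /* Freeing pfn 12 while 8..11 and 13..15 are already free ends with a
             * single entry for pfn 8 on free_area[3] and nothing on the lower
             * order lists, matching the example above. */
            add_to_free_area(pfn, order);
    }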
For pages put in the swapcache this way, the PG_swap_unlock_after bit will also be set right before io is started on the page, so that the io done routines can invoke swap_after_unlock_page (to unlock the swap pages). [[ rw_swap_page_base sets this bit if PG_swap_cache is set, but all pages coming into rw_swap_page_base must have PG_swap_cache set, so why the checking? ]]

The PG_free_after bit is used to indicate to the buffer iodone routine that the page needs to be freed. Sync/async swap and generic_readpage set this bit before invoking the read/write brw_page routine, since both cases grab a page reference before the io starts. The PG_decr_after bit is similar, and is used to accurately maintain the number of async swap io's in flight in "nr_async_pages".

The PG_referenced bit is used to decide whether to steal a page or not, and indicates whether a physical page has been accessed recently or not. Pages in the file/swap cache are marked PG_referenced when a search for them is made in the cache, so that shrink_mmap leaves these pages alone. The buffer subsystem also marks some dirty buffers PG_referenced so that the pages containing their data areas are not freed up by shrink_mmap. Also, when kswapd encounters a young pte, it turns on the PG_referenced bit as a later hint to shrink_mmap to leave the page alone.

The PG_uptodate bit is set by the buffer iodone routines when an io completes (both read and write) successfully. Though this bit is set/cleared on both swapcache and filecache pages, it is only ever checked on file pages, in the "nopage" and "read" paths, to make sure that valid contents were read in from the disk.

The PG_DMA bit indicates whether the page is in a physical address range that all devices can do dma to. This is used in the page stealing/kswapd path to determine whether to steal a page, depending on whether there is need for dmaable memory or regular memory.

FILE CACHE AND FILE/VM INTERACTIONS
------------------------------------

Each inode has a list of vma's mapping the file in "i_mmap". This is used if the file is being truncated, to visit all the vma's and shoot down the affected pages in them. The inode also has a pagecache list "i_pages", with the number of pages in the list in "i_nrpages"; the pages are chained via their "next"/"prev" fields. These pages are also in a hash list (page_hash_table) linked by the "pprev_hash"/"next_hash" fields. Each page has a pointer to the inode it belongs to in the "inode" field, and an "offset" into the file. Just because a page is in the pagecache does not mean it has the right data ... it may be PG_locked, indicating io is in progress on the page, or it may not have PG_uptodate data. When a page is put in the cache, its PG_error, PG_referenced and PG_uptodate flags are cleared. The filecache establishes a reference count on the page structure.

Things which add pages into the hash Q are: filemap_nopage, when it has to allocate a page and read in file contents; try_to_read_ahead, when it does not get a cache hit and wants to use any passed-in page for readahead; generic_file_write/do_generic_file_read, when they have to allocate pages to satisfy the user request; and get_cached_page, when it allocates a page for the file.

Things which delete pages from the file cache are:

1. invalidate_inode_pages, which, funnily, is not coming out of msync MS_INVALIDATE, which seems to be *not* deleting the page from the cache.
2. truncate_inode_pages, which is called when an inode is being deleted by the filesystem, and to handle the truncate() system call.
3. remove_inode_page, when shrink_mmap releases a file page, and when a page is removed from the swapcache.

SWAP CACHE
----------

The swap cache maintains a pseudo filecache of pages that are clean with respect to their copies on the swap device. The kernel uses a pseudo inode "swapper_inode" to associate these pages with. The swapcache establishes a reference count on the page structure (using the underlying filecache routines). Three things indicate that a page is in the swap cache - the page's inode points to swapper_inode, the page has the PG_swap_cache flag set, and the page is in the swapper_inode cache/hash Q (temporary shm swapcache pages do not satisfy the last criterion).

A page is added to the swapcache/hash Q by:

1. read_swap_cache_async/swapin_readahead/swap_in, when a swap page is read in from disk. The page is added to the cache before the io is started, hence though the page is in the SwapCache, it may not have the uptodate contents. This is anyway indicated by the setting of the PG_locked bit, which is cleared after the io completes - this is the synchronization between read_swap_cache_async reading the page and lookup_swap_cache finding it. [[ When rw_swap_page_base starts to read a page from disk, it clears PG_uptodate to indicate the page contents are not valid (though this PG_uptodate clearing is done again in brw_page). SwapCache pages do not need to look at this bit. Why is it being cleared? ]]
2. try_to_swap_out, by kswapd/page stealers, on non-file and non-shm pages while trying to free pages.

A page is removed from the swapcache/hash Q by remove_from_swap_cache/delete_from_swap_cache when:

1. shrink_mmap finds a page in the swapcache that has a reference only from the swapcache, meaning that the page has already been stolen and written out to disk (note that after_unlock_page drops the ref count on a page after io is done, to counter the ref inc that was done in rw_swap_page_base).
2. do_wp_page, when we are grabbing a page to write to it, so that the old copy on swap is invalidated.
3. swap_in, when we know the page we swapped in was dirtied, and hence we set write perms on the page right away (instead of setting read perms and removing it from the swapcache later, on the write fault).
4. free_page_and_swap_cache, when a page is being freed, and no one other than the swapcache has a reference to it (except ongoing io). For example, while we are trying to swap_in, and someone else (like a clone member or the debugger) has already done it for us. Or when we are freeing a range of user pages (unmap/exit/remap). Or in put_page, when we have handled a not-present page fault, identified the page to drop into the pte, but find that it has already been dropped in while we were sleeping identifying/initing the page that we now hold.
5. try_to_unuse, when we are trying to delete a swap device.

When a page is in the swapcache/hash Q, its "offset" field tracks the swap handle which identifies where on the swap device it is. When the page is stolen, this swap entry is put in the pte, so that the process can do a do_swap_page on a "page-not-present" fault. Note that the SwapCache is searched by the same underlying routines that do PageCache searching (__find_page), hence when a swap_in happens, the code ends up setting PG_referenced, waiting for PG_locked to go off [[ but not PG_uptodate? ]]. Each page in the swap cache has an associated swap handle, which was obtained by a call to get_swap_page().
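To make the three indicators above concrete, here is a minimal sketch of a membership test; swapper_inode, PG_swap_cache and the "offset" swap handle are the real names used in the text, while the function itself is made up and glosses over the hash queue walk.

    /* Sketch: the three signs that a page is in the swap cache.  Temporary
     * shm pages pass the first two tests but are not on the hash queue, so
     * a full test would also walk the (swapper_inode, offset) hash chain. */
    static int sketch_page_in_swap_cache(struct page *page)
    {
            if (page->inode != &swapper_inode)
                    return 0;
            if (!test_bit(PG_swap_cache, &page->flags))
                    return 0;
            /* page->offset holds the swap handle obtained from get_swap_page() */
            return 1;
    }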
Each swap page has a reference count, indicating how many different tasks/data structures have a reference on that swap page. One reference comes from the swapcache itself, plus one for each process sharing the page. Note that if the user's virtual page is not in core, the swap count will not have any reference corresponding to the swapcache.

PAGE STEALING/KSWAPD
--------------------

When memory runs low, and a process can not find a free page, it wakes up kswapd, the memory stealer, and also tries to steal some pages itself. This includes trying to trim and free up some pages held in the slab and inode caches, trying to rip away user references to pages, and trying to rip away kernel references to pages.

The "swap_out" routine identifies a process to steal memory from (based on rss), and then rotors through its address space selecting a page to steal. Recently accessed pages (young ptes) are not stolen; otherwise, the user's reference to the page is deleted. Depending on whether the page is a shm or file page (really, whether the controlling vma has a "swapout" routine), different actions are taken: dirty file pages are handed to kpiod to be written out asynchronously, shm pages have no associated action, and anon pages are also asynchronously pushed to swap. For the anon case, the swap handle is stored in the pte; for the other cases, the pte is cleared. For pages controlled by vma's with their own swapout routines (shm, map-shared files), the pte is cleared when the page is stolen, so "do_no_page" is invoked on the next access; thus if the vma has a swapout routine, it had better also have a nopage routine. The only other case under which do_no_page is invoked is if the page is anonymous, but has not been allocated yet.

The real page freeing is done by shrink_mmap: it scans physical memory sequentially, and tries to rip away kernel references to the pages one by one. It can do this for only one kernel reference. It attempts to release unused buffer pages, swapcache pages and filecache pages, by ripping the page away from the corresponding data structures.

Anon page stealing: The way a non-file, non-shm page gets freed up by kswapd/stealers is this: try_to_swap_out adds the page to the swapcache/hash Q, does an async rw_swap_page, and frees the user reference. The page is cleaned by the swap io routines, and then ends up with just a swapcache reference. A shrink_mmap at that stage will find it and put it in the free list.
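A minimal sketch of that anonymous path follows; every helper name is made up, with the real routine each step stands for named in the comments, and locking, error handling and pte details are omitted.

    /* Hypothetical helpers standing in for the real routines named alongside. */
    extern unsigned long sketch_get_swap_handle(void);                     /* get_swap_page() */
    extern void sketch_add_to_swap_cache(struct page *, unsigned long);    /* add to swapper_inode cache */
    extern void sketch_store_swap_entry(pte_t *, unsigned long);           /* swap entry into the pte */
    extern void sketch_start_swap_write(struct page *, unsigned long);     /* async rw_swap_page */
    extern void sketch_drop_user_reference(struct page *);                 /* free the user's reference */

    static void sketch_steal_anon_page(struct page *page, pte_t *ptep)
    {
            unsigned long entry = sketch_get_swap_handle();
            if (!entry)
                    return;                          /* swap is full; leave the page alone */

            sketch_add_to_swap_cache(page, entry);   /* inode = swapper_inode, offset = entry */
            sketch_store_swap_entry(ptep, entry);    /* next access faults into do_swap_page */
            sketch_start_swap_write(page, entry);    /* PG_locked until the io done routines run */
            sketch_drop_user_reference(page);        /* page is left with just the swapcache ref;
                                                        shrink_mmap will free it after the io */
    }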
Shm page stealing: shm pages are swapped a little differently. The shm data structure keeps a reference on each in-core page controlled by it. try_to_swap_out just releases the user program's reference on these pages, cleaning the pte; shm_swapout does no io. do_try_to_free_pages invokes shm_swap, which will identify pages that it controls, and which no one else (including any other user) has a reference on, and free the page after getting a swap handle for it and completing io on it (since this temporarily "puts" the page in the swapcache, a temporary refcount increment is done to protect the page from shrink_mmap freeing it out). These pages are not put in the swapcache. When a user wants to access the page, he invokes shm_nopage, which will fire off a swap-in request, again making sure the page is not put into the swapcache. The shm code uses the rw_swap_page_nocache code to mimic putting a page in the swap cache (to fool rw_swap_page) by setting inode/offset/PG_swap_cache, then doing the rw_swap_page, and then clearing off the inode/PG_swap_cache. Hence, for shm pages too, the offset gives the swap handle. Note that rw_swap_page_nocache has to handle being invoked for a locked page on which io is in progress. So that shrink_mmap does not erroneously try to free out a shm page that has been temporarily marked "in swapcache", the shm swapping code temporarily bumps the page reference count.

File page stealing: file pages are handled almost similarly. filemap_swapout/filemap_write_page increment the page reference count and pass the page to kpiod to do a do_write_page and free the page. All this time, the page is in the inode's page cache; later, shrink_mmap will find the page, remove it from the page cache and free it.
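To round the section off, here is a sketch of that map-shared file path; again the helper names are hypothetical, and only the reference counting and the hand-off to kpiod are being illustrated.

    /* Hypothetical helpers: the real work is done by filemap_swapout/
     * filemap_write_page, kpiod's do_write_page, and shrink_mmap. */
    extern void sketch_get_page(struct page *);                              /* bump page->count */
    extern void sketch_queue_for_kpiod(struct file *, struct page *);        /* kpiod writes and drops the ref */

    static int sketch_filemap_swapout(struct vm_area_struct *vma, struct page *page)
    {
            sketch_get_page(page);                        /* extra reference held across the io */
            sketch_queue_for_kpiod(vma->vm_file, page);   /* kpiod does do_write_page() and frees */
            /* the page stays in the inode's page cache the whole time;
             * shrink_mmap removes and frees it once nothing else holds it */
            return 0;
    }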