arch/xtensa/core/README_MMU.txt - third_party/github/zephyrproject-rtos/zephyr - Git at Google

 # Xtensa MMU Operation

 As with other elements of the architecture, paged virtual memory
 management on Xtensa is somewhat unique.  And there is similarly a
 lack of introductory material available.  This document is an attempt
 to introduce the architecture at an overview/tutorial level, and to
 describe Zephyr's specific implementation choices.

 ## General TLB Operation

 The Xtensa MMU operates on top of a fairly conventional TLB cache.
 The TLB stores virtual to physical translation for individual pages of
 memory.  It is partitioned into an automatically managed
 4-way-set-associative bank of entries mapping 4k pages, and 3-6
 "special" ways storing mappings under OS control.  Some of these are
 for mapping pages larger than 4k, which Zephyr does not directly
 support.  A few are for bootstrap and initialization, and will be
 discussed below.

 Like the L1 cache, the TLB is split into separate instruction and data
 entries.  Zephyr manages both as needed, but symmetrically.  The
 architecture technically supports separately-virtualized instruction
 and data spaces, but the hardware page table refill mechanism (see
 below) does not, and Zephyr's memory spaces are unified regardless.

 The TLB may be loaded with permissions and attributes controlling
 cacheability, access control based on ring (i.e. the contents of the
 RING field of the PS register) and togglable write and execute access.
 Memory access, even with a matching TLB entry, may therefore create
 Kernel/User exceptions as desired to enforce permissions choices on
 userspace code.

 Live TLB entries are tagged with an 8-bit "ASID" value derived from
 their ring field of the PTE that loaded them, via a simple translation
 specified in the RASID special register.  The intent is that each
 non-kernel address space will get a separate ring 3 ASID set in RASID,
 such that you can switch between them without a TLB flush.  The ASID
 value of ring zero is fixed at 1, it may not be changed.  (An ASID
 value of zero is used to tag an invalid/unmapped TLB entry at
 initialization, but this mechanism isn't accessible to OS code except
 in special circumstances, and in any case there is already an invalid
 attribute value that can be used in a PTE).

 ## Virtually-mapped Page Tables

 Xtensa has a unique (and, to someone exposed for the first time,
 extremely confusing) "page table" format.  The simplest was to begin
 to explain this is just to describe the (quite simple) hardware
 behavior:

 On a TLB miss, the hardware immediately does a single fetch (at ring 0
 privilege) from RAM by adding the "desired address right shifted by
 10 bits with the bottom two bits set to zero" (i.e. the page frame
 number in units of 4 bytes) to the value in the PTEVADDR special
 register.  If this load succeeds, then the word is treated as a PTE
 with which to fill the TLB and use for a (restarted) memory access.
 This is extremely simple (just one extra hardware state that does just
 one thing the hardware can already do), and quite fast (only one
 memory fetch vs. e.g. the 2-5 fetches required to walk a page table on
 x86).

 This special "refill" fetch is otherwise identical to any other memory
 access, meaning it too uses the TLB to translate from a virtual to
 physical address.  Which means that the page tables occupy a 4M region
 of virtual, not physical, address space, in the same memory space
 occupied by the running code.  The 1024 pages in that range (not all
 of which might be mapped in physical memory) are a linear array of
 1048576 4-byte PTE entries, each describing a mapping for 4k of
 virtual memory.  Note especially that exactly one of those pages
 contains the 1024 PTE entries for the 4M page table itself, pointed to
 by PTEVADDR.

 Obviously, the page table memory being virtual means that the fetch
 can fail: there are 1024 possible pages in a complete page table
 covering all of memory, and the ~16 entry TLB clearly won't contain
 entries mapping all of them.  If we are missing a TLB entry for the
 page translation we want (NOT for the original requested address, we
 already know we're missing that TLB entry), the hardware has exactly
 one more special trick: it throws a TLB Miss exception (there are two,
 one each for instruction/data TLBs, but in Zephyr they operate
 identically).

 The job of that exception handler is simply to ensure that the TLB has
 an entry for the page table page we want.  And the simplest way to do
 that is to just load the faulting PTE as an address, which will then
 go through the same refill process above.  This second TLB fetch in
 the exception handler may result in an invalid/inapplicable mapping
 within the 4M page table region.  This is an typical/expected runtime
 fault, and simply indicates unmapped memory.  The result is TLB miss
 exception from within the TLB miss exception handler (i.e. while the
 EXCM bit is set).  This will produce a Double Exception fault, which
 is handled by the OS identically to a general Kernel/User data access
 prohibited exception.

 After the TLB refill exception, the original faulting instruction is
 restarted, which retries the refill process, which succeeds in
 fetching a new TLB entry, which is then used to service the original
 memory access.  (And may then result in yet another exception if it
 turns out that the TLB entry doesn't permit the access requested, of
 course.)

 ## Special Cases

 The page-tables-specified-in-virtual-memory trick works very well in
 practice.  But it does have a chicken/egg problem with the initial
 state.  Because everything depends on state in the TLB, something
 needs to tell the hardware how to find a physical address using the
 TLB to begin the process.  Here we exploit the separate
 non-automatically-refilled TLB ways to store bootstrap records.

 First, note that the refill process to load a PTE requires that the 4M
 space of PTE entries be resolvable by the TLB directly, without
 requiring another refill.  This 4M mapping is provided by a single
 page of PTE entries (which itself lives in the 4M page table region!).
 This page must always be in the TLB.

 Thankfully, for the data TLB Xtensa provides 3 special/non-refillable
 ways (ways 7-9) with at least one 4k page mapping each.  We can use
 one of these to "pin" the top-level page table entry in place,
 ensuring that a refill access will be able to find a PTE address.

 But now note that the load from that PTE address for the refill is
 done in an exception handler.  And running an exception handler
 requires doing a fetch via the instruction TLB.  And that obviously
 means that the page(s) containing the exception handler must never
 require a refill exception of its own.

 Ideally we would just pin the vector/handler page in the ITLB in the
 same way we do for data, but somewhat inexplicably, Xtensa does not
 provide 4k "pinnable" ways in the instruction TLB (frankly this seems
 like a design flaw).

 Instead, we load ITLB entries for vector handlers via the refill
 mechanism using the data TLB, and so need the refill mechanism for the
 vector page to succeed always.  The way to do this is to similarly pin
 the page table page containing the (single) PTE for the vector page in
 the data TLB, such that instruction fetches always find their TLB
 mapping via refill, without requiring an exception.

 ## Initialization

 Unlike most other architectures, Xtensa does not have a "disable" mode
 for the MMU.  Virtual address translation through the TLB is active at
 all times.  There therefore needs to be a mechanism for the CPU to
 execute code before the OS is able to initialize a refillable page
 table.

 The way Xtensa resolves this (on the hardware Zephyr supports, see the
 note below) is to have an 8-entry set ("way 6") of 512M pages able to
 cover all of memory.  These 8 entries are initialized as valid, with
 attributes specifying that they are accessible only to an ASID of 1
 (i.e. the fixed ring zero / kernel ASID), writable, executable, and
 uncached.  So at boot the CPU relies on these TLB entries to provide a
 clean view of hardware memory.

 But that means that enabling page-level translation requires some
 care, as the CPU will throw an exception ("multi hit") if a memory
 access matches more than one live entry in the TLB.  The
 initialization algorithm is therefore:

 0. Start with a fully-initialized page table layout, including the
    top-level "L1" page containing the mappings for the page table
    itself.

 1. Ensure that the initialization routine does not cross a page
    boundary (to prevent stray TLB refill exceptions), that it occupies
    a separate 4k page than the exception vectors (which we must
    temporarily double-map), and that it operates entirely in registers
    (to avoid doing memory access at inopportune moments).

 2. Pin the L1 page table PTE into the data TLB.  This creates a double
    mapping condition, but it is safe as nothing will use it until we
    start refilling.

 3. Pin the page table page containing the PTE for the TLB miss
    exception handler into the data TLB.  This will likewise not be
    accessed until the double map condition is resolved.

 4. Set PTEVADDR appropriately.  The CPU state to handle refill
    exceptions is now complete, but cannot be used until we resolve the
    double mappings.

 5. Disable the initial/way6 data TLB entries first, by setting them to
    an ASID of zero.  This is safe as the code being executed is not
    doing data accesses yet (including refills), and will resolve the
    double mapping conditions we created above.

 6. Disable the initial/way6 instruction TLBs second.  The very next
    instruction following the invalidation of the currently-executing
    code page will then cause a TLB refill exception, which will work
    normally because we just resolved the final double-map condition.
    (Pedantic note: if the vector page and the currently-executing page
    are in different 512M way6 pages, disable the mapping for the
    exception handlers first so the trap from our current code can be
    handled.  Currently Zephyr doesn't handle this condition as in all
    reasonable hardware these regions will be near each other)

 Note: there is a different variant of the Xtensa MMU architecture
 where the way 5/6 pages are immutable, and specify a set of
 unchangable mappings from the final 384M of memory to the bottom and
 top of physical memory.  The intent here would (presumably) be that
 these would be used by the kernel for all physical memory and that the
 remaining memory space would be used for virtual mappings.  This
 doesn't match Zephyr's architecture well, as we tend to assume
 page-level control over physical memory (e.g. .text/.rodata is cached
 but .data is not on SMP, etc...).  And in any case we don't have any
 such hardware to experiment with.  But with a little address
 translation we could support this.

 ## ASID vs. Virtual Mapping

 The ASID mechanism in Xtensa works like other architectures, and is
 intended to be used similarly.  The intent of the design is that at
 context switch time, you can simply change RADID and the page table
 data, and leave any existing mappings in place in the TLB using the
 old ASID value(s).  So in the common case where you switch back,
 nothing needs to be flushed.

 Unfortunately this runs afoul of the virtual mapping of the page
 refill: data TLB entries storing the 4M page table mapping space are
 stored at ASID 1 (ring 0), they can't change when the page tables
 change!  So this region naively would have to be flushed, which is
 tantamount to flushing the entire TLB regardless (the TLB is much
 smaller than the 1024-page PTE array).

 The resolution in Zephyr is to give each ASID its own PTEVADDR mapping
 in virtual space, such that the page tables don't overlap.  This is
 expensive in virtual address space: assigning 4M of space to each of
 the 256 ASIDs (actually 254 as 0 and 1 are never used by user access)
 would take a full gigabyte of address space.  Zephyr optimizes this a
 bit by deriving a unique sequential ASID from the hardware address of
 the statically allocated array of L1 page table pages.

 Note, obviously, that any change of the mappings within an ASID
 (e.g. to re-use it for another memory domain, or just for any runtime
 mapping change other than mapping previously-unmapped pages) still
 requires a TLB flush, and always will.

 ## SMP/Cache Interaction

 A final important note is that the hardware PTE refill fetch works
 like any other CPU memory access, and in particular it is governed by
 the cacheability attributes of the TLB entry through which it was
 loaded.  This means that if the page table entries are marked
 cacheable, then the hardware TLB refill process will be downstream of
 the L1 data cache on the CPU.  If the physical memory storing page
 tables has been accessed recently by the CPU (for a refill of another
 page mapped within the same cache line, or to change the tables) then
 the refill will be served from the data cache and not main memory.

 This may or may not be desirable depending on access patterns.  It
 lets the L1 data cache act as a "L2 TLB" for applications with a lot
 of access variability.  But it also means that the TLB entries end up
 being stored twice in the same CPU, wasting transistors that could
 presumably store other useful data.

 But it it also important to note that the L1 data cache on Xtensa is
 incoherent!  The cache being used for refill reflects the last access
 on the current CPU only, and not of the underlying memory being
 mapped.  Page table changes in the data cache of one CPU will be
 invisible to the data cache of another.  There is no simple way of
 notifying another CPU of changes to page mappings beyond doing
 system-wide flushes on all cpus every time a memory domain is
 modified.

 The result is that, when SMP is enabled, Zephyr must ensure that all
 page table mappings in the system are set uncached.  The OS makes no
 attempt to bolt on a software coherence layer.
	# Xtensa MMU Operation

	As with other elements of the architecture, paged virtual memory
	management on Xtensa is somewhat unique. And there is similarly a
	lack of introductory material available. This document is an attempt
	to introduce the architecture at an overview/tutorial level, and to
	describe Zephyr's specific implementation choices.

	## General TLB Operation

	The Xtensa MMU operates on top of a fairly conventional TLB cache.
	The TLB stores virtual to physical translation for individual pages of
	memory. It is partitioned into an automatically managed
	4-way-set-associative bank of entries mapping 4k pages, and 3-6
	"special" ways storing mappings under OS control. Some of these are
	for mapping pages larger than 4k, which Zephyr does not directly
	support. A few are for bootstrap and initialization, and will be
	discussed below.

	Like the L1 cache, the TLB is split into separate instruction and data
	entries. Zephyr manages both as needed, but symmetrically. The
	architecture technically supports separately-virtualized instruction
	and data spaces, but the hardware page table refill mechanism (see
	below) does not, and Zephyr's memory spaces are unified regardless.

	The TLB may be loaded with permissions and attributes controlling
	cacheability, access control based on ring (i.e. the contents of the
	RING field of the PS register) and togglable write and execute access.
	Memory access, even with a matching TLB entry, may therefore create
	Kernel/User exceptions as desired to enforce permissions choices on
	userspace code.

	Live TLB entries are tagged with an 8-bit "ASID" value derived from
	their ring field of the PTE that loaded them, via a simple translation
	specified in the RASID special register. The intent is that each
	non-kernel address space will get a separate ring 3 ASID set in RASID,
	such that you can switch between them without a TLB flush. The ASID
	value of ring zero is fixed at 1, it may not be changed. (An ASID
	value of zero is used to tag an invalid/unmapped TLB entry at
	initialization, but this mechanism isn't accessible to OS code except
	in special circumstances, and in any case there is already an invalid
	attribute value that can be used in a PTE).

	## Virtually-mapped Page Tables

	Xtensa has a unique (and, to someone exposed for the first time,
	extremely confusing) "page table" format. The simplest was to begin
	to explain this is just to describe the (quite simple) hardware
	behavior:

	On a TLB miss, the hardware immediately does a single fetch (at ring 0
	privilege) from RAM by adding the "desired address right shifted by
	10 bits with the bottom two bits set to zero" (i.e. the page frame
	number in units of 4 bytes) to the value in the PTEVADDR special
	register. If this load succeeds, then the word is treated as a PTE
	with which to fill the TLB and use for a (restarted) memory access.
	This is extremely simple (just one extra hardware state that does just
	one thing the hardware can already do), and quite fast (only one
	memory fetch vs. e.g. the 2-5 fetches required to walk a page table on
	x86).

	This special "refill" fetch is otherwise identical to any other memory
	access, meaning it too uses the TLB to translate from a virtual to
	physical address. Which means that the page tables occupy a 4M region
	of virtual, not physical, address space, in the same memory space
	occupied by the running code. The 1024 pages in that range (not all
	of which might be mapped in physical memory) are a linear array of
	1048576 4-byte PTE entries, each describing a mapping for 4k of
	virtual memory. Note especially that exactly one of those pages
	contains the 1024 PTE entries for the 4M page table itself, pointed to
	by PTEVADDR.

	Obviously, the page table memory being virtual means that the fetch
	can fail: there are 1024 possible pages in a complete page table
	covering all of memory, and the ~16 entry TLB clearly won't contain
	entries mapping all of them. If we are missing a TLB entry for the
	page translation we want (NOT for the original requested address, we
	already know we're missing that TLB entry), the hardware has exactly
	one more special trick: it throws a TLB Miss exception (there are two,
	one each for instruction/data TLBs, but in Zephyr they operate
	identically).

	The job of that exception handler is simply to ensure that the TLB has
	an entry for the page table page we want. And the simplest way to do
	that is to just load the faulting PTE as an address, which will then
	go through the same refill process above. This second TLB fetch in
	the exception handler may result in an invalid/inapplicable mapping
	within the 4M page table region. This is an typical/expected runtime
	fault, and simply indicates unmapped memory. The result is TLB miss
	exception from within the TLB miss exception handler (i.e. while the
	EXCM bit is set). This will produce a Double Exception fault, which
	is handled by the OS identically to a general Kernel/User data access
	prohibited exception.

	After the TLB refill exception, the original faulting instruction is
	restarted, which retries the refill process, which succeeds in
	fetching a new TLB entry, which is then used to service the original
	memory access. (And may then result in yet another exception if it
	turns out that the TLB entry doesn't permit the access requested, of
	course.)

	## Special Cases

	The page-tables-specified-in-virtual-memory trick works very well in
	practice. But it does have a chicken/egg problem with the initial
	state. Because everything depends on state in the TLB, something
	needs to tell the hardware how to find a physical address using the
	TLB to begin the process. Here we exploit the separate
	non-automatically-refilled TLB ways to store bootstrap records.

	First, note that the refill process to load a PTE requires that the 4M
	space of PTE entries be resolvable by the TLB directly, without
	requiring another refill. This 4M mapping is provided by a single
	page of PTE entries (which itself lives in the 4M page table region!).
	This page must always be in the TLB.

	Thankfully, for the data TLB Xtensa provides 3 special/non-refillable
	ways (ways 7-9) with at least one 4k page mapping each. We can use
	one of these to "pin" the top-level page table entry in place,
	ensuring that a refill access will be able to find a PTE address.

	But now note that the load from that PTE address for the refill is
	done in an exception handler. And running an exception handler
	requires doing a fetch via the instruction TLB. And that obviously
	means that the page(s) containing the exception handler must never
	require a refill exception of its own.

	Ideally we would just pin the vector/handler page in the ITLB in the
	same way we do for data, but somewhat inexplicably, Xtensa does not
	provide 4k "pinnable" ways in the instruction TLB (frankly this seems
	like a design flaw).

	Instead, we load ITLB entries for vector handlers via the refill
	mechanism using the data TLB, and so need the refill mechanism for the
	vector page to succeed always. The way to do this is to similarly pin
	the page table page containing the (single) PTE for the vector page in
	the data TLB, such that instruction fetches always find their TLB
	mapping via refill, without requiring an exception.

	## Initialization

	Unlike most other architectures, Xtensa does not have a "disable" mode
	for the MMU. Virtual address translation through the TLB is active at
	all times. There therefore needs to be a mechanism for the CPU to
	execute code before the OS is able to initialize a refillable page
	table.

	The way Xtensa resolves this (on the hardware Zephyr supports, see the
	note below) is to have an 8-entry set ("way 6") of 512M pages able to
	cover all of memory. These 8 entries are initialized as valid, with
	attributes specifying that they are accessible only to an ASID of 1
	(i.e. the fixed ring zero / kernel ASID), writable, executable, and
	uncached. So at boot the CPU relies on these TLB entries to provide a
	clean view of hardware memory.

	But that means that enabling page-level translation requires some
	care, as the CPU will throw an exception ("multi hit") if a memory
	access matches more than one live entry in the TLB. The
	initialization algorithm is therefore:

	0. Start with a fully-initialized page table layout, including the
	top-level "L1" page containing the mappings for the page table
	itself.

	1. Ensure that the initialization routine does not cross a page
	boundary (to prevent stray TLB refill exceptions), that it occupies
	a separate 4k page than the exception vectors (which we must
	temporarily double-map), and that it operates entirely in registers
	(to avoid doing memory access at inopportune moments).

	2. Pin the L1 page table PTE into the data TLB. This creates a double
	mapping condition, but it is safe as nothing will use it until we
	start refilling.

	3. Pin the page table page containing the PTE for the TLB miss
	exception handler into the data TLB. This will likewise not be
	accessed until the double map condition is resolved.

	4. Set PTEVADDR appropriately. The CPU state to handle refill
	exceptions is now complete, but cannot be used until we resolve the
	double mappings.

	5. Disable the initial/way6 data TLB entries first, by setting them to
	an ASID of zero. This is safe as the code being executed is not
	doing data accesses yet (including refills), and will resolve the
	double mapping conditions we created above.

	6. Disable the initial/way6 instruction TLBs second. The very next
	instruction following the invalidation of the currently-executing
	code page will then cause a TLB refill exception, which will work
	normally because we just resolved the final double-map condition.
	(Pedantic note: if the vector page and the currently-executing page
	are in different 512M way6 pages, disable the mapping for the
	exception handlers first so the trap from our current code can be
	handled. Currently Zephyr doesn't handle this condition as in all
	reasonable hardware these regions will be near each other)

	Note: there is a different variant of the Xtensa MMU architecture
	where the way 5/6 pages are immutable, and specify a set of
	unchangable mappings from the final 384M of memory to the bottom and
	top of physical memory. The intent here would (presumably) be that
	these would be used by the kernel for all physical memory and that the
	remaining memory space would be used for virtual mappings. This
	doesn't match Zephyr's architecture well, as we tend to assume
	page-level control over physical memory (e.g. .text/.rodata is cached
	but .data is not on SMP, etc...). And in any case we don't have any
	such hardware to experiment with. But with a little address
	translation we could support this.

	## ASID vs. Virtual Mapping

	The ASID mechanism in Xtensa works like other architectures, and is
	intended to be used similarly. The intent of the design is that at
	context switch time, you can simply change RADID and the page table
	data, and leave any existing mappings in place in the TLB using the
	old ASID value(s). So in the common case where you switch back,
	nothing needs to be flushed.

	Unfortunately this runs afoul of the virtual mapping of the page
	refill: data TLB entries storing the 4M page table mapping space are
	stored at ASID 1 (ring 0), they can't change when the page tables
	change! So this region naively would have to be flushed, which is
	tantamount to flushing the entire TLB regardless (the TLB is much
	smaller than the 1024-page PTE array).

	The resolution in Zephyr is to give each ASID its own PTEVADDR mapping
	in virtual space, such that the page tables don't overlap. This is
	expensive in virtual address space: assigning 4M of space to each of
	the 256 ASIDs (actually 254 as 0 and 1 are never used by user access)
	would take a full gigabyte of address space. Zephyr optimizes this a
	bit by deriving a unique sequential ASID from the hardware address of
	the statically allocated array of L1 page table pages.

	Note, obviously, that any change of the mappings within an ASID
	(e.g. to re-use it for another memory domain, or just for any runtime
	mapping change other than mapping previously-unmapped pages) still
	requires a TLB flush, and always will.

	## SMP/Cache Interaction

	A final important note is that the hardware PTE refill fetch works
	like any other CPU memory access, and in particular it is governed by
	the cacheability attributes of the TLB entry through which it was
	loaded. This means that if the page table entries are marked
	cacheable, then the hardware TLB refill process will be downstream of
	the L1 data cache on the CPU. If the physical memory storing page
	tables has been accessed recently by the CPU (for a refill of another
	page mapped within the same cache line, or to change the tables) then
	the refill will be served from the data cache and not main memory.

	This may or may not be desirable depending on access patterns. It
	lets the L1 data cache act as a "L2 TLB" for applications with a lot
	of access variability. But it also means that the TLB entries end up
	being stored twice in the same CPU, wasting transistors that could
	presumably store other useful data.

	But it it also important to note that the L1 data cache on Xtensa is
	incoherent! The cache being used for refill reflects the last access
	on the current CPU only, and not of the underlying memory being
	mapped. Page table changes in the data cache of one CPU will be
	invisible to the data cache of another. There is no simple way of
	notifying another CPU of changes to page mappings beyond doing
	system-wide flushes on all cpus every time a memory domain is
	modified.

	The result is that, when SMP is enabled, Zephyr must ensure that all
	page table mappings in the system are set uncached. The OS makes no
	attempt to bolt on a software coherence layer.