arch/xtensa/core/README-WINDOWS.rst - third_party/github/zephyrproject-rtos/zephyr - Git at Google

 # How Xtensa register windows work

 There is a paucity of introductory material on this subject, and
 Zephyr plays some tricks here that require understanding the base
 layer.

 ## Hardware

 When register windows are configured in the CPU, there are either 32
 or 64 "real" registers in hardware, with 16 visible at one time.
 Registers are grouped and rotated in units of 4, so there are 8 or 16
 such "quads" (my term, not Tensilica's) in hardware of which 4 are
 visible as A0-A15.

 The first quad (A0-A3) is pointed to by a special register called
 WINDOWBASE.  The register file is cyclic, so for example if NREGS==64
 and WINDOWBASE is 15, quads 15, 0, 1, and 2 will be visible as
 (respectively) A0-A3, A4-A7, A8-A11, and A12-A15.

 There is a ROTW instruction that can be used to manually rotate the
 window by a immediate number of quads that are added to WINDOWBASE.
 Positive rotations "move" high registers into low registers
 (i.e. after "ROTW 1" the register that used to be called A4 is now
 A0).

 There are CALL4/CALL8/CALL12 instructions to effect rotated calls
 which rotate registers upward (i.e. "hiding" low registers from the
 callee) by 1, 2 or 3 quads.  These do not rotate the window
 themselves.  Instead they place the rotation amount in two places
 (yes, two; see below): the 2-bit CALLINC field of the PS register, and
 the top two bits of the return address placed in A0.

 There is an ENTRY instruction that does the rotation.  It adds CALLINC
 to WINDOWBASE, at the same time copying the old (now hidden) stack
 pointer in A1 into the "new" A1 in the rotated frame, subtracting an
 immediate offset from it to make space for the new frame.

 There is a RETW instruction that undoes the rotation.  It reads the
 top two bits from the return address in A0 and subtracts that value
 from WINDOWBASE before returning.  This is why the CALLINC bits went
 in two places.  They have to be stored on the stack across potentially
 many calls, so they need to be GPR data that lives in registers and
 can be spilled.  But ENTRY isn't specified to assume a particular
 return value format and is used immediately, so it makes more sense
 for it to use processor state instead.

 Note that we still don't know how to detect when the register file has
 wrapped around and needs to be spilled or filled.  To do this there is
 a WINDOWSTART register used to detect which register quads are in use.
 The name "start" is somewhat confusing, this is not a pointer.
 WINDOWSTART stores a bitmask with one bit per hardware quad (so it's 8
 or 16 bits wide).  The bit in windowstart corresponding to WINDOWBASE
 will be set by the ENTRY instruction, and remain set after rotations
 until cleared by a function return (by RETW, see below).  Other bits
 stay zero.  So there is one set bit in WINDOWSTART corresponding to
 each call frame that is live in hardware registers, and it will be
 followed by 0, 1 or 2 zero bits that tell you how "big" (how many
 quads of registers) that frame is.

 So the CPU executing RETW checks to make sure that the register quad
 being brought into A0-A3 (i.e. the new WINDOWBASE) has a set bit
 indicating it's valid. If it does not, the registers must have been
 spilled and the CPU traps to an exception handler to fill them.

 Likewise, the processor can tell if a high register is "owned" by
 another call by seeing if there is a one in WINDOWSTART between that
 register's quad and WINDOWBASE.  If there is, the CPU traps to a spill
 handler to spill one frame.  Note that a frame might be only four
 registers, but it's possible to hit registers 12 out from WINDOWBASE,
 so it's actually possible to trap again when the instruction restarts
 to spill a second quad, and even a third time at maximum.

 Finally: note that hardware checks the two bits of WINDOWSTART after
 the frame bit to detect how many quads are represented by the one
 frame.  So there are six separate exception handlers to spill/fill
 1/2/3 quads of registers.

 ## Software & ABI

 The advantage of the scheme above is that it allows the registers to
 be spilled naturally into the stack by using the stack pointers
 embedded in the register file.  But the hardware design assumes and to
 some extent enforces a fairly complicated stack layout to make that
 work:

 The spill area for a single frame's A0-A3 registers is not in its own
 stack frame.  It lies in the 16 bytes below its CALLEE's stack
 pointer.  This is so that the callee (and exception handlers invoked
 on its behalf) can see its caller's potentially-spilled stack pointer
 register (A1) on the stack and be able to walk back up on return.
 Other architectures do this too by e.g. pushing the incoming stack
 pointer onto the stack as a standard "frame pointer" defined in the
 platform ABI.  Xtensa wraps this together with the natural spill area
 for register windows.

 By convention spill regions always store the lowest numbered register
 in the lowest address.

 The spill area for a frame's A4-A11 registers may or may not exist
 depending on whether the call was made with CALL8/CALL12.  It is legal
 to write a function using only A0-A3 and CALL4 calls and ignore higher
 registers.  But if those 0-2 register quads are in use, they appear at
 the top of the stack frame, immediately below the parent call's A0-A3
 spill area.

 There is no spill area for A12-A15.  Those registers are always
 caller-save.  When using CALLn, you always need to overlap 4 registers
 to provide arguments and take a return value.
	# How Xtensa register windows work

	There is a paucity of introductory material on this subject, and
	Zephyr plays some tricks here that require understanding the base
	layer.

	## Hardware

	When register windows are configured in the CPU, there are either 32
	or 64 "real" registers in hardware, with 16 visible at one time.
	Registers are grouped and rotated in units of 4, so there are 8 or 16
	such "quads" (my term, not Tensilica's) in hardware of which 4 are
	visible as A0-A15.

	The first quad (A0-A3) is pointed to by a special register called
	WINDOWBASE. The register file is cyclic, so for example if NREGS==64
	and WINDOWBASE is 15, quads 15, 0, 1, and 2 will be visible as
	(respectively) A0-A3, A4-A7, A8-A11, and A12-A15.

	There is a ROTW instruction that can be used to manually rotate the
	window by a immediate number of quads that are added to WINDOWBASE.
	Positive rotations "move" high registers into low registers
	(i.e. after "ROTW 1" the register that used to be called A4 is now
	A0).

	There are CALL4/CALL8/CALL12 instructions to effect rotated calls
	which rotate registers upward (i.e. "hiding" low registers from the
	callee) by 1, 2 or 3 quads. These do not rotate the window
	themselves. Instead they place the rotation amount in two places
	(yes, two; see below): the 2-bit CALLINC field of the PS register, and
	the top two bits of the return address placed in A0.

	There is an ENTRY instruction that does the rotation. It adds CALLINC
	to WINDOWBASE, at the same time copying the old (now hidden) stack
	pointer in A1 into the "new" A1 in the rotated frame, subtracting an
	immediate offset from it to make space for the new frame.

	There is a RETW instruction that undoes the rotation. It reads the
	top two bits from the return address in A0 and subtracts that value
	from WINDOWBASE before returning. This is why the CALLINC bits went
	in two places. They have to be stored on the stack across potentially
	many calls, so they need to be GPR data that lives in registers and
	can be spilled. But ENTRY isn't specified to assume a particular
	return value format and is used immediately, so it makes more sense
	for it to use processor state instead.

	Note that we still don't know how to detect when the register file has
	wrapped around and needs to be spilled or filled. To do this there is
	a WINDOWSTART register used to detect which register quads are in use.
	The name "start" is somewhat confusing, this is not a pointer.
	WINDOWSTART stores a bitmask with one bit per hardware quad (so it's 8
	or 16 bits wide). The bit in windowstart corresponding to WINDOWBASE
	will be set by the ENTRY instruction, and remain set after rotations
	until cleared by a function return (by RETW, see below). Other bits
	stay zero. So there is one set bit in WINDOWSTART corresponding to
	each call frame that is live in hardware registers, and it will be
	followed by 0, 1 or 2 zero bits that tell you how "big" (how many
	quads of registers) that frame is.

	So the CPU executing RETW checks to make sure that the register quad
	being brought into A0-A3 (i.e. the new WINDOWBASE) has a set bit
	indicating it's valid. If it does not, the registers must have been
	spilled and the CPU traps to an exception handler to fill them.

	Likewise, the processor can tell if a high register is "owned" by
	another call by seeing if there is a one in WINDOWSTART between that
	register's quad and WINDOWBASE. If there is, the CPU traps to a spill
	handler to spill one frame. Note that a frame might be only four
	registers, but it's possible to hit registers 12 out from WINDOWBASE,
	so it's actually possible to trap again when the instruction restarts
	to spill a second quad, and even a third time at maximum.

	Finally: note that hardware checks the two bits of WINDOWSTART after
	the frame bit to detect how many quads are represented by the one
	frame. So there are six separate exception handlers to spill/fill
	1/2/3 quads of registers.

	## Software & ABI

	The advantage of the scheme above is that it allows the registers to
	be spilled naturally into the stack by using the stack pointers
	embedded in the register file. But the hardware design assumes and to
	some extent enforces a fairly complicated stack layout to make that
	work:

	The spill area for a single frame's A0-A3 registers is not in its own
	stack frame. It lies in the 16 bytes below its CALLEE's stack
	pointer. This is so that the callee (and exception handlers invoked
	on its behalf) can see its caller's potentially-spilled stack pointer
	register (A1) on the stack and be able to walk back up on return.
	Other architectures do this too by e.g. pushing the incoming stack
	pointer onto the stack as a standard "frame pointer" defined in the
	platform ABI. Xtensa wraps this together with the natural spill area
	for register windows.

	By convention spill regions always store the lowest numbered register
	in the lowest address.

	The spill area for a frame's A4-A11 registers may or may not exist
	depending on whether the call was made with CALL8/CALL12. It is legal
	to write a function using only A0-A3 and CALL4 calls and ignore higher
	registers. But if those 0-2 register quads are in use, they appear at
	the top of the stack frame, immediately below the parent call's A0-A3
	spill area.

	There is no spill area for A12-A15. Those registers are always
	caller-save. When using CALLn, you always need to overlap 4 registers
	to provide arguments and take a return value.