3 TRAPS - HOW SUNOS HANDLES THEM
In this section we will look at how SunOS handles traps and at
some of the alternatives that were available. Despite the
differences between SPARC v8 and v9 traps, I'll keep the
description fairly generic: the previous section already covers
how trap processing differs on v9, and the SunOS kernel simply
adheres to those rules. Instead, we'll concentrate on the
principles the kernel uses when handling various traps.

We'll look at some specifics in a moment, but first we'll cover the
generic trap handling algorithm.
When traps are handled, the typical procedure is as follows:
SPARC uses a large number of registers, some of which are globally
accessible, some are accessible only from supervisor mode and
the rest are divided up into "windows" of 8 local registers,
8 input registers and 8 output registers. The number of register
windows is implementation dependent but is usually 7 or 8.
To software, the register windows appear to be arranged in a ring:
if you keep moving round the ring you wrap around to the first
window again. This gives software the illusion of a seemingly
infinite number of register windows. Software uses the "save"
instruction to move round to a new window and "restore" to retreat
back to the previous window. This is most commonly used for
procedure calls, so that each procedure has its own private set of
local registers for its exclusive use. To help you visualize the
structure and use of register windows, imagine that each register
window is a cardboard box. Within each cardboard box there is a
set of 8 small compartments which are private to that box (the local
registers). On one side of the box, you have an "in" tray for your
input registers and on the other, an "out" tray for your output
registers. Now imagine that these boxes are arranged in a circle
with the "out" tray of each box overlapping the "in" tray of the
next box. Finally, place a box in the centre of the ring marked
"global" for the global registers.
If you assume that you are the executing code in a register window,
you will be standing in one of the boxes in the ring. You will
have exclusive use of your 8 local registers and you will be
sharing your "out" registers with the next window's "in" registers,
and your "in" registers with the previous window's "out" registers.
Imagine for example that you wish to make a function call to add two
numbers together. You place these two numbers in the first two
"out" registers and call the function. You are now executing the
addition function which, for the sake of argument, we will assume
requires some local registers to do the addition. The first thing
that the addition function code does is a "save" instruction to
move into the next register window so that it can have its own
set of local registers (step into the next cardboard box). Now
you have a new set of local registers, and the parameters to the
addition function are in your "in" registers because they overlap
the previous window's "out" registers.
To illustrate, consider this code:

            .global simple_add
            .type   simple_add, #function
    simple_add:
            save    %sp, -96, %sp   ! Change window and save some stack
            mov     %i0, %l0        ! Load parameters into locals
            mov     %i1, %l1
            add     %l0, %l1, %i0   ! Add and write result into %i0
            ret                     ! Return to main
            restore                 ! delay slot: back to previous window

            .global main
            .type   main, #function
    main:
            mov     1, %o0          ! Move '1' into output reg zero
            call    simple_add      ! Call simple_add function
            mov     2, %o1          ! delay slot: move two into %o1
            ...

In this example, we have two functions. "main" loads the values 1 and 2 into the first two output registers (%o0 and %o1) and then calls the "simple_add" function. "simple_add" does a "save" to get into a new register window, copies the parameters, which are now in the "in" registers (%i0 and %i1), into local registers, and adds them together, writing the result back into input register zero (%i0). This could be greatly optimized but this is just an example. After the add, the function returns to "main" with the "ret" instruction, and the "restore" instruction (which is executed in the delay slot of the "ret") moves us back into the previous window (step back into the original box). Now we have the result of the addition returned in our first output register (%o0).
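
The overlap between one window's "out" registers and the next
window's "in" registers can be sketched with a toy model. This is
only an illustration in Python, not how the hardware is built: the
class name WindowedRegisterFile and its methods are made up for this
example, the window count of 8 is an assumption, and overflow
handling is left out entirely.

```python
class WindowedRegisterFile:
    """Toy model of overlapping SPARC register windows (hypothetical)."""

    def __init__(self, nwindows=8):
        self.nwindows = nwindows
        self.cwp = 0  # current window pointer
        # Each window owns 8 locals and 8 outs. Its ins are not stored
        # separately: they are the outs of the window it was entered from.
        self.locals = [[0] * 8 for _ in range(nwindows)]
        self.outs = [[0] * 8 for _ in range(nwindows)]

    def save(self):
        # "save" moves to the next window; on SPARC this decrements CWP.
        self.cwp = (self.cwp - 1) % self.nwindows

    def restore(self):
        # "restore" retreats to the previous window.
        self.cwp = (self.cwp + 1) % self.nwindows

    def ins(self):
        # The caller's outs, seen from the callee's side of the overlap.
        return self.outs[(self.cwp + 1) % self.nwindows]

    def my_locals(self):
        return self.locals[self.cwp]

    def my_outs(self):
        return self.outs[self.cwp]


def simple_add(rf):
    """Mirrors the assembly example: save, add %i0 and %i1, restore."""
    rf.save()
    ins, loc = rf.ins(), rf.my_locals()
    loc[0], loc[1] = ins[0], ins[1]   # mov %i0, %l0 ; mov %i1, %l1
    ins[0] = loc[0] + loc[1]          # add %l0, %l1, %i0
    rf.restore()


rf = WindowedRegisterFile()
rf.my_outs()[0] = 1                   # mov 1, %o0
rf.my_outs()[1] = 2                   # mov 2, %o1
simple_add(rf)
print(rf.my_outs()[0])                # the caller reads the result from %o0
```

Because the callee's %i0 and the caller's %o0 are the same storage,
no copying is needed to pass the arguments in or the result back.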

    Window 0    Bit 1 is set if invalid (value = 1)
    Window 1    Bit 2 is set if invalid (value = 2)
    Window 2    Bit 3 is set if invalid (value = 4)
    Window 3    Bit 4 is set if invalid (value = 8)
    Window 4    Bit 5 is set if invalid (value = 16)
    Window 5    Bit 6 is set if invalid (value = 32)
    Window 6    Bit 7 is set if invalid (value = 64)
    Window 7    Bit 8 is set if invalid (value = 128)

So by logical bitwise operations we can test and set invalid windows in the WIM. If an attempt is made to move into an invalid window by a "save" instruction, the processor will generate a window overflow trap (window spill on SPARC v9). Conversely, if we take the opposite scenario where we attempt to retreat back into a register window that is marked invalid in the WIM via a "restore" instruction, the processor will generate a window underflow trap (or a window_fill trap on SPARC v9).
This means that we can use this behaviour to catch situations where we are about to wrap around the ring of windows onto a previously used window by marking the last window as invalid. Then, when we get the window overflow (or spill) trap, we can circumvent the problem by preserving the next valid window on the stack, making that the new invalid window and validating the currently invalid window so that we can continue into the next window safely. Also, when we retreat back through the register window circle, we can restore the previously saved window from the stack because we will get a window underflow (or fill) trap when we attempt to "restore" back into it.
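
The bitwise tests on the WIM can be written out as a short sketch.
The helper names below are invented for this illustration, and the
mask value for window w is taken from the table above (1, 2, 4 and so
on up to 128 for window 7).

```python
def wim_value(window):
    """Mask value for one window, matching the table: window 3 -> 8."""
    return 1 << window


def is_invalid(wim, window):
    """True if the given window is marked invalid in this WIM value."""
    return (wim & wim_value(window)) != 0


def mark_invalid(wim, window):
    return wim | wim_value(window)


def mark_valid(wim, window):
    return wim & ~wim_value(window)


wim = mark_invalid(0, 7)        # only window 7 is invalid
print(wim, is_invalid(wim, 7), is_invalid(wim, 6))
```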
3.2.1 Register Windows, SPARC v9 State Registers
One of the major differences between SPARC v8 and v9 is that v9 has a set of privileged state registers to describe the state of the register window file. You need to read this section to understand the sections on the SPARC v9 Window Spill and Fill traps below.
3.2.2 SPARC v7/v8 Window Overflow Handling
When a window overflow trap occurs under SunOS, the trap handler knows the following:

            !
            ! On entry:
            !
            !       %l1 = trapped %pc (save)
            !       %l2 = trapped %npc
            !
            .global window_overflow
            .type   window_overflow, #function
    window_overflow:
            !
            ! Read the current WIM
            !
            mov     %wim, %l0
            !
            ! Find out how many register windows are implemented
            ! from 'nwindows', a variable we set when we first
            ! start
            !
            sethi   %hi(nwindows), %l4
            ld      [%lo(nwindows) + %l4], %l4
            !
            ! subtract 1 from the value in nwindows so that our
            ! modulo maths will work
            !
            sub     %l4, 1, %l4
            !
            ! now rotate the WIM right by 1 (modulo nwindows) so
            ! that the next window is marked invalid. Once we have
            ! moved to the next window (to save it on the stack) we
            ! write the new WIM value. However, our calculation here
            ! is done using locals so we must preserve a global register
            ! and use that to contain the result so that we can still
            ! see it when we change windows.
            !
            mov     %g1, %l6        ! Preserve a global
            srl     %l0, 1, %l5
            sll     %l0, %l4, %l0
            or      %l0, %l5, %g1   ! %g1 = new WIM
            !
            ! move to the next window, set the new WIM value and save
            ! the volatile window registers (local's and in's) to the
            ! stack, the "outs" don't matter.
            !
            save
            mov     %g1, %wim
            std     %l0, [%sp]      ! %sp is double word aligned
            std     %l2, [%sp + 8]
            std     %l4, [%sp + 16]
            std     %l6, [%sp + 24]
            std     %i0, [%sp + 32]
            std     %i2, [%sp + 40]
            std     %i4, [%sp + 48]
            std     %i6, [%sp + 56]
            !
            ! Return to the trap window and restore the global register
            !
            restore
            mov     %l6, %g1
            !
            ! All done. Return from the trap
            !
            jmp     %l1
            rett    %l2

The actual SunOS overflow handler has much more to it. The biggest difference is that the above example doesn't differentiate between kernel and user windows whereas the SunOS one is forced to. This means that the SunOS kernel has to check that the user's stack is paged in, aligned and valid for writing. If not, the window has to be saved in a buffer temporarily while we raise a user page fault to get the page into memory. Then we can continue with handling the overflow.
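
The "rotate right by 1 (modulo nwindows)" step in the handler can be
checked with a few lines of Python. This is a sketch of the
arithmetic only: the function name is made up, and the explicit mask
stands in for the hardware behaviour of ignoring WIM bits for
unimplemented windows.

```python
def rotate_wim_right(wim, nwindows):
    """Mirror of: srl wim, 1 ; sll wim, nwindows - 1 ; or."""
    mask = (1 << nwindows) - 1            # keep only implemented windows
    low = wim >> 1                        # srl %l0, 1, %l5
    high = wim << (nwindows - 1)          # sll %l0, %l4, %l0
    return (low | high) & mask            # or  %l0, %l5, %g1


# Window 4 invalid (16) becomes window 3 invalid (8).
print(rotate_wim_right(16, 8))
# Window 0 invalid wraps around to the highest window.
print(rotate_wim_right(1, 8), rotate_wim_right(1, 7))
```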
3.2.3 SPARC v7/v8 Window Underflow Handling
As for the overflow case, the window underflow handler can safely
make some assumptions about the system state. These are:

            !
            ! On entry:
            !
            !       %l1 = trapped %pc (restore)
            !       %l2 = trapped %npc
            !
            .global window_underflow
            .type   window_underflow, #function
    window_underflow:
            !
            ! Read the current WIM
            !
            mov     %wim, %l0
            !
            ! Find out how many register windows are implemented
            ! from 'nwindows', a variable we set when we first
            ! start
            !
            sethi   %hi(nwindows), %l4
            ld      [%lo(nwindows) + %l4], %l4
            !
            ! subtract 1 from the value in nwindows so that our
            ! modulo maths will work
            !
            sub     %l4, 1, %l4
            !
            ! Rotate the WIM left by one and set that new value
            ! in the %wim register
            !
            sll     %l0, 1, %l6
            srl     %l0, %l4, %l5
            or      %l5, %l6, %l5
            mov     %l5, %wim
            !
            ! Writes to the %wim have a potential 3-cycle latency so
            ! we can't change window until then. Use 'nop' instructions
            ! in the delay cycles...
            !
            nop; nop; nop
            !
            ! Okay, now we restore twice to get into the target window
            ! (the one that was marked invalid) and we restore it from
            ! the stack
            !
            restore
            restore
            ldd     [%sp], %l0
            ldd     [%sp + 8], %l2
            ldd     [%sp + 16], %l4
            ldd     [%sp + 24], %l6
            ldd     [%sp + 32], %i0
            ldd     [%sp + 40], %i2
            ldd     [%sp + 48], %i4
            ldd     [%sp + 56], %i6
            !
            ! Get back to the trap window and return from
            ! the trap
            !
            save
            save
            jmp     %l1
            rett    %l2

Again, just as in the overflow case, the actual SunOS underflow handler has to cope with windows used in user mode and therefore it has to be sure that the user's stack is mapped in and valid.
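
To see the overflow and underflow handlers working together, here is
a small simulation, again as a sketch under simplifying assumptions.
Each window is reduced to a single value, a Python list stands in for
the stack memory that windows are saved to, and the class and
function names are invented for this example. The WIM moves exactly
as in the two handlers: one step right on overflow, one step left on
underflow.

```python
class WindowModel:
    def __init__(self, nwindows=8):
        self.n = nwindows
        self.cwp = 0
        self.wim = 1 << 1               # assume window 1 starts out invalid
        self.window = [None] * nwindows # one value stands in for a window
        self.stack = []                 # spilled windows
        self.overflows = 0
        self.underflows = 0

    def save(self):
        target = (self.cwp - 1) % self.n
        if self.wim & (1 << target):    # window overflow trap
            victim = (target - 1) % self.n
            self.stack.append(self.window[victim])  # preserve next window
            self.wim = 1 << victim                   # WIM rotated right
            self.overflows += 1
        self.cwp = target

    def restore(self):
        target = (self.cwp + 1) % self.n
        if self.wim & (1 << target):    # window underflow trap
            self.window[target] = self.stack.pop()  # reload saved window
            self.wim = 1 << ((target + 1) % self.n) # WIM rotated left
            self.underflows += 1
        self.cwp = target


def nested_calls(model, depth):
    """Call 'depth' levels deep, then return, recording what each sees."""
    model.window[model.cwp] = 0
    for level in range(1, depth + 1):
        model.save()
        model.window[model.cwp] = level
    seen = []
    for _ in range(depth):
        seen.append(model.window[model.cwp])
        model.restore()
    seen.append(model.window[model.cwp])
    return seen


m = WindowModel(nwindows=8)
print(nested_calls(m, 20))
print(m.overflows, m.underflows)
```

Every level finds its own value again on the way back, even though
the call depth is far greater than the number of windows.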
When a save instruction is executed and CANSAVE = 0, we have a
window overflow exception, called a spill trap. (If CANSAVE means
nothing to you, go back and read section 3.2.1 for a description
of the window state registers.)

If OTHERWIN is zero, then we know that no other windows are used
for alternate address spaces. In this case, the trap vector taken
is determined by the value in the WSTATE.NORMAL field. We have a
choice of eight normal spill trap vectors and the kernel selects
one by writing the corresponding value (0 to 7) into the
WSTATE.NORMAL field. Each vector occupies four trap table entries,
so spill_n_normal is trap 0x080 + 4n. For example, if
WSTATE.NORMAL contains the value '4', the spill trap taken will be
spill_4_normal (trap 0x090).

If OTHERWIN is non-zero, the same vectoring strategy is used but
this time the vector is determined by the value in the WSTATE.OTHER
field, and spill_n_other is trap 0x0A0 + 4n. Using a similar
example, if a spill trap occurs and the value contained in
WSTATE.OTHER is '4', the spill trap taken will be spill_4_other
(trap 0x0B0).
When a spill trap occurs, the CWP is changed so that we are in
the window that needs to be spilled. All we need to do is save
that window and return.
The spill trap vector entries in the trap table contain the first
32 instructions of the trap handler. In fact, the entire trap
handler can be contained within this 32 instruction space! The
basic handler/vector would look similar to this:

            !
            ! This is an example for spilling in a 32-bit address
            ! space. For 64-bit, use stx instruction and adjust
            ! the %sp offsets accordingly
            !
            ! CWP has been set so that we are in the window that
            ! we need to spill (save). Save the volatile window
            ! registers to the stack
            !
            st      %l0, [%sp + 0]
            st      %l1, [%sp + 4]
            st      %l2, [%sp + 8]
            st      %l3, [%sp + 12]
            st      %l4, [%sp + 16]
            st      %l5, [%sp + 20]
            st      %l6, [%sp + 24]
            st      %l7, [%sp + 28]
            st      %i0, [%sp + 32]
            st      %i1, [%sp + 36]
            st      %i2, [%sp + 40]
            st      %i3, [%sp + 44]
            st      %i4, [%sp + 48]
            st      %i5, [%sp + 52]
            st      %i6, [%sp + 56]
            st      %i7, [%sp + 60]
            !
            ! Now the window is saved, we do a "saved" instruction
            ! to adjust CANSAVE and CANRESTORE accordingly so that
            ! we can retry the trapped instruction without raising
            ! an exception again.
            !
            saved
            !
            ! Tell the IU to retry the trapped instruction...
            !
            retry
            !
            ! That's all folks...add nops or ".skip"s to fill
            ! the remainder of the 32 instruction space
            !

Note that it is up to the kernel to use the WSTATE.NORMAL and WSTATE.OTHER fields in its own way. This means that you may want to write your vector code differently so that you store to different address spaces. (Remember that OTHERWIN and WSTATE allow us to have more than one address space: we can use primary and secondary address space identifiers in our spill vectors and use the WSTATE fields to select which vector we use.)
Window fill handlers work in the same way as the window spill
handlers. The IU will adjust the CWP so that we are in the window
that we need to "fill", so we restore that window from the stack
and return.

Just as for the window spill case, the OTHERWIN register determines
whether we vector to a normal fill trap or an other fill trap.
The fill vectors follow the spill vectors in the trap table:
fill_n_normal is trap 0x0C0 + 4n and fill_n_other is trap
0x0E0 + 4n. To use a similar example as in the spill section,
assume that OTHERWIN is zero and the value in WSTATE.NORMAL is '4'.
This would result in us vectoring to the fill_4_normal trap
(0x0D0). If the OTHERWIN register was non-zero and WSTATE.OTHER
happened to contain the value '4', we would vector to fill_4_other
(trap 0x0F0).
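
The vector selection for both spill and fill traps can be summarised
in one small function. This is a sketch based on the trap type layout
in the SPARC v9 architecture manual, where each of these vectors
spans four trap type slots (32 instructions): spill_n_normal starts
at 0x080, spill_n_other at 0x0A0, fill_n_normal at 0x0C0 and
fill_n_other at 0x0E0. The function and parameter names are made up
for this example.

```python
SPILL_BASE = 0x080
FILL_BASE = 0x0C0
OTHER_OFFSET = 0x020


def window_trap_type(kind, otherwin, wstate_normal, wstate_other):
    """Return the trap type the IU would vector to (illustrative only)."""
    if kind not in ("spill", "fill"):
        raise ValueError("kind must be 'spill' or 'fill'")
    base = SPILL_BASE if kind == "spill" else FILL_BASE
    if otherwin != 0:
        base += OTHER_OFFSET          # use the "other" group of vectors
        n = wstate_other
    else:
        n = wstate_normal
    if not 0 <= n <= 7:
        raise ValueError("WSTATE fields are 3 bits wide")
    return base + (n << 2)            # four trap type slots per vector


print(hex(window_trap_type("spill", 0, 4, 0)))   # spill_4_normal
print(hex(window_trap_type("fill", 1, 0, 4)))    # fill_4_other
```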
As explained previously, the kernel could (and does in the case of
SunOS) use more than one address space in the register file, making
use of the OTHERWIN and WSTATE register fields. However, in order
to complete the picture, the basic fill vector would look something
like this:

            !
            ! The CWP has been set so that we are in the window
            ! to "fill". This window has been previously "spilled"
            ! so we restore from the stack
            !
            ld      [%sp + 0], %l0
            ld      [%sp + 4], %l1
            ld      [%sp + 8], %l2
            ld      [%sp + 12], %l3
            ld      [%sp + 16], %l4
            ld      [%sp + 20], %l5
            ld      [%sp + 24], %l6
            ld      [%sp + 28], %l7
            ld      [%sp + 32], %i0
            ld      [%sp + 36], %i1
            ld      [%sp + 40], %i2
            ld      [%sp + 44], %i3
            ld      [%sp + 48], %i4
            ld      [%sp + 52], %i5
            ld      [%sp + 56], %i6
            ld      [%sp + 60], %i7
            !
            ! Adjust CANSAVE and CANRESTORE accordingly
            !
            restored
            !
            ! Retry the trapped instruction
            !
            retry
            !
            ! That's it...fill the remaining 14 instructions
            ! with nops or a ".skip" directive
            !

As I mentioned in the "spill" section previously, you can use the lda (load alternate) instructions to load from a different address space if you are using more than one address space in your register window file. There are Address Space Identifiers (ASI's) which can be used to specify where to load from or store to. These represent primary and secondary address spaces, optionally with little endian access and/or user privileges (refer to the SPARC v9 architecture manual). You would use the WSTATE fields to select an appropriate trap vector, and your trap vectors would contain the sta (store alternate) or lda (load alternate) instructions to a specific ASI as required.
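
What the "saved" and "restored" instructions do to the window state
registers can be sketched as follows. The update rules are the ones
given in the SPARC v9 architecture manual; the class itself, its
name and the choice of 8 windows are assumptions made for this
illustration.

```python
class WindowState:
    """Toy copy of the v9 window state registers (hypothetical class)."""

    def __init__(self, nwindows, cansave, canrestore, otherwin=0):
        self.nwindows = nwindows
        self.cansave = cansave
        self.canrestore = canrestore
        self.otherwin = otherwin

    def consistent(self):
        # The architecture requires this sum to hold at all times.
        total = self.cansave + self.canrestore + self.otherwin
        return total == self.nwindows - 2

    def saved(self):
        # Executed at the end of a spill handler.
        self.cansave += 1
        if self.otherwin == 0:
            self.canrestore -= 1
        else:
            self.otherwin -= 1

    def restored(self):
        # Executed at the end of a fill handler.
        self.canrestore += 1
        if self.otherwin == 0:
            self.cansave -= 1
        else:
            self.otherwin -= 1


# A save trapped because CANSAVE was 0; after "saved" it can be retried.
state = WindowState(nwindows=8, cansave=0, canrestore=6)
state.saved()
print(state.cansave, state.canrestore, state.consistent())
```

After "saved", CANSAVE is non-zero again, which is why the retried
save instruction no longer traps.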
When the IU detects that an interrupt is pending, it will generate
a trap and vector into the trap table so that the interrupt can be
handled (see appendices for the numeric trap values that correspond
to the interrupts and their priorities).
The generic process for handling an interrupt in the kernel is to
raise the processor interrupt level (PIL) to the same level as the
occurring interrupt so that we do not run the risk of a lower
priority interrupt butting in on our current interrupt handler.
Then we would clear the interrupt pending bit in the system
interrupt pending register (SIPR) and handle the interrupt with
some code relevant to the interrupt type. When we are finished
with handling the interrupt, we restore the PIL to its original
value and return from the trap with a rett instruction.
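
The effect of raising the PIL can be sketched with a toy model. This
is not kernel code: the Cpu class and its method are invented for
this example, and the model simply treats an interrupt as
deliverable only when its level is above the current PIL.

```python
class Cpu:
    def __init__(self):
        self.pil = 0            # processor interrupt level
        self.pending = set()    # stands in for the pending register

    def take_interrupt(self, level, handler):
        """Run handler at 'level' if the current PIL allows it."""
        if level <= self.pil:
            self.pending.add(level)     # masked: stays pending
            return False
        old_pil = self.pil
        self.pil = level                # block this level and below
        self.pending.discard(level)     # clear the pending bit
        try:
            handler(self)
        finally:
            self.pil = old_pil          # restore the original PIL
        return True


def level10_handler(cpu):
    # While we run at PIL 10, a level-4 interrupt cannot butt in,
    # but a level-12 interrupt still can.
    print(cpu.take_interrupt(4, lambda c: None))
    print(cpu.take_interrupt(12, lambda c: None))


cpu = Cpu()
cpu.take_interrupt(10, level10_handler)
print(cpu.pil, sorted(cpu.pending))
```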
Interrupts are typically handled on a per-cpu interrupt stack and
in some cases, the interrupt is cleared and a lower-priority
soft interrupt is posted for a device driver or similar to deal
with later. We want to do as little as possible on receipt of a
high level hard interrupt to avoid difficulties with deadlocks due
to blocking I/O and lock contention (not to mention performance).
The kernel can determine whether the interrupt is a soft interrupt
by checking a specific soft interrupt bit in the interrupt
register.
The default trap table in the SunOS kernel directs most traps
through a generic trap handler front end which then decides which
lower level handler to call based on the trap type. If the trap
type is an interrupt (as in this case) the generic system trap
front end calls _interrupt() to decide what to do with it.
The _interrupt() routine compares the interrupt level against the
high level threshold of the system (the LOCK LEVEL, typically that
of the level-10 clock). If the interrupt level is below the lock
level, the interrupt will be handled as a separate interrupt thread.
If not, then _interrupt() first checks to see if this interrupt
is a level-10 clock interrupt and if it is, it jumps directly
to the level-10 handler. The level-10 handler typically calls
clock(), the function that is the root of all scheduling and
callout queue administration. (The level-10 interrupt is set to
interrupt every 10 ms by default although the clock chip is
programmable. As of Solaris 2.6, it is possible for system
administrators to affect the timing of the clock interrupts).
If the interrupt is not a level-10 interrupt (or if the interrupt
is handled on a lower-priority interrupt thread) the kernel will
have to look through all the interrupt service routines (ISR's)
that have registered an interest in this level of interrupt. If
there are no ISR's registered then the kernel will print out a
"spurious level 'x' interrupt..." message. If there are one or
more ISR's registered for a given interrupt level, the kernel
will call each ISR vector one by one. On return from each ISR,
the kernel checks the SIPR mask to see if the ISR has serviced
the interrupt and cleared the pending bit. If no ISR services
the interrupt then the kernel will print out a message on the
console like "level 'x' interrupt not serviced...".
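
The decisions described above can be put together in a short sketch.
This is an illustration only and does not reproduce the kernel's
_interrupt() routine: the function names and the registry dictionary
are made up, each ISR is modelled as a callable that returns True
when it has serviced the interrupt, and stopping at the first ISR
that succeeds is an assumption of this sketch.

```python
LOCK_LEVEL = 10      # assumed to be the level of the clock interrupt
CLOCK_LEVEL = 10


def classify(level):
    """Decide how an interrupt of this level is handled."""
    if level < LOCK_LEVEL:
        return "interrupt thread"
    if level == CLOCK_LEVEL:
        return "clock handler"
    return "high level handler"


def run_isrs(level, registry):
    """Call the ISR's registered for this level, one by one."""
    isrs = registry.get(level, [])
    if not isrs:
        return "spurious level %d interrupt" % level
    for isr in isrs:
        if isr():                       # did this ISR service it?
            return "serviced"
    return "level %d interrupt not serviced" % level


registry = {
    4: [lambda: False, lambda: True],   # second driver claims it
    6: [lambda: False],                 # nobody claims it
}
print(classify(4), "/", run_isrs(4, registry))
print(classify(6), "/", run_isrs(6, registry))
print(classify(13), "/", run_isrs(13, registry))
```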
Note that the higher level interrupts which are not handled with
interrupt threads in effect commandeer the current thread running
on the processor. This is necessary because scheduling of threads
is initiated by the receipt of a level-10 interrupt, so it does
not make sense to use an interrupt thread, which is subject to
level-10 scheduling, for interrupts of a higher priority! However,
this has implications for writers of high level ISR's: they cannot
use any blocking calls (or locks) that depend on lower priority
threads. Otherwise they risk a deadlock in which the high priority
ISR is blocked on a resource held by a lower priority thread, and
that resource will never be freed because the PIL has been set to
the current interrupt level. With the PIL raised, level-10 clock
interrupts won't occur, so threads won't get scheduled and the
owning thread can never release the resource. For this reason,
high level ISR's do as much as they are able to and then post a
soft interrupt so that an interrupt thread can be started at a
lower priority to finish off handling the interrupt, if necessary.
A text or data fault occurs when an access is attempted to an address that is not valid in the MMU. For example, if an attempt is made to fetch an instruction from the address in %pc but that address is not backed by a valid mapping, then the IU will raise a text fault. Alternatively, if the action of executing an instruction attempts to write or read from an address that is not backed by a valid mapping, a data fault would be raised.
Basically, what happens is that the action of fetching an instruction (and often the action of executing that instruction) will cause virtual addresses to be presented to the MMU (either the sunmmu on sun4 and sun4c systems or the SPARC reference MMU (srmmu) on sun4m).
The MMU will look through its translation lookaside buffer (TLB), which is its cache of recent virtual to physical translations, to see if it has an entry for this virtual address. If it does, it returns the appropriate physical address so that the hardware can access the actual target address. If the virtual address does not exist in the TLB, the MMU will then walk through its translation tables in memory to resolve the address to a valid physical address. Ultimately the walk leads down to a table of page table entries (PTE's), which contain various flags that indicate whether the page is valid (mapped in) or not. If the PTE does not exist, or if the valid flag is not set, or if the page is marked as swapped out, the MMU raises an exception, which in turn causes the IU to raise a text or data fault according to the type of access (instruction fetch or data access). This description is pretty generic as it would take quite a lot of detail on the MMU hardware and software table layout to explain it at a lower level. Anyway, you get the picture I hope.
The kernel is notified of this MMU exception when the text/data fault comes in through the trap table. The fault handler will then attempt to correct the fault by mapping in the required page. For example, a text fault may indicate to the kernel that the next page of program text needs to be mapped in from the executable file. The kernel would then access the executable file via the vnode to satisfy that mapping. Likewise, if it's a data fault, the kernel will attempt to retrieve the page either from the executable's data segment in the file or from swap if the page was swapped out.
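
The lookup sequence (TLB first, then the translation tables, then a
fault) can be sketched as follows. This is a deliberately simplified
model and does not describe either real MMU: a single Python
dictionary stands in for the multi-level tables, the 4 KB page size
is an assumption, and all names are invented for this example.

```python
PAGE_SIZE = 4096    # assumed page size


class MMUFault(Exception):
    def __init__(self, kind, vaddr):
        super().__init__("%s fault at 0x%x" % (kind, vaddr))
        self.kind = kind        # "text" or "data"
        self.vaddr = vaddr


def translate(vaddr, access, tlb, page_table):
    """Return the physical address, or raise a text/data fault.

    access is "fetch" for an instruction fetch, anything else for data.
    tlb maps virtual page number -> physical page number.
    page_table maps virtual page number -> {"pfn": ..., "valid": ...}.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                              # TLB hit
        return tlb[vpn] * PAGE_SIZE + offset
    pte = page_table.get(vpn)                   # table walk on a miss
    if pte is None or not pte["valid"]:
        raise MMUFault("text" if access == "fetch" else "data", vaddr)
    tlb[vpn] = pte["pfn"]                       # cache the translation
    return pte["pfn"] * PAGE_SIZE + offset


tlb = {}
page_table = {1: {"pfn": 7, "valid": True}, 2: {"pfn": 9, "valid": False}}
print(hex(translate(0x1004, "load", tlb, page_table)))
try:
    translate(0x2000, "fetch", tlb, page_table)
except MMUFault as fault:
    print(fault)
```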
If the address is totally bogus (i.e. a bad pointer) then there will be no way to satisfy the fault and this will result in either a "bad trap, text/data fault" panic if it's a kernel fault or a segmentation fault signal (SIGSEGV) being sent to the user process if it's a user fault. One other possibility is a SIGBUS to signal a bus error. This usually happens when the user fault would normally be satisfied by mapping in the page, but the page no longer exists for some reason or the underlying executable file has been changed.
Some UNIX implementations do disallow this and return ETXTBSY to the writing process but Solaris will allow writes to executing files. This is a direct consequence of the new vm system. The system must be able to write mapped files, including those that are mapped for execution. Since with NFS there's no way to determine that someone is executing out of a file, and with shared libraries it's difficult to determine that someone is executing out of a library, and since there are a number of other applications in which it is important to be able to write mappings that are also being executed (self-modifying code), ETXTBSY no longer makes sense.
In the case of a kernel initiated fault, we can tell that it is a kernel fault because the processor was in supervisor mode when the trap was taken and this is indicated in the PSR. Also, we know that kernel pages are never swapped out so that cuts down on page fault processing overhead. The text faulting facility is the basis of demand paging.
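
The possible outcomes of a text or data fault can be summarised in a
small decision function. This is only a sketch of the cases
described above: the function name and its three boolean inputs are
made up, and treating every unresolvable kernel fault as a panic is
a simplifying assumption.

```python
def fault_outcome(supervisor, address_mapped, backing_available):
    """Summarise what happens after a text or data fault.

    supervisor        -- the PSR showed supervisor mode when trapping
    address_mapped    -- the address belongs to a valid mapping
    backing_available -- the page can still be fetched (file or swap)
    """
    if address_mapped and backing_available:
        return "map in the page and retry"
    if supervisor:
        return "panic: bad trap"
    if not address_mapped:
        return "SIGSEGV"
    return "SIGBUS"


print(fault_outcome(False, True, True))     # ordinary demand paging
print(fault_outcome(False, False, False))   # bad pointer in a process
print(fault_outcome(False, True, False))    # backing page has gone away
print(fault_outcome(True, False, False))    # bad pointer in the kernel
```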