sysdumpdev -L
Note the primary dump device name, and use it where you see /dev/hd# in the following steps.
In AIX Version 3.2, the primary dump device will probably be /dev/hd7.
In AIX Version 4.x, the default dump device is set to /dev/hd6--your paging device. DO NOT use this device name in the following steps.
In some cases, the dump will be copied to the /var/adm/ras directory. In this case, use /var/adm/ras/vmcore.x instead of /dev/hd#. If, when your system started, you were prompted to put the dump on tape, extract the dump from tape to a directory such as /tmp/dump, and then use /tmp/dump/dump_name instead of /dev/hd#.
If you have set up a dedicated dump device like /dev/dumplv, use /dev/dumplv in the following steps.
crash /dev/<hd#>
To verify the usability of the dump information, look for the following output:
Using /unix as the default namelist file
Reading in symbols.........
TRAP | An assert statement in the code caused the system to crash because it was not true. |
INVALID OPERATION | There is probably a wild branch, and the instruction to be executed is not valid. |
DSI (Data Storage Interrupt) | This means that there was an addressing exception on a data fetch. |
ISI (Instruction Storage Interrupt) | This means that there was an addressing exception on an instruction fetch. |
HANG | The system is hung and the user must force a system dump. |
The types of dumps can be differentiated by the following symptoms:
888 102 700 0c0 | This LED sequence indicates a trap or invalid operation (differentiated by errlog entry). |
888 102 300 0c0 | This LED sequence indicates a DSI. |
888 102 400 0c0 | This LED sequence indicates an ISI. |
No process can proceed and no external interrupts are accepted. | This indicates a hang. |
An example of the errlog entry follows:
ERROR LABEL: PROGRAM_INT
...
Detail Data
Segment Register, SEGREG
0000 0000
Machine Status Save/Restore Register 0
0009 AC50
Machine Status Save/Restore Register 1
0002 0000
Machine State Register, MSR
0002 90B0
Please note the Machine Status Save/Restore Register 0 (9AC50 in the preceding example). This is the address at which the trap occurred.
If you have a version of the crash command which accepts the -m option on the trace subcommand (trace -m), you can easily identify where the trap occurred.
Here is an example. Input is shown with the ">" crash command prompt:
> trace -m
MST STACK TRACE:
0x1f8154 (excpt=0:0:0:0:0) (intpri=0)
IAR: 00007148 .i_enable + ac: bcr 14,0
*LR: 013d87a4 [tokdd:tokoflv] + 294c
00000000: 00000000 <invalid>
...
0x2ff98000 (excpt=0:40000000:618f:2ff95508:106) (intpri=b)
IAR: 0009ac50 .freeiblk + 54: ti 4,r7,0x0
*LR: 0009ac38 .freeiblk + 3c
*2ff97288: 00000000 <invalid>
2ff97368: 0009b084 .ifreeind + 3c
...
Find the IAR line that contains the Machine Status Save/Restore Register 0 value from the errlog (0009ac50 in this example). The ti on that IAR: line indicates a trap instruction at .freeiblk + 54.
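The matching step can also be done mechanically. The following Python sketch is an illustration only and is not part of the crash command. It assumes that the errlog prints the register as two 16-bit halves (as in "0009 AC50") and that the trace -m output has been saved as text with one IAR entry per line. The function names errlog_reg_to_int and find_iar_line are hypothetical.

```python
import re

def errlog_reg_to_int(value):
    """Join the two halves printed in the errlog (e.g. '0009 AC50') into one integer."""
    return int(value.replace(" ", ""), 16)

def find_iar_line(trace_text, address):
    """Return the first 'IAR:' line of saved trace -m output whose address equals `address`."""
    for line in trace_text.splitlines():
        # Only lines of the form 'IAR: <8 hex digits> ...' are considered.
        m = re.match(r"\s*IAR:\s*([0-9a-fA-F]{8})\b", line)
        if m and int(m.group(1), 16) == address:
            return line.strip()
    return None

# Sample lines copied from the trace -m example in this document.
trace = """\
0x1f8154 (excpt=0:0:0:0:0) (intpri=0)
IAR: 00007148 .i_enable + ac: bcr 14,0
0x2ff98000 (excpt=0:40000000:618f:2ff95508:106) (intpri=b)
IAR: 0009ac50 .freeiblk + 54: ti 4,r7,0x0
"""

addr = errlog_reg_to_int("0009 AC50")
print(hex(addr))
print(find_iar_line(trace, addr))
```

The same helper applies to an invalid operation: pass the Machine Status Save/Restore Register 0 value from that errlog entry instead.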
An example of the errlog entry follows:
ERROR LABEL: PROGRAM_INT
...
Detail Data
Segment Register, SEGREG
0000 0000
Machine Status Save/Restore Register 0
000B E4F8
Machine Status Save/Restore Register 1
0008 0000
Machine State Register, MSR
0002 90B0
Please note the Machine Status Save/Restore Register 0 (BE4F8 in the preceding example). This is the address at which the invalid operation occurred.
If you have a version of crash which accepts the -m option on the trace subcommand (trace -m), you can easily identify an invalid operation.
Here is an example. Input is shown with the ">" crash command prompt:
> trace -m
MST STACK TRACE:
0x1f8154 (excpt=0:0:0:0:0) (intpri=0)
IAR: 00007148 .i_enable + ac: bcr 14,0
*LR: 013d87a4 [tokdd:tokoflv] + 294c
00000000: 00000000 <invalid>
...
0x2ff98000 (excpt=0:42000000:40001164:2ff7fffc:106) (intpri=b)
IAR: 000be4f8 <invalid>: ??? (0x9872c)
*LR: 000356ec .soo_ioctl + 634
*2ff979f8: 00000000 <invalid>
2ff97a58: 000a4d88 .fp_ioctl + 68
2ff97ab8: 01d00b90 sna_sysx:luxgosp + 9b90
2ff97d88: 01cfdde4 sna_sysx:luxioctl + 6de4
2ff97de8: 00082ba8 .rdevioctl + b0
2ff97e28: 00084054 .spec_ioctl + 20
2ff97ed8: 0009891c .vno_ioctl + 110
2ff97fa8: 000a4c84 .kioctl + e8
Find the IAR line that contains the Machine Status Save/Restore Register 0 value from the errlog (000be4f8 in this example).
The IAR is listed as invalid in the trace output; however, the rest of the stack is valid information.
An example of the errlog entry follows:
ERROR LABEL: DSI_PROC
...
Detail Data
Data Storage Interrupt Status Register
4000 0000
Data Storage Interrupt Address Register
007F FFFF
Segment Register, SEGREG
632E 6108
EXVAL
0000 000E
If you have a version of the crash command which accepts the -m option on the trace subcommand (trace -m), you can easily identify where the DSI occurred.
Here is an example. Input is shown with the ">" crash command prompt:
> trace -m
MST STACK TRACE:
0x211db0 (excpt=0:0:0:0:0) (intpri=0)
IAR: 00009e00 .v_copypage_pwr + 58: dclz r7,r6
*LR: 0005be64 .getvmpage + 128
*00211bb8: 00000000 <invalid>
IAR not in kernel segment.
0x2ff98000 (excpt=632e6108:40000000:7fffff:632e6108:106) (intpri=b)
IAR: 0147ded4 [smt_load:smconnect] + cd4: l r0,0x8(r5)
*LR: 0147de10 [smt_load:smconnect] + c10
*2ff97fa8: 00000000 <invalid>
00000000: 000036dc <invalid>
You can identify the correct IAR stanza by the first three values following excpt= on the line above the first and third appearances of IAR. In the third appearance, those three numbers are 632e6108:40000000:7fffff. Each of these three numbers can be found in the DSI_PROC error log entry: the first number is the Segment Register, SEGREG, the second number is the Data Storage Interrupt Status Register, and the third number is the Data Storage Interrupt Address Register.
In the case of an ISI_PROC error log entry, the first number is the Segment Register, SEGREG, the second number is the ISISR, and the third number is the ISIR0. See the section on Instruction Storage Interrupts.
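As an illustration of this comparison, the following Python sketch extracts the excpt= values from a trace -m header line and checks the first three against the error log values. It is a minimal sketch, not a crash feature. The function names parse_excpt and matches_errlog are hypothetical, and the code assumes the header line has the "(excpt=a:b:c:d:e)" shape shown in the example.

```python
def parse_excpt(header):
    """Return the colon-separated hexadecimal values from '(excpt=a:b:c:d:e)' as integers."""
    start = header.index("excpt=") + len("excpt=")
    end = header.index(")", start)
    return [int(field, 16) for field in header[start:end].split(":")]

def matches_errlog(header, segreg, status_reg, address_reg):
    """True if the first three excpt values equal SEGREG, the status register,
    and the address register from the DSI_PROC (or ISI_PROC) error log entry."""
    return parse_excpt(header)[:3] == [segreg, status_reg, address_reg]

# Values from the DSI_PROC example: SEGREG, DSISR, and the
# Data Storage Interrupt Address Register, with the two halves joined.
SEGREG = 0x632E6108
DSISR = 0x40000000
DSI_ADDRESS = 0x007FFFFF

first = "0x211db0 (excpt=0:0:0:0:0) (intpri=0)"
second = "0x2ff98000 (excpt=632e6108:40000000:7fffff:632e6108:106) (intpri=b)"

print(matches_errlog(first, SEGREG, DSISR, DSI_ADDRESS))
print(matches_errlog(second, SEGREG, DSISR, DSI_ADDRESS))
```

Comparing integers rather than strings avoids mismatches caused by leading zeros (7fffff in the trace, 007F FFFF in the error log).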
A DSI (or ISI) also shows up in the vmmerrlog. The vmmerrlog information can be seen in the detail data section of the DSI_PROC (or ISI_PROC) error report entry:
EXVAL 0000 000E
You can also view the information with crash subcommands. Here is an example. Input is shown with the ">" crash command prompt:
> od vmmerrlog 9 a
00056a20: 20faed7f 53595356 4d4d2000 00000000 | .."SYSVMM .....|
00056a30: 00000000 40000000 007fffff 632e6108 |....@...."..c.a.|
00056a40: 0000000e |....|
In this case, the return code from VMM is 0000000e. The
DSISR in the preceding example is 40000000.
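The words in the od output can be pulled apart with a short script. The following Python sketch is illustrative only. The function name parse_od_words is hypothetical, and the word positions used below (DSISR in the sixth word, return code in the ninth) are read off this one example and are an assumption, not a documented layout of vmmerrlog.

```python
def parse_od_words(od_output):
    """Collect the hexadecimal words from saved 'od' subcommand output, in order."""
    words = []
    for line in od_output.splitlines():
        if ":" not in line:
            continue  # skip the command line and any blank lines
        # Keep the text between the address and the ASCII column.
        body = line.split(":", 1)[1].split("|", 1)[0]
        words.extend(int(w, 16) for w in body.split())
    return words

# Output copied from the example in this document.
od_output = '''> od vmmerrlog 9 a
00056a20: 20faed7f 53595356 4d4d2000 00000000 | .."SYSVMM .....|
00056a30: 00000000 40000000 007fffff 632e6108 |....@...."..c.a.|
00056a40: 0000000e |....|
'''

words = parse_od_words(od_output)
# Positions below are assumed from this example only.
print("DSISR:           %08x" % words[5])
print("VMM return code: %08x" % words[8])
```

In this example the seventh and eighth words (007fffff and 632e6108) equal the Data Storage Interrupt Address Register and SEGREG values from the error log entry.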
Check the error log for disk or SCSI errors. If there are
disk or SCSI errors for disks that are not part of rootvg
and do not contain paging space, contact AIX defect support
and send in the crash information.
Otherwise, contact a customer engineer to run diagnostics.
An example of the errlog entry:
Refer to the section Data Storage Interrupts in this
document. You can find the correct IAR stanza for an ISI in
the same way as for a DSI.
Refer to the section Forcing a system dump
for details on how to force a dump.
NOTE:
If you work with AIX support on your problem, it is
useful to describe in detail the conditions and events
that led to the hang. For example, "After running XYZ
application for two hours, there is no response to the
keyboard input. I cannot get a response at the system
console or at any terminals directly connected to the
system unit, and I cannot rlogin or ping the system."
In a dump that was forced
because of a hang, locks are one thing to look at. A few locks are:
View lock information with crash subcommands. The
following example uses proc_lock; the same format can be
used for kernel_lock or net_lock. (Input is shown with the
">" crash command prompt.)
For AIX Version 3.2 enter:
For AIX Versions 4.x enter:
If what is returned consists of a series of f's,
no process is holding that lock.
If there is a process holding the lock, the process ID will be in the field
occupied by ffffffff in the preceding example.
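The rule above can be expressed as a small check. This Python sketch is an illustration, not part of crash. The function name lock_holder is hypothetical, and the code assumes a single od output line of the form "address: value", as in the AIX Version 3.2 proc_lock example. The second sample line is made up to show the held case.

```python
def lock_holder(od_line):
    """Return the ID held in the lock word, or None if the word is all f's (lock not held)."""
    value = int(od_line.split(":", 1)[1].split()[0], 16)
    return None if value == 0xFFFFFFFF else value

# Line copied from the proc_lock example in this document: no holder.
print(lock_holder("0002a728: ffffffff"))

# Hypothetical line, for illustration only: the lock word holds an ID.
print(lock_holder("0002a728: 00003236"))
```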
You can look at the locks to see if there is a
deadlock situation. If there is NOT a deadlock situation, you
can look at the kernel stack trace of any process or of the
running process to attempt to determine what caused the
system to hang.
NOTE: At AIX Version 4.x you can use the dlock subcommand
to detect deadlock situations.
Look at the NAME column to find a process in
which you are interested. Use trace -k
process_slot_number to display its kernel stack
trace, if any exists.
For example:
NOTE: In AIX Version 4.x the pcb subcommand is
replaced with the tcb subcommand.
In the preceding example, the curid value on
the third line of output is 0x00003236. Obtain the process
slot number from the curid by shifting the
curid eight bits to the right. In this case, the
process slot number is 0x32. Convert that value from
hexadecimal to decimal; in this case, 0x32
(hexadecimal) = 50 (decimal). The process slot number
is 50 (decimal).
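The conversion is a single shift. The following Python sketch reproduces the arithmetic for the curid in the example. The function name slot_from_curid is hypothetical.

```python
def slot_from_curid(curid):
    """Shift the curid eight bits to the right to obtain the process slot number."""
    return curid >> 8

curid = 0x00003236
slot = slot_from_curid(curid)
print("slot (hexadecimal): %x" % slot)  # 32
print("slot (decimal):     %d" % slot)  # 50
```

The decimal value is the one to pass to trace -k.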
If the user cannot telnet, rlogin, or ping to the
system, it indicates a hung system. Another
indication is if the user can ping the system but
the rest of the system is unavailable.
Chances are the system will hang again. The steps below
will prepare the system for a forced dump when and if this
event recurs.
Run the following command:
When the system hangs again, proceed according to the type of system:
Unless the default dump configuration has been modified
with the sysdumpdev command, the dump will be copied to
/var/adm/ras/vmcore.x when the system is powered on.
NOTE: If /var is too small to hold the dump, the system
will prompt the user to copy the dump to external media
such as tape or preformatted diskettes. If a tape
drive is not connected, the system will prompt for
diskettes. Using diskettes is NOT a recommended method
of collecting a dump. If you are unable to save the
dump, AIX support will not be able to determine what caused your
system to crash.
[ Doc Ref: 90605199714606 Publish Date: Oct. 16, 2000 4FAX Ref: 1828 ]
Virtual Memory Manager (VMM) return codes and meanings
For all of the following codes except 00000005, there is
nothing the user can do to fix the problem; the crash
information must be analyzed.
0000000E - EFAULT
This is EFAULT, errno 14 (hexadecimal E) from errno.h. It is
returned if you attempt to store to an invalid address.
fffffffa - Invalid Address Not in Memory
This is usually the result of a page fault. This code will be
returned if you try to access something that is paged out
while interrupts are disabled.
00000005 - I/O Error
This is a hardware problem. An I/O error occurred when you
tried to page in or out, or you tried to access a memory mapped
file and could not do it.
00000086 - Protection Exception
This means that you tried to store to a location that is
protected. This is usually caused by low kernel memory.
0000001C - NO PAGING SPACE
This means that the system has exhausted its paging space.
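For quick reference while reading a dump, the codes above can be kept in a lookup table. The following Python sketch simply restates the list in this section. The names VMM_RETURN_CODES and describe_vmm_code are hypothetical.

```python
# Codes and labels as listed in this section.
VMM_RETURN_CODES = {
    0x0000000E: "EFAULT",
    0xFFFFFFFA: "Invalid Address Not in Memory",
    0x00000005: "I/O Error",
    0x00000086: "Protection Exception",
    0x0000001C: "NO PAGING SPACE",
}

def describe_vmm_code(code):
    """Return a one-line description of a VMM return code taken from EXVAL or vmmerrlog."""
    label = VMM_RETURN_CODES.get(code)
    if label is None:
        return "%08x - not listed in this document" % code
    # 00000005 is the only code this section treats as a hardware problem.
    note = "hardware problem" if code == 0x00000005 else "crash information must be analyzed"
    return "%08x - %s (%s)" % (code, label, note)

print(describe_vmm_code(0x0000000E))
print(describe_vmm_code(0x00000005))
```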
DSISR - Data Storage Interrupt Status Register
The values for the DSISR are in /usr/include/sys/machine.h.
In the preceding example, the DSISR is 40000000, which indicates
a page fault.
Instruction Storage Interrupts
An ISI (Instruction Storage Interrupt) is the cause of the
dump if the system crashed with LED sequence 888 102 400
0c0 and if the errlog (checked with errpt -a) contains an
entry with ERROR LABEL: ISI_PROC.
ERROR LABEL: ISI_PROC
...
Detail Data
ISISR
4000 0000
ISIR0
007F FFFF
Segment Register, SEGREG
632E 6108
EXVAL
0000 000E
Hangs
The system is hung if no process can proceed and no external
interrupts are accepted. If the system can receive a ping from
another node on the network, the system is not hung;
instead, the application or screen may be locked up.
proc_lock
kernel_lock
net_lock
> od proc_lock
0002a728: ffffffff
> lock
Viewing the kernel stack trace of any process
> p
SLT ST PID PPID PGRP UID EUID PRI CPU EVENT NAME
0 s 0 0 0 0 0 16 120 swapper
FLAGS: swapped_in no_swap fixed_pri kproc wake/sig
1 s 1 0 0 0 0 60 0 init
FLAGS: swapped_in no_swap wake/sig locks
2 r 202 0 0 0 0 127 120 wait
FLAGS: swapped_in no_swap wake/sig locks
...
12 s cf8 1 662 0 0 60 0 014d6280 cdpg
FLAGS: swapped_in kproc orphanpgrp
...
> trace -k 12
STACK TRACE:
3320 (excpt=04fa5654:40000000:00000000:04fa5654:00000106) (intpri=0)
IAR: .e_wait+15c (0003f384): cror 15,cr15,cr15
LR: .e_wait+15c (0003f384)
2ff7fec0: .e_sleep+120 (0003f674)
2ff7ff20: .[cfs.ext:cdr_pager]+ac (014d5720)
2ff7ff70: .procentry+1c (00032374)
2ff7ffc0: INVALID (00000000)
Viewing the kernel stack trace of the running process
> pcb
USER AREA FOR X (ProcTable Address 0xe3003200)
SAVED MACHINE STATE
curid:0x00003236 m/q:0x00040000 iar:0x014973dc cr:0x48844084
msr:0x000090b0 lr:0x0149712c xer:0x00000004
ctr:0x0008bf8c *bus:0x04fc13c0
*prevmst:0x00000000 *stackfix:0x00000000 intpri:0x0000000b
backtrace:0x00 tid:0x00000000 fpeu:0x01 ecr:0x00000087
.... rest of pcb output deleted ...
> trace -k 50
Forcing a system dump
If the system does not respond to mouse or keypad input, then
it is in a hung state.
Preparing for a forced dump
NOTE: In AIX Version 4.1.4 and later versions, a system dump can be
forced WITHOUT a key switch. The system needs to be
initially configured to use this method. This can be done
through SMIT by following the fast path.
smit dump
Change the Always Allow System Dump attribute to TRUE.
System with LED display, key, and RESET button
----------------------------------------------
Turn the key to service.
Hit reset.
The LED sequence will be 0c9-0c4 or 0c9-0c0.

LED or non-LED machine with NO KEY SWITCH, with AIX 4.1.4 and beyond
--------------------------------------------------------------------
Hit the following key sequence if there is no disk activity:
Ctrl-Alt-1 (on the numpad).
Wait for disk activity to stop.
If a hang occurs, power off the system and proceed as shown
below.
Connect a tape drive to the system and power the system
on.