Wednesday, January 18, 2006

AIX Crash dumps - II

Once you have a crash dump, there are several things you might like to do. If you are fiddling with filesystems, for example, you would like to be able to print vnodes and gnodes.

In kdb, you first need to point the debugger at the header that defines your structure. For struct vnode, that is sys/vnode.h. So we invoke kdb as

# kdb -i /usr/include/sys/vnode.h

and issue the print command, where address is the address of the vnode you want to look at

(0)> print vnode address

which should print something like

struct vnode {
ushort v_flag = 0x0000;
ushort v_flag2 = 0x0000;
ulong32int64_t v_count = 0x00000000;
int v_vfsgen = 0x00000000;
union Simple_lock {
simple_lock_data _slock = 0x00000000;
struct lock_data_instrumented *_slockp = 0x00000000;
} v_lock;
struct vfs *v_vfsp = 0x31349808;
struct vfs *v_mvfsp = 0x00000000;
struct gnode *v_gnode = 0x13C823E0;
struct vnode *v_next = 0x00000000;
struct vnode *v_vfsnext = 0x13987F38;
struct vnode *v_vfsprev = 0x13D1AAE8;
union v_data {
void *_v_socket = 0x00000000;
struct vnode *_v_pfsvnode = 0x00000000;
} _v_data;
char *v_audit = 0x00000000;
} foo[0];
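
If you capture that output in a file, a few lines of Python can pull out the pointers you will want to chase next (v_gnode, v_vfsp and so on). This is only a sketch: parse_struct_fields is a name I made up, it assumes every member is printed as "name = 0x...;" like above, and it flattens nested unions into one namespace.

```python
import re

# Matches member lines such as "struct gnode *v_gnode = 0x13C823E0;"
FIELD_RE = re.compile(r"(\w+)\s*=\s*(0x[0-9A-Fa-f]+)\s*;")

def parse_struct_fields(text):
    """Return {member_name: integer_value} for every 'name = 0x...;' line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields

# A few lines copied from the kdb output above
sample = """\
struct vfs *v_vfsp = 0x31349808;
struct gnode *v_gnode = 0x13C823E0;
struct vnode *v_vfsnext = 0x13987F38;
"""

fields = parse_struct_fields(sample)
print("gnode at %08X" % fields["v_gnode"])
```

The address it prints is the one you would hand to the next print command if you wanted to look at the gnode behind this vnode.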

AIX crash dumps

These are my working notes on crash dump analysis on AIX.

Step 1:
OK, so when the system panics, you will hear the periodic beeps typical of the RS/6000 (if you are sitting close by). The beeps go on while the machine writes the dump. If you do not want a core and have a slow machine, you may prefer to restart right away by pressing the system reset button on the machine.

[ There is a Step 0, where you make the system panic, but the details are left as an exercise to the reader :). See sysdumpstart(1) for more details. ]

Step 2:
When the system boots again, it may prompt you to save the core dump before continuing with the normal boot process. If you have a place to save it (a spare partition), save the dump here; otherwise press 99 to continue.

[ All of this depends on your configuration. Running sysdumpdev lets you examine and alter your dump settings. The primary dump device is /dev/hd6 by default, which is also the AIX paging volume. Other settings include compression and ... (well, RTFM) ]

Step 3: (If you pressed 99 in the last step; otherwise skip)
Once the system is up, use savecore -f /directory_path to save the core in a specific place. This leaves you with a compressed .Z file and a kernel image.

Step 4:
Uncompress the core using uncompress (gunzip also reads .Z files, if you have it installed).

Step 5:
Run kdb as

# kdb vmcore.1 vmunix.1

and you will get something like this from kdb in return

19) mirdd [5 entries]
20) kbddd [2 entries]
21) mousedd [2 entries]
Component Dump Table has 913 entries
0000000000001000 0000000002147040 start+000FD8
000000002FF3B400 000000002FF80A98 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
00000000E0000000 00000000F0000000 sys_resource+000000
raddr.....0000000000C00000 eaddr.....0000000000C00000
size..............00000000 align.............00000000
valid..1 ros....0 fixlmb.1 seg....1 wimg...2

raddr.....0000000001000000 eaddr.....0000000001000000
size..............00000000 align.............00000000
valid..1 ros....0 fixlmb.1 seg....1 wimg...2
Dump analysis on POWER_PC POWER_604 machine with 1 available CPU(s) (32-bit registers)
Processing symbol table...

Step 6:
Now we can see what really caused the panic. Run stat, and you will get some information about the machine and the core

(0)> stat
POWER_PC POWER_604 machine with 1 available CPU(s) (32-bit registers)

sysname... AIX
nodename.. fundu
release... 3
version... 5
build date Apr 10 2005
build time 21:52:04
label..... yes
machine... 00081AAA4C00
nid....... 081AAA4C
time of crash: Wed Jan 18 04:29:05 2006
age of system: 15 hr., 52 min., 38 sec.
xmalloc debug: disabled
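
One small thing you can work out from these two timestamps is when the machine was booted: subtract the age of the system from the time of crash. A quick sketch in Python, assuming the two lines look exactly like the ones above (boot_time is my own helper name, and uptimes that include a day count are not handled):

```python
import re
from datetime import datetime, timedelta

def boot_time(stat_text):
    """Estimate the boot time as (time of crash) minus (age of system)."""
    crash = re.search(r"time of crash:\s*(.+)", stat_text)
    age = re.search(
        r"age of system:\s*(\d+) hr\.,\s*(\d+) min\.,\s*(\d+) sec\.",
        stat_text,
    )
    if not crash or not age:
        raise ValueError("stat output does not contain the expected lines")
    # Timestamp format as printed by stat, e.g. "Wed Jan 18 04:29:05 2006"
    crashed_at = datetime.strptime(crash.group(1).strip(),
                                   "%a %b %d %H:%M:%S %Y")
    hours, minutes, seconds = (int(g) for g in age.groups())
    return crashed_at - timedelta(hours=hours, minutes=minutes,
                                  seconds=seconds)

# The two relevant lines from the stat output above
sample = """\
time of crash: Wed Jan 18 04:29:05 2006
age of system: 15 hr., 52 min., 38 sec.
"""

print(boot_time(sample).strftime("%Y-%m-%d %H:%M:%S"))
# prints 2006-01-17 12:36:27
```

Handy when you want to match the crash against whatever was started or changed since the last reboot.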

And the real thing, a backtrace of the function calls

CPU 0 CSA 2FF3B400 at time of crash, error code for LEDs: 30000000
pvthread+004D80 STACK:
[0000A5E0].test_and_set+000020 ()
[00215020]slock_ppc+000320 (??, ??)
[00009554].simple_lock+000054 ()
[001F52A4]j2_rename+000140 (??, ??, ??, ??, ??, ??, ??)
[029CC17C]sc_vop_rename+000088 (??, ??, ??, ??, ??, ??, ??)
[002F8D54]vnop_rename+0000DC (??, ??, ??, ??, ??, ??, ??)
[0033C4C4]rename+00035C (2FF22DD3, 2FF22DE2)
[00003AD8].sys_call+000000 ()
[kdb_get_memory] no real storage @ 2FF226C8
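
If you collect backtraces from several dumps, it helps to reduce each one to a plain call chain so they can be compared. The sketch below is my own helper, not something kdb provides. It assumes frame lines of the form "[address]function+offset (args)" as shown above, drops the leading dot some symbols carry, and skips everything else (the STACK: header, kdb diagnostics).

```python
import re

FRAME_RE = re.compile(
    r"^\[([0-9A-Fa-f]+)\]"   # instruction address in brackets
    r"\.?(\w+)"              # symbol name, optional leading dot dropped
    r"\+([0-9A-Fa-f]+)"      # offset into the function
    r"\s*\((.*)\)\s*$"       # argument list as printed by kdb
)

def parse_backtrace(text):
    """Return a list of (address, function, offset, args), innermost first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # headers and kdb diagnostics are skipped
        addr, func, offset, args = match.groups()
        arg_list = [a.strip() for a in args.split(",") if a.strip()]
        frames.append((int(addr, 16), func, int(offset, 16), arg_list))
    return frames

# A shortened copy of the backtrace above
sample = """\
pvthread+004D80 STACK:
[0000A5E0].test_and_set+000020 ()
[00009554].simple_lock+000054 ()
[001F52A4]j2_rename+000140 (??, ??, ??, ??, ??, ??, ??)
[0033C4C4]rename+00035C (2FF22DD3, 2FF22DE2)
[kdb_get_memory] no real storage @ 2FF226C8
"""

frames = parse_backtrace(sample)
# Show the chain from the system call down to where the machine died
print(" -> ".join(frame[1] for frame in reversed(frames)))
```

For the sample this gives rename -> j2_rename -> simple_lock -> test_and_set, which is easy to grep for across a pile of saved backtraces.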

There are more steps, which I will add once I am done debugging the core at hand.

[ As a side note: the stat command is also the AIX equivalent of dmesg, which you will certainly miss if you have used other Unixes such as Linux and HP-UX. It shows the last few kernel printfs still in memory, including messages generated using bsdlog. ]