Wednesday, January 18, 2006

AIX crash dumps

These are my working notes on crash dump analysis on AIX.

Step 1:
OK, so when the system panics, you will hear the periodic beeps typical of the RS/6000 (if you are sitting close by). The beeps go on for as long as the machine is writing the dump. If you do not want the dump and have a slow machine, you can cut this short by pressing the system reset button on the machine.

[ There is a Step 0, where you make the system panic, but the details are left as an exercise to the reader :). See sysdumpstart(1) for more details. ]

Step 2:
When the system boots again, it may prompt you to save the core dump before continuing with the normal boot process. If you have a place to save it (a spare partition, say), save the dump at this point; otherwise press 99 to continue.

[ All of this depends on your configuration. Running sysdumpdev will allow you to examine and alter your dump settings. The Primary dump device is /dev/hd6 by default, which is also the AIX paging volume. Other settings include compression and ... (well, RTFM) ]

Step 3: (only if you pressed 99 in the last step; otherwise skip)
Once the system is up, use savecore -f /directory_path to save the dump in a specific place. This leaves you with a compressed .Z file and a kernel image.

Step 4:
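Uncompress the dump. The .Z suffix is the compress(1) format, so uncompress handles it; gunzip can also read .Z files if you have it installed.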

Step 5:
Run kdb as

# kdb vmcore.1 vmunix.1

And you will get something like this from kdb in return:

19) mirdd [5 entries]
20) kbddd [2 entries]
21) mousedd [2 entries]
Component Dump Table has 913 entries
0000000000001000 0000000002147040 start+000FD8
000000002FF3B400 000000002FF80A98 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
00000000E0000000 00000000F0000000 sys_resource+000000
raddr.....0000000000C00000 eaddr.....0000000000C00000
size..............00000000 align.............00000000
valid..1 ros....0 fixlmb.1 seg....1 wimg...2

raddr.....0000000001000000 eaddr.....0000000001000000
size..............00000000 align.............00000000
valid..1 ros....0 fixlmb.1 seg....1 wimg...2
Dump analysis on POWER_PC POWER_604 machine with 1 available CPU(s) (32-bit registers)
Processing symbol table...

Step 6:
Now we can see what really caused the panic. Run stat, and you will get some info on the machine and the dump:

(0)> stat
POWER_PC POWER_604 machine with 1 available CPU(s) (32-bit registers)

sysname... AIX
nodename.. fundu
release... 3
version... 5
build date Apr 10 2005
build time 21:52:04
label..... yes
machine... 00081AAA4C00
nid....... 081AAA4C
time of crash: Wed Jan 18 04:29:05 2006
age of system: 15 hr., 52 min., 38 sec.
xmalloc debug: disabled
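
The two time fields in the stat output are enough to work out when the machine was last booted: subtract the age of the system from the time of the crash. A minimal Python sketch of that arithmetic follows. The helper name boot_time is made up for illustration, and the parsing assumes the exact field layout shown above (hours, minutes, seconds only); a longer uptime may well be printed differently.

```python
import re
from datetime import datetime, timedelta

# The two relevant lines, copied from the stat output above.
STAT_OUTPUT = """\
time of crash: Wed Jan 18 04:29:05 2006
age of system: 15 hr., 52 min., 38 sec.
"""

def boot_time(stat_text):
    """Return crash time minus system age (hypothetical helper).

    Assumes the 'hr., min., sec.' layout seen in this one dump.
    """
    crash = re.search(r"time of crash:\s*(.+)", stat_text).group(1).strip()
    age = re.search(
        r"age of system:\s*(\d+) hr\., (\d+) min\., (\d+) sec\.", stat_text
    )
    crashed_at = datetime.strptime(crash, "%a %b %d %H:%M:%S %Y")
    hours, minutes, seconds = (int(x) for x in age.groups())
    return crashed_at - timedelta(hours=hours, minutes=minutes, seconds=seconds)

if __name__ == "__main__":
    print("booted at:", boot_time(STAT_OUTPUT))
```

Knowing the boot time is handy when you want to line the crash up against whatever was started or loaded after the last reboot.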

And the real thing, a backtrace of the function calls:

CPU 0 CSA 2FF3B400 at time of crash, error code for LEDs: 30000000
pvthread+004D80 STACK:
[0000A5E0].test_and_set+000020 ()
[00215020]slock_ppc+000320 (??, ??)
[00009554].simple_lock+000054 ()
[001F52A4]j2_rename+000140 (??, ??, ??, ??, ??, ??, ??)
[029CC17C]sc_vop_rename+000088 (??, ??, ??, ??, ??, ??, ??)
[002F8D54]vnop_rename+0000DC (??, ??, ??, ??, ??, ??, ??)
[0033C4C4]rename+00035C (2FF22DD3, 2FF22DE2)
[00003AD8].sys_call+000000 ()
[kdb_get_memory] no real storage @ 2FF226C8
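
The trace reads innermost frame first: rename called vnop_rename, which went through sc_vop_rename into j2_rename, and the crash happened while j2_rename was taking a simple lock. To make the frame format easier to work with, here is a small Python sketch that splits each line into address, symbol, offset and arguments. The function name parse_frames is invented for this example, and it assumes that the bracketed address equals symbol plus offset, so that subtracting the offset gives the start of the function.

```python
import re

# Frames copied from the kdb backtrace above.
BACKTRACE = """\
[0000A5E0].test_and_set+000020 ()
[00215020]slock_ppc+000320 (??, ??)
[00009554].simple_lock+000054 ()
[001F52A4]j2_rename+000140 (??, ??, ??, ??, ??, ??, ??)
[029CC17C]sc_vop_rename+000088 (??, ??, ??, ??, ??, ??, ??)
[002F8D54]vnop_rename+0000DC (??, ??, ??, ??, ??, ??, ??)
[0033C4C4]rename+00035C (2FF22DD3, 2FF22DE2)
[00003AD8].sys_call+000000 ()
"""

# [address] optional-dot symbol + offset (arguments)
FRAME_RE = re.compile(
    r"^\[([0-9A-Fa-f]+)\]\.?(\w+)\+([0-9A-Fa-f]+)\s*\((.*)\)\s*$"
)

def parse_frames(text):
    """Parse kdb stack lines into dicts (hypothetical helper)."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # skip headers and kdb diagnostics
        addr = int(match.group(1), 16)
        offset = int(match.group(3), 16)
        args = [a.strip() for a in match.group(4).split(",") if a.strip()]
        frames.append({
            "addr": addr,
            "symbol": match.group(2),
            "offset": offset,
            # Assumption: address = function start + offset.
            "func_start": addr - offset,
            "args": args,
        })
    return frames

frames = parse_frames(BACKTRACE)

if __name__ == "__main__":
    for frame in frames:
        print("%-15s starts at %08X, args %s"
              % (frame["symbol"], frame["func_start"], frame["args"]))
```

Only the rename frame carries real argument values here; the ?? entries are arguments kdb could not recover.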

There are more steps, which I will add once I am done debugging the dump at hand.

[ As a side note: the stat command is also the AIX equivalent of dmesg, which you will certainly miss if you have used other Unixes, such as Linux and HP-UX. It shows you the last few kernel printfs still in memory, and that includes messages generated using bsdlog. ]


Anonymous said...

what ur trying debug?

thruput said...

Nothing in particular. I write kernel modules for AIX, and often get crashes to debug, of my own making.

Anonymous said...

FYI : this site looks like binary content in IE 6 :)

Anonymous said...

"CPU 0 CSA 2FF3B400 at time of crash, error code for LEDs: 30000000 "

So what can one infer from this CRASH INFO, finally? What is CSA?

Mike said...