AIX PowerHA node down: do not plug both adapters into the same network switch
id : dawg2dfqj3
category : computer
blog : unix
created : 04/18/12 - 18:13:43

Problem
One of our cluster nodes went down. The shutdown was not caused by a human action or a human error.
Analysis
  • Here are the errpt entries showing the problem. Three of them (C69F5C9B, 9DEC29E1, EC0BCCD4) explain what happened:
# errpt | more
F3931284   0412224712 I H ent1           ETHERNET NETWORK RECOVERY MODE
C69F5C9B   0412224612 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
6D19271E   0412224612 I O topsvcs        Topology Services daemon stopped
AA8AB241   0412224612 T O OPERATOR       OPERATOR NOTIFICATION
BC3BE5A3   0412224612 P S SRC            SOFTWARE PROGRAM ERROR
BC3BE5A3   0412224612 P S SRC            SOFTWARE PROGRAM ERROR
CB4A951F   0412224612 I S SRC            SOFTWARE PROGRAM ERROR
12081DC6   0412224612 P S haemd          SOFTWARE PROGRAM ERROR
9DEC29E1   0412224612 P O grpsvcs        Group Services daemon exit to merge doma
F3931284   0412224612 I H ent0           ETHERNET NETWORK RECOVERY MODE
173C787F   0412224012 I S topsvcs        Possible malfunction on local adapter
173C787F   0412224012 I S topsvcs        Possible malfunction on local adapter
EC0BCCD4   0412223912 T H ent0           ETHERNET DOWN
EC0BCCD4   0412223912 T H ent1           ETHERNET DOWN

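    • errpt lists the newest entry first, so read the list from the bottom up. The first two entries in time are ETHERNET DOWN (EC0BCCD4), one for each adapter (ent0 and ent1).
    • The second event is Group Services daemon exit to merge domain (9DEC29E1).
    • The last event is SOFTWARE PROGRAM ABNORMALLY TERMINATED (C69F5C9B).
  • This last entry is in fact a core dump (CORE_DUMP) of the cluster manager process (clstrmgr):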
# errpt -a -j C69F5C9B | more
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     C69F5C9B

Date/Time:       Thu Apr 12 22:46:59 2012
Sequence Number: 203279
Machine Id:      00C8A6104C00
Node Id:         proas6c2
Class:           S
Type:            PERM
Resource Name:   SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
           6
USER'S PROCESS ID:
              1962144
FILE SYSTEM SERIAL NUMBER
           4
INODE NUMBER
           0         396
CORE FILE NAME
/var/hacmp/core
PROGRAM NAME
clstrmgr
STACK EXECUTION DISABLED
           0
COME FROM ADDRESS REGISTER

PROCESSOR ID
  hw_fru_id: N/A
  hw_cpu_id: N/A

ADDITIONAL INFORMATION
pthread_k 88
??
_p_raise 8C
raise 30
abort B8
die__Fi 5A8
announcem 330
kill_grp_ 158
ha_gs_dis 2E24
ha_gs_dis 50
DoMainLoo 768
main 804
__start 9C

Symptom Data
REPORTABLE
1
INTERNAL ERROR
1
SYMPTOM CODE
PIDS/5765E6200 LVLS/520 PCSS/SPI2 FLDS/clstrmgr SIG/6 FLDS/die__Fi VALU/5a8
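
The detailed report above is long, and only a few fields matter for this diagnosis. As a minimal sketch, the helper below pulls them out of `errpt -a` output read on stdin. The function name `errpt_summary` is hypothetical, and the parsing assumes the layout shown above: `LABEL:` has its value on the same line, while `SIGNAL NUMBER`, `CORE FILE NAME` and `PROGRAM NAME` have theirs on the following line.

```shell
#!/bin/sh
# Hypothetical helper: summarise one detailed errpt report read from stdin.
# Assumes the field layout of the report shown above.
# Usage on AIX: errpt -a -j C69F5C9B | errpt_summary
errpt_summary() {
    awk '
        /^LABEL:/         { print "label=" $2 }
        want != ""        { print want "=" $1; want = ""; next }  # value line
        /^SIGNAL NUMBER/  { want = "signal" }
        /^CORE FILE NAME/ { want = "core" }
        /^PROGRAM NAME/   { want = "program" }
    '
}
```

Signal 6 is SIGABRT, which is consistent with the `abort` frame in the stack trace of the report.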

  • Look at PROGRAM NAME: clstrmgr is the faulty process.
  • You can also look in the clstrmgr log file (/var/hacmp/log/clstrmgr.debug):
# cd /var/hacmp/log
# tail -5 clstrmgr.debug.1
Thu Apr 12 22:46:57 announcementCb: GsToken 2, AdapterToken 3, rm_GsToken 1
Thu Apr 12 22:46:57 announcementCb: GRPSVCS announcment code=512; exiting
Thu Apr 12 22:46:57 CHECK FOR FAILURE OF RSCT SUBSYSTEMS (topsvcs or grpsvcs)
Thu Apr 12 22:46:57 die: clstrmgr on node 2 is exiting with code 4

  • The last two lines are the most interesting:
    • Why clstrmgr dumped core: an RSCT subsystem was down, so grpsvcs was down too.
    • clstrmgr is exiting.
  • For a deeper look in errpt, check the GS_DOM_MERGE_ER entry:
# errpt -a -j 9DEC29E1 | more
---------------------------------------------------------------------------
LABEL:          GS_DOM_MERGE_ER
IDENTIFIER:     9DEC29E1

Date/Time:       Fri Apr 13 00:06:18 2012
Sequence Number: 15236
Machine Id:      00C8A6104C00
Node Id:         proas8c2
Class:           O
Type:            PERM
Resource Name:   grpsvcs

Description
Group Services daemon exit to merge domains

Probable Causes
Network between two node groups has repaired

Failure Causes
Network communication has been blocked.
Topology Services has been partitioned.

        Recommended Actions
        Check the network connection.
Check the Topology Services.
Verify that Group Services daemon has been restarted
Call IBM Service if problem persists

Detail Data
DETECTING MODULE
RSCT,NS.C,1.107.1.49,4461
ERROR ID
6Vb0vR0O5pVD/wap09...4....................
REFERENCE CODE

DIAGNOSTIC EXPLANATION
NS::Ack(): The master requests to dissolve my domain because of the merge with other domain 1.15

  • Network communication was blocked, resulting in:
    • an exit of grpsvcs,
    • then a clstrmgr CORE_DUMP,
    • then a node halt.
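
This chain of events is easier to follow when the errpt summary is read oldest-first, because errpt prints the newest entry at the top. A minimal sketch, assuming the summary format shown earlier (timestamps are MMDDhhmmYY); the function name `errpt_chrono` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print an errpt summary oldest-first.
# The header line, if present, stays on top; blank lines are dropped.
# Usage on AIX: errpt | errpt_chrono
errpt_chrono() {
    awk '
        /^IDENTIFIER/ { print; next }      # header of the summary report
        NF            { line[++n] = $0 }   # buffer every non-empty entry
        END           { for (i = n; i >= 1; i--) print line[i] }
    '
}
```

Applied to the listing at the top of this post, the output starts with the two ETHERNET DOWN entries at 22:39 and ends with the clstrmgr core dump at 22:46 and the recovery of ent1 at 22:47.
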
If the cluster manager exits abnormally, a machine will typically halt. The
majority of the time, some type of an exit message will be logged at the end of
this file. The message can give you or your support representatives an idea
as to the cause of the failure.
The AIX® resource controller subsystem monitors the cluster manager daemon process. If the controller detects that the Cluster Manager daemon has exited abnormally (without being shut down using the clstop command), it executes the /usr/es/sbin/cluster/utilities/clexit.rc script to halt the system. This prevents unpredictable behavior from corrupting the data on the shared disks.

  • OK, so what is in /usr/es/sbin/cluster/utilities/clexit.rc?
[..]
# Do a sync, then a short sleep to attempt to flush the messages
# we just logged to disk, and allow background processes to complete.
# Because the secondary node will start taking over the resources
# very quickly, we can't wait indefinitely.  This node must be halted
# to avoid conflict over the resources.
sync &
sleep 2
# halt the node
[[ "$PLATFORM" = "__AIX__" ]] && halt -q
[..]

  • So, is this halt -q normal?
Solution
Yes, this halt is normal PowerHA behaviour. The node halt itself was not a malfunction.
Anyway, if you really want to change this behaviour (not recommended), you can do so by editing /etc/cluster/hacmp.term:
Edit the /etc/cluster/hacmp.term file to change the default action after
an abnormal exit. The clexit.rc script checks for the presence of this file
and, if the file is executable, the script calls it instead of halting the
system automatically.
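
To see which of the two outcomes applies on a given node, the rule quoted above can be checked by hand. The sketch below only mirrors that rule (file present and executable means it is called instead of the halt). It is not the code of clexit.rc, and the function name `term_action` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report what an abnormal clstrmgr exit would lead to,
# following the documented rule. Illustration only, not taken from clexit.rc.
# Usage: term_action            (checks /etc/cluster/hacmp.term)
#        term_action /some/path (checks another file, e.g. for testing)
term_action() {
    term_file=${1:-/etc/cluster/hacmp.term}
    if [ -f "$term_file" ] && [ -x "$term_file" ]; then
        echo "custom: $term_file would be called instead of halt"
    else
        echo "default: node would be halted"
    fi
}
```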

After checking with the network team, it turned out that both adapters were plugged into the same network switch. Plugging the adapters into different switches avoids this problem.
Do not forget to run test cases before going into production with PowerHA.
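
In the errpt listing at the top of this post, ent0 and ent1 both logged ETHERNET DOWN (EC0BCCD4) at the same timestamp. As a rough sketch, the helper below flags that pattern in an errpt summary read on stdin. The function name `shared_outage` is hypothetical, and a match is only a hint of a shared point of failure, not proof that the adapters share a switch.

```shell
#!/bin/sh
# Hypothetical helper: flag timestamps at which more than one adapter
# logged ETHERNET DOWN (identifier EC0BCCD4 in the listing above).
# Columns assumed: identifier, timestamp, type, class, resource name.
# Usage on AIX: errpt | shared_outage
shared_outage() {
    awk '
        $1 == "EC0BCCD4" {
            count[$2]++
            names[$2] = (names[$2] == "" ? $5 : names[$2] "," $5)
        }
        END {
            for (ts in count)
                if (count[ts] > 1)
                    print ts " " count[ts] " adapters down together: " names[ts]
        }
    '
}
```

No output means no two adapters went down within the same minute.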