AIX PowerHA Node DOWN: Do not plug both adapters into the same network switch
Problem
One of our cluster nodes was down. This shutdown was not caused by any human action or error.
- Here are the errpt entries showing the problem. Three of them (C69F5C9B, 9DEC29E1, EC0BCCD4) explain it:
# errpt | more
F3931284   0412224712 I H ent1           ETHERNET NETWORK RECOVERY MODE
C69F5C9B   0412224612 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
6D19271E   0412224612 I O topsvcs        Topology Services daemon stopped
AA8AB241   0412224612 T O OPERATOR       OPERATOR NOTIFICATION
BC3BE5A3   0412224612 P S SRC            SOFTWARE PROGRAM ERROR
BC3BE5A3   0412224612 P S SRC            SOFTWARE PROGRAM ERROR
CB4A951F   0412224612 I S SRC            SOFTWARE PROGRAM ERROR
12081DC6   0412224612 P S haemd          SOFTWARE PROGRAM ERROR
9DEC29E1   0412224612 P O grpsvcs        Group Services daemon exit to merge doma
F3931284   0412224612 I H ent0           ETHERNET NETWORK RECOVERY MODE
173C787F   0412224012 I S topsvcs        Possible malfunction on local adapter
173C787F   0412224012 I S topsvcs        Possible malfunction on local adapter
EC0BCCD4   0412223912 T H ent0           ETHERNET DOWN
EC0BCCD4   0412223912 T H ent1           ETHERNET DOWN
- The first two entries (chronologically, at the bottom of the listing) are ETHERNET DOWN (EC0BCCD4).
- The next entry is Group Services daemon exit to merge domains (9DEC29E1).
- The last entry is SOFTWARE PROGRAM ABNORMALLY TERMINATED (C69F5C9B).
- This last entry is in fact a CORE_DUMP of the cluster manager process (clstrmgr).
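The same triage can be scripted. Below is a minimal sketch: since a live errpt needs an AIX node, it works on a captured sample of the errpt summary (lines taken from the listing above), keeping only the three relevant identifiers and printing them oldest first (field 2 is an MMDDhhmmYY timestamp, so a plain sort works within a single year):

```shell
# Sample errpt summary captured from the failing node.
cat > /tmp/errpt.sample <<'EOF'
F3931284 0412224712 I H ent1 ETHERNET NETWORK RECOVERY MODE
C69F5C9B 0412224612 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
9DEC29E1 0412224612 P O grpsvcs Group Services daemon exit to merge doma
173C787F 0412224012 I S topsvcs Possible malfunction on local adapter
EC0BCCD4 0412223912 T H ent0 ETHERNET DOWN
EC0BCCD4 0412223912 T H ent1 ETHERNET DOWN
EOF

# Keep only the three identifiers of interest, oldest entry first.
grep -E '^(C69F5C9B|9DEC29E1|EC0BCCD4)' /tmp/errpt.sample | sort -k2,2
```

This prints the ETHERNET DOWN entries first, confirming the order of failure. On the node itself you would pipe `errpt` directly instead of using a sample file.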
# errpt -a -j C69F5C9B | more
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     C69F5C9B

Date/Time:       Thu Apr 12 22:46:59 2012
Sequence Number: 203279
Machine Id:      00C8A6104C00
Node Id:         proas6c2
Class:           S
Type:            PERM
Resource Name:   SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
6
USER'S PROCESS ID:
1962144
FILE SYSTEM SERIAL NUMBER
4
INODE NUMBER
0 396
CORE FILE NAME
/var/hacmp/core
PROGRAM NAME
clstrmgr
STACK EXECUTION DISABLED
0
COME FROM ADDRESS REGISTER
PROCESSOR ID
hw_fru_id: N/A
hw_cpu_id: N/A
ADDITIONAL INFORMATION
pthread_k 88
??
_p_raise 8C
raise 30
abort B8
die__Fi 5A8
announcem 330
kill_grp_ 158
ha_gs_dis 2E24
ha_gs_dis 50
DoMainLoo 768
main 804
__start 9C

Symptom Data
REPORTABLE
1
INTERNAL ERROR
1
SYMPTOM CODE
PIDS/5765E6200 LVLS/520 PCSS/SPI2 FLDS/clstrmgr SIG/6 FLDS/die__Fi VALU/5a8
- You can have a look at the clstrmgr log file (/var/hacmp/log/clstrmgr.debug):
# cd /var/hacmp/log
# tail -5 clstrmgr.debug.1
Thu Apr 12 22:46:57 announcementCb: GsToken 2, AdapterToken 3, rm_GsToken 1
Thu Apr 12 22:46:57 announcementCb: GRPSVCS announcment code=512; exiting
Thu Apr 12 22:46:57 CHECK FOR FAILURE OF RSCT SUBSYSTEMS (topsvcs or grpsvcs)
Thu Apr 12 22:46:57 die: clstrmgr on node 2 is exiting with code 4
- The last two entries are the most interesting:
- why clstrmgr dumped core: an RSCT subsystem was down, namely grpsvcs;
- as a consequence, clstrmgr is exiting.
- For a deeper look in errpt, check the GS_DOM_MERGE_ER entry:
# errpt -a -j 9DEC29E1 | more
---------------------------------------------------------------------------
LABEL:          GS_DOM_MERGE_ER
IDENTIFIER:     9DEC29E1

Date/Time:       Fri Apr 13 00:06:18 2012
Sequence Number: 15236
Machine Id:      00C8A6104C00
Node Id:         proas8c2
Class:           O
Type:            PERM
Resource Name:   grpsvcs

Description
Group Services daemon exit to merge domains

Probable Causes
Network between two node groups has repaired

Failure Causes
Network communication has been blocked.
Topology Services has been partitioned.

        Recommended Actions
        Check the network connection.
        Check the Topology Services.
        Verify that Group Services daemon has been restarted
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
RSCT,NS.C,220.127.116.11,4461
ERROR ID
6Vb0vR0O5pVD/wap09...4....................
REFERENCE CODE

DIAGNOSTIC EXPLANATION
NS::Ack(): The master requests to dissolve my domain because of the merge with other domain 1.15
- Network communication was blocked, resulting in:
- an exit of grpsvcs,
- then a clstrmgr CORE_DUMP,
- then a node halt.
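The RSCT subsystem state is the thing to watch here: clstrmgr aborts as soon as grpsvcs is gone. On a live node you would simply run lssrc -a; the sketch below works on a saved copy of that output (the subsystem names are real AIX/RSCT names, but the PIDs and states shown are illustrative):

```shell
# Sample 'lssrc -a' output saved from a node; states are illustrative.
cat > /tmp/lssrc.sample <<'EOF'
 topsvcs          topsvcs          270562       active
 grpsvcs          grpsvcs                       inoperative
 clstrmgrES       cluster          1831042      active
EOF

# Flag any RSCT subsystem that is not active; if grpsvcs is down,
# clstrmgr will exit and clexit.rc will halt the node.
awk '($1=="topsvcs" || $1=="grpsvcs") && $NF!="active" {print $1, "is", $NF}' /tmp/lssrc.sample
```

On a healthy cluster node this check prints nothing; here it would report that grpsvcs is inoperative.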
- Check Redbook sg247739, page 264: http://www.redbooks.ibm.com/abstracts/sg247739.html
If the cluster manager exits abnormally, a machine will typically halt. The majority of the time, some type of an exit message will be logged at the end of this file. The message can give you or your support representatives an idea as to the cause of the failure. The AIX® resource controller subsystem monitors the cluster manager daemon process. If the controller detects that the Cluster Manager daemon has exited abnormally (without being shut down using the clstop command), it executes the /usr/es/sbin/cluster/utilities/clexit.rc script to halt the system. This prevents unpredictable behavior from corrupting the data on the shared disks.
- OK, so what is in /usr/es/sbin/cluster/utilities/clexit.rc?
[..]
# Do a sync, then a short sleep to attempt to flush the messages
# we just logged to disk, and allow background processes to complete.
# Because the secondary node will start taking over the resources
# very quickly, we can't wait indefinitely. This node must be halted
# to avoid conflict over the resources.
sync &
sleep 2

# halt the node
[[ "$PLATFORM" = "__AIX__" ]] && halt -q
[..]
- So, is this halt -q normal?
Solution
Yes, this halt is normal PowerHA behaviour. The problem was not actually a problem.
However, if you really want to change this behaviour (not recommended), you can do so by editing /etc/cluster/hacmp.term.
- Check Redbook sg247739, page 320: http://www.redbooks.ibm.com/abstracts/sg247739.html
Edit the /etc/cluster/hacmp.term file to change the default action taken after an abnormal exit. The clexit.rc script checks for the presence of this file and, if the file is executable, calls it instead of halting the system automatically.
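As an illustration only, here is a hypothetical hacmp.term sketch (not IBM-supplied content). It logs the abnormal exit and then still halts, because skipping the halt risks both nodes owning the shared resources at once while the peer is taking over:

```shell
#!/bin/ksh
# Hypothetical /etc/cluster/hacmp.term sketch. clexit.rc calls this file
# instead of running 'halt -q' itself, when the file exists and is executable.
# WARNING: not halting here risks two nodes writing to the shared disks.

LOG=/var/hacmp/log/hacmp.term.log
print "$(date) : clstrmgr exited abnormally on $(hostname)" >> "$LOG"

# Give the log write a chance to reach disk, then halt as clexit.rc would.
sync
sleep 2
halt -q
```

The log path and message are assumptions for the example; the only contract clexit.rc imposes is that the file be executable.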
After checking with the network teams, it turned out both adapters were plugged into the same network switch. Plugging the network adapters into different switches will avoid this problem.
Do not forget to run test cases before going into production with PowerHA.