Sometimes sh!t happens! Today I’d like to show you how to troubleshoot NetScaler High Availability (HA) sync failures. The resolution is right at the bottom of the article 🙂
Issue: On the NetScaler GUI | System | High Availability | “Synchronization State FAILED”
There could be more than one reason as to why this happens…
- Layer 2 problems (MAC moves or loops) – for example, a server guy hooking up the appliance to switches with 2 cables in a redundant way without turning ON Link Aggregation (EtherChannel).
- RPC node password mismatch – NetScaler uses RPC node passwords for system-to-system communication. This doesn’t need to be the same as your NSROOT password. You could also use a system-generated one.
- Tagging the NSVLAN and forgetting to do the same on the switch side – this will lead to heartbeat packets being dropped!
- NetScaler HA Synchronization Service isn’t running (NSNETSVC)
- HA ports blocked – UDP port 3003, non-secure TCP port 3010 or secure TCP port 3008 – see the Citrix article for more info: http://support.citrix.com/article/CTX109687. HA packets are sent from the NSIP, not the SNIP.
- Appliance Firmware and Model Mismatch!
Troubleshooting (Some common culprits)
NetScaler HA Synchronization Service isn’t running (NSNETSVC)?
Here I’m looking at the running processes (or daemons) for nsnetsvc, and as you can see the process is listed, so the service is running.
root@ns# ps auwx | grep -i nsnetsvc
root 1083 0.0 0.3 63068 26316 ?? Ss 9Feb15 830:29.25 /netscaler/nsnetsvc -S -C
root 13897 0.0 0.0 9096 1136 0 S+ 1:18PM 0:00.00 grep -i nsnetsvc
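One small gotcha with the check above: the grep itself shows up in the results, so "something matched" doesn't always mean the daemon is up. Here is a minimal sketch that only looks at the COMMAND column of the ps output. The function name proc_listed is my own (hypothetical), and it assumes the 11-column `ps auwx` layout shown above.

```shell
# Hypothetical helper: succeed only if a process whose COMMAND column
# matches the given name appears in `ps auwx` output read from stdin.
# Assumes the 11-column layout shown above, where field 11 is the command.
proc_listed() {
    awk -v name="$1" '
        $11 ~ name { found = 1 }   # field 11 of the grep line is "grep", so it never matches
        END { exit !found }        # exit status 0 when found, 1 otherwise
    '
}

# Example usage on the appliance shell:
#   ps auwx | proc_listed nsnetsvc && echo "nsnetsvc is running"
```

Because it matches on the command column rather than the whole line, the `grep -i nsnetsvc` entry is ignored.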
Are the TCP/UDP Ports blocked? Use NSTCPDUMP
I’m running nstcpdump.sh against the secondary HA node to confirm that the ports can be reached. You could telnet to the TCP ports as well (the NSIP is used for HA; remember, the SNIP is used for back-end services).
root@ns# nstcpdump.sh -c 8 -nn host 10.99.99.17
reading from file -, link-type EN10MB (Ethernet)
13:42:56.053129 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
13:42:56.053130 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
13:42:56.053130 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
13:42:56.112268 IP 10.99.99.17.3003 > 10.99.99.16.3003: UDP, length 272
13:42:56.112299 IP 10.99.99.17.3003 > 10.99.99.16.3003: UDP, length 272
13:42:56.112328 IP 10.99.99.17.3003 > 10.99.99.16.3003: UDP, length 272
13:42:56.253119 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
13:42:56.253119 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
Note: My tcpdump options are -c (how many packets to capture), -nn (display all ports and IP addresses in numerical form) and host (to match traffic to or from a node, here the secondary HA node). You can also combine filters, e.g. “host XX.XX.XX.XX and port 3003”.
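If you don't fancy eyeballing the dump, here is a rough sketch that counts heartbeat packets per source IP, so you can confirm traffic is flowing in both directions. The function name heartbeat_directions is hypothetical, and it assumes the text output format shown above (timestamp, "IP", source.port, ">", destination.port).

```shell
# Hypothetical helper: read nstcpdump.sh text output on stdin and print
# "<source-ip> <packet-count>" for packets sent from the given source port.
# $1: source port to match (defaults to 3003, the HA heartbeat port).
heartbeat_directions() {
    awk -v port="${1:-3003}" '
        $2 == "IP" && $4 == ">" && $3 ~ ("\\." port "$") {
            src = $3
            sub(/\.[0-9]+$/, "", src)   # strip the trailing .port from the source
            count[src]++
        }
        END { for (ip in count) print ip, count[ip] }
    ' | sort
}

# Example usage on the appliance shell:
#   nstcpdump.sh -c 8 -nn host 10.99.99.17 and port 3003 | heartbeat_directions
```

If only one IP shows up in the output, heartbeats are leaving one node but nothing is coming back, which points at a blocked port or a VLAN problem rather than the RPC password.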
As you can see from the dump, my nodes are able to communicate successfully over UDP 3003 (and telnet works too). So what else could be the problem? My appliances are on the same firmware, there are no L2 loops (LA channels exist), the NSVLAN is the default, etc. Could it be the RPC node password??
Cause: RPCNode Password was invalid
Now I drop into the shell and view auth.log (cat /var/log/auth.log). Note the failed and invalid password entries coming from the other HA node:
Apr 20 12:37:26 ns sshd[11510]: Accepted password for #nsinternal# from 10.99.99.17 port 18412 ssh2
Apr 20 12:37:26 ns sshd[11511]: Failed password for #nsinternal# from 10.99.99.17 port 37456 ssh2
Apr 20 12:37:26 ns sshd[11511]: Accepted password for #nsinternal# from 10.99.99.17 port 37456 ssh2
Apr 20 12:37:26 ns sshd[11510]: Received disconnect from 10.99.99.17: 11: disconnected by user
Apr 20 12:37:26 ns sshd[11511]: Received disconnect from 10.99.99.17: 11: disconnected by user
Apr 20 12:37:32 ns sshd[11519]: error: Invalid username or password
Apr 20 12:37:32 ns sshd[11520]: error: Invalid username or password
Apr 20 12:37:40 ns sshd[11531]: error: Invalid username or password
Apr 20 12:37:40 ns sshd[11532]: error: Invalid username or password
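On a busy appliance auth.log gets long, so rather than scrolling through it you can count the failures. A minimal sketch, assuming the two message formats seen in the excerpt above. The function name count_rpc_auth_failures is hypothetical, and the "Invalid username or password" lines don't name the user, so treat the count as a hint rather than proof.

```shell
# Hypothetical helper: count authentication failures in an auth.log-style file.
# Matches the two failure messages seen in the excerpt above.
# $1: path to the log file (defaults to /var/log/auth.log).
count_rpc_auth_failures() {
    grep -c -E 'Failed password for #nsinternal#|error: Invalid username or password' \
        "${1:-/var/log/auth.log}"
}

# Example usage on the appliance shell:
#   count_rpc_auth_failures
```

A non-zero count that keeps growing while HA sync is in the FAILED state is a good sign that the RPC node password is the culprit.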
Resolution: Reset the RPCNode Password.
From NetScaler GUI | System | Network | RPC – right-click on your NSIP (primary node) and type in a password (or let the system auto-generate one).
Type in the same password for the remote HA node IP (still on the primary appliance). Save the configuration, then verify the HA status (Synchronization State should be SUCCESS).
Update 20/04/2015 – I had to log on to the secondary appliance as well and reset the RPC password as above (for both the NSIP and the remote node IP ONLY).
Hope this helps in your journey of getting to grips with NetScaler troubleshooting…