NetScaler High Availability – Sync FAILED

Sometimes sh!t happens! Today I’d like to show you how to troubleshoot a NetScaler High Availability (HA) sync failure. The resolution is right at the bottom of the article 🙂

Issue: On the NetScaler GUI | System | High Availability | “Synchronization State FAILED”

There could be more than one reason as to why this happens…

  1. Layer 2 problems (MAC moves or loops): for example, a server guy hooking up the appliance to the switches with 2 cables in a redundant way without turning ON link aggregation (EtherChannel).
  2. RPC node password mismatch: NetScaler uses RPC node passwords for system-to-system communication. This doesn’t need to be the same as your NSROOT password. You could also use a system-generated one.
  3. Tagging the NSVLAN and forgetting to do the same on the switch side: this will cause heartbeat packets to be dropped!!
  4. The NetScaler HA synchronization service (nsnetsvc) isn’t running.
  5. HA ports blocked: UDP port 3003, non-secure TCP port 3010 or secure TCP port 3008. Follow the Citrix article for more info: http://support.citrix.com/article/CTX109687. HA packets are sent from the NSIP, not the SNIP.
  6. Appliance firmware or model mismatch!

Troubleshooting (Some common culprits)

Is the NetScaler HA Synchronization Service (nsnetsvc) running?

Here I’m looking through the running processes (daemons) for nsnetsvc and, as you can see, the daemon is up. The second line of output is just the grep command matching itself.

root@ns# ps auwx | grep -i nsnetsvc
root       1083  0.0  0.3 63068 26316  ??  Ss    9Feb15 830:29.25 /netscaler/nsnetsvc -S -C
root      13897  0.0  0.0  9096  1136   0  S+    1:18PM   0:00.00 grep -i nsnetsvc
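
If you want a yes/no answer instead of eyeballing the output, here is a minimal sketch, assuming a POSIX shell with grep on the appliance. The function name nsnetsvc_running is my own invention for illustration, it is not a NetScaler tool. It just ignores the line where grep matches itself.

```shell
# Hypothetical helper (not part of NetScaler): reads `ps auwx` output on stdin
# and succeeds only if a line other than the grep command itself mentions
# nsnetsvc.
nsnetsvc_running() {
  grep -i 'nsnetsvc' | grep -v 'grep' >/dev/null
}

# Usage on the appliance shell:
#   ps auwx | nsnetsvc_running && echo "nsnetsvc is running"
```

If nothing is printed, the daemon is not in the process list and the sync service is worth a closer look before moving on to the network checks.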

Are the TCP/UDP Ports blocked? Use NSTCPDUMP

I’m running nstcpdump.sh against the secondary HA node to confirm that the ports can be reached. You could telnet to the TCP ports as well (the NSIP is used for HA; remember, the SNIP is used for back-end services).

root@ns# nstcpdump.sh -c 8 -nn host 10.99.99.17
reading from file -, link-type EN10MB (Ethernet)

13:42:56.053129 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
13:42:56.053130 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
13:42:56.053130 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
13:42:56.112268 IP 10.99.99.17.3003 > 10.99.99.16.3003: UDP, length 272
13:42:56.112299 IP 10.99.99.17.3003 > 10.99.99.16.3003: UDP, length 272
13:42:56.112328 IP 10.99.99.17.3003 > 10.99.99.16.3003: UDP, length 272
13:42:56.253119 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272
13:42:56.253119 IP 10.99.99.16.3003 > 10.99.99.17.3003: UDP, length 272

Note: My tcpdump options are -c (how many packets to capture) and -nn (display all ports and IP addresses in numerical form), and the filter is host (matches traffic to or from the given node, i.e. the secondary HA node). You can also combine filters, e.g. “host XX.XX.XX.XX and port 3003”
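
To check quickly that heartbeats flow in both directions, the capture can be summarised per direction. This is only a sketch, assuming a POSIX shell with awk and the output format shown in the dump above. The function name count_heartbeats is hypothetical.

```shell
# Hypothetical helper: reads nstcpdump.sh output on stdin and prints how many
# packets with source port 3003 were seen per "source > destination" pair.
count_heartbeats() {
  awk '$2 == "IP" && $4 == ">" && $3 ~ /\.3003$/ {
         dst = $5
         sub(/:$/, "", dst)          # strip the trailing colon after the port
         n[$3 " > " dst]++
       }
       END { for (k in n) print n[k], k }' | sort -k2
}

# Usage on the appliance shell:
#   nstcpdump.sh -c 8 -nn host 10.99.99.17 and port 3003 | count_heartbeats
```

Fed the eight packets from the dump above, it reports 5 packets from 10.99.99.16 and 3 from 10.99.99.17. If one direction is missing entirely, look at the switch or firewall before blaming the appliance.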

As you can see from the dump, my nodes are able to communicate successfully using UDP 3003 and Telnet. So what else could be the problem? My appliances are on the same firmware, there are no L2 loops (LA channels exist), the NSVLAN is at its default, etc. Could it be the RPC node password??

 

Cause: RPCNode Password was invalid

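Now I drop into the shell and view the auth.log (cat /var/log/auth.log). Notice the failed and invalid password entries for the internal #nsinternal# user coming from the other node (10.99.99.17):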

Apr 20 12:37:26  ns sshd[11510]: Accepted password for #nsinternal# from 10.99.99.17 port 18412 ssh2
Apr 20 12:37:26  ns sshd[11511]: Failed password for #nsinternal# from 10.99.99.17 port 37456 ssh2
Apr 20 12:37:26  ns sshd[11511]: Accepted password for #nsinternal# from 10.99.99.17 port 37456 ssh2
Apr 20 12:37:26  ns sshd[11510]: Received disconnect from 10.99.99.17: 11: disconnected by user
Apr 20 12:37:26  ns sshd[11511]: Received disconnect from 10.99.99.17: 11: disconnected by user
Apr 20 12:37:32  ns sshd[11519]: error: Invalid username or password
Apr 20 12:37:32  ns sshd[11520]: error: Invalid username or password
Apr 20 12:37:40  ns sshd[11531]: error: Invalid username or password
Apr 20 12:37:40  ns sshd[11532]: error: Invalid username or password
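
On a busy appliance the auth.log gets long, so counting the entries is easier than scrolling. A minimal sketch, assuming a POSIX shell with awk and the sshd line format shown above. The function name summarize_auth is made up for this example.

```shell
# Hypothetical helper: reads auth.log lines on stdin and prints how many sshd
# password attempts failed and how many were accepted.
summarize_auth() {
  awk '/sshd\[[0-9]+\]: (Failed password|error: Invalid username or password)/ { failed++ }
       /sshd\[[0-9]+\]: Accepted password/ { accepted++ }
       END { printf "failed=%d accepted=%d\n", failed, accepted }'
}

# Usage on the appliance shell:
#   summarize_auth < /var/log/auth.log
```

For the nine lines above this prints failed=5 accepted=2. A failed count that keeps growing while HA sync is failing points towards the RPC node password.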

 

Resolution: Reset the RPCNode Password.

From NetScaler GUI | System | Network | RPC, right-click on your NSIP (primary node) and type in a password (or let the system auto-generate one).

Type in the same password for the remote HA node IP (on the primary appliance itself). Save the configuration. Verify the HA status (Synchronization State should be SUCCESS).

Update 20/04/2015 – I had to log on to the secondary appliance as well and reset the RPC password as above (for the NSIP and the remote node IP ONLY).
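
If you prefer the command line over the GUI, the same reset can be sketched as below. I’m assuming the NetScaler CLI commands set ns rpcNode, save ns config and show ha node here, so please check them against the command reference for your firmware. The helper name rpc_reset_commands and the two example IPs (taken from the dump above) are only for illustration, and <password> is a placeholder you replace yourself. The helper only prints the lines to paste into the NetScaler CLI of each appliance; it does not change anything.

```shell
# Hypothetical helper: prints the NetScaler CLI lines to paste on EACH
# appliance (primary and secondary). Assumes `set ns rpcNode`, `save ns config`
# and `show ha node` exist on your firmware. <password> is a placeholder.
rpc_reset_commands() {
  for ip in "$1" "$2"; do
    printf 'set ns rpcNode %s -password <password>\n' "$ip"
  done
  printf 'save ns config\nshow ha node\n'
}

# Example: local NSIP first, then the remote HA node IP
rpc_reset_commands 10.99.99.16 10.99.99.17
```

Use the same password for both entries and on both appliances, then check the synchronization state in the show ha node output.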

 

Hope this helps in your journey of getting to grips with NetScaler troubleshooting…

 
