I am working with TAC, but I am also looking for anyone else who may have experienced what I am seeing.
We currently have one 3400 (let's call it M3-RAP1), serving about 150 RAPs (a 50/50 mix of RAP 2s and RAP 3s) as the primary LMS, and our master controller (which also serves our wireless environment; call it N8-CON1). We ordered two new 3400s (call them M4-RAP1 and M4-RAP2) to set up as a cluster and provide carrier redundancy through BGP. Eventually we will take the existing 3400 (M3-RAP1) and put it in DR.
I had to get the configuration from the existing cluster over to the new cluster (M4-RAP1 and M4-RAP2), so I added them as local controllers to N8-CON1. I then detached them and created the new cluster. I created a new profile with the VRRP address of the new cluster as the primary LMS and the IP of M3-RAP1, which is assigned directly on the physical interface, as the backup LMS. For licensing reasons I upgraded the new cluster to 6.3.1.8, whereas the old cluster is on 6.2.1.7. I didn't want to install 2 sets of 150 licenses on each controller when we have plans to migrate to 6.3 in the near future anyway.
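For anyone comparing configurations, the new profile described above looks roughly like the sketch below. The profile name and both addresses are placeholders, not our real values, and the syntax is written from memory of the ArubaOS 6.x CLI, so treat it as an illustration rather than a copy of the running config.

```
! Sketch only: profile name and IP addresses are placeholders
ap system-profile "rap-new-cluster"
   lms-ip 192.0.2.10
   bkup-lms-ip 198.51.100.20
!
ap-group "TEST"
   ap-system-profile "rap-new-cluster"
```

Here 192.0.2.10 stands in for the VRRP address of the new cluster and 198.51.100.20 for the interface IP of M3-RAP1.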
With all basic connectivity in place and the configurations matched up (including the whitelists), I changed the profile of a RAP from the CLI. The change took, the RAP connected to the new controller and started upgrading, then rebooted and went down. I have also seen this behavior on the old cluster. It seems to bounce back and forth. The RAP shows up in the AP database on the new cluster, so I know it is connecting; it just won't come up on the new cluster.
If I hard reset the RAP and enter the LMS of the new cluster, it connects with no issue. I started looking at the configuration of the profile and its child objects, noticed a few small inconsistencies, and fixed them. To fully test this, I created a brand new profile and changed only the LMS and backup LMS IPs. I also tried removing the backup LMS to make sure the RAP connected only to the new cluster. I noticed an interesting behavior here: even though the backup LMS was not defined in the system profile, the RAP connected back to the original LMS. It continues to go through the upgrade, reboot, down cycle, but it still found its way back. I tried changing the LMS IP (without a backup) to the IP on M4-RAP1 to eliminate VRRP as an issue; no dice. I also changed the LMS IP to the public IP of each respective controller. For example, on M4-RAP1 the LMS is 1.2.3.4 for the TEST AP group, and it is 4.3.2.1 for M3-RAP1. Since the clusters are separate there is no sync, so I have been careful to replicate every change made on one cluster exactly on the other.
I can easily (from an engineer's perspective) fix this problem by resetting the RAP. The problem is that the RAPs are deployed to non-technical individuals at their homes, and our track record is not great right now, so we need a zero-touch solution.
Thanks in advance!