Wireless Access

Access network design for branch, remote, outdoor, and campus locations with HPE Aruba Networking access points and mobility controllers.

RAP Cluster Upgrade Best Practices

  • 1.  RAP Cluster Upgrade Best Practices

    MVP
    Posted Sep 04, 2022 10:48 AM
    Hey AirHeads,

    We have a cluster of 4 Mobility Controllers supporting about 1,000 RAPs. We are currently running 8.6.0.5 and looking to upgrade to 8.6.0.18, but I understand that the traditional cluster live upgrade process is not supported for RAPs. Given that, what is the recommended procedure for upgrading a RAP environment in AOS 8? My thought:

    1. Upload new image to backup partition on Mobility Controllers
    2. Pre-load image on RAPs
    3. Reload 2 controllers, wait until they start pinging again, then reload the remaining 2 controllers
    4. RAPs will contact the first 2 (now upgraded) controllers, detect the firmware change, and reload into the new firmware
    5. The remaining 2 controllers will come back online, and once the RAPs have reloaded into 8.6.0.18, they will load balance normally

    Does this process sound correct? Are there any changes to the plan you would recommend, e.g. bringing all 4 controllers down at the same time vs. 2 at a time?
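
    For steps 1 and 2, this is roughly the CLI I have in mind (syntax from memory, so the preload options in particular may differ on 8.6 - please correct me if so):

        ! Step 1: stage 8.6.0.18 on the backup partition of each controller
        ! (run "show image version" first to confirm which partition is the non-boot one)
        copy ftp: <ftp-server> <username> <ArubaOS_8.6.0.18_image> system: partition 1
        show image version

        ! Step 2: pre-load the new image to the RAPs and watch progress
        ! (exact options vary by release - check "ap image-preload ?")
        ap image-preload activate all-aps partition 1 max-downloads 100
        show ap image-preload status summary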

    Thanks all!

    ------------------------------
    Michael Haring
    ------------------------------


  • 2.  RE: RAP Cluster Upgrade Best Practices
    Best Answer

    EMPLOYEE
    Posted Sep 05, 2022 04:33 AM
    How do the RAPs 'discover' the controllers? Is that a DNS record pointing to all 4 public IP addresses?

    When you reboot the active anchor controller, the AP will switch to the standby anchor controller; if that one has the old firmware, the AP will keep working, and if it has the upgraded firmware, the AP should reboot into the pre-loaded firmware slot. If the AP then reboots and round-robin DNS assigns it one of the old-version controllers, it will downgrade again. With that in mind, it sounds like your plan to reboot the other 2 controllers as soon as the first 2 upgraded ones are online may give slightly lower downtime; just make sure that at no point will a rebooting AP see an active controller with the old version.
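
    If you want to see how the RAPs are spread over the cluster and which controller each one is anchored to before and during the reboots, something along these lines should do it (from memory, double-check on your release):

        show lc-cluster group-membership       ! cluster members and their state
        show lc-cluster load distribution ap   ! active/standby AP count per cluster member
        show ap database long                  ! per-AP view including the current and standby controller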

    There will be downtime while the APs reboot anyway, so if the boot time of the controllers is less than that of the APs, there should not be a real difference in system downtime. It just 'feels scary' to reload all the controllers at the same time, but reloading 2 at a time may be just as scary and increases the total time for the upgrade. With fast-booting APs the 2-at-a-time approach may give lower downtime, or higher if you reboot both the AP Anchor Controller (AAC) and the standby AAC for an AP, as it will then downgrade again on the remaining controllers before the upgraded ones are back up.

    You may ask TAC for a second opinion, but I would expect no significant difference in risk or downtime between reloading 2 + 2 and reloading all 4 at once; in that case reloading all 4 may be easier and less complex to manage. If you have a global/DNS load balancer assigning the controller IPs, you could do something smarter by creating 2 clusters and moving RAPs between them in a more controlled way, by AP group or geographical location. It also depends on how critical the deployment is and how much downtime you can accept. Hope these thoughts help...

    ------------------------------
    Herman Robers
    ------------------------------
    If you have urgent issues, always contact your Aruba partner, distributor, or Aruba TAC Support. Check https://www.arubanetworks.com/support-services/contact-support/ for how to contact Aruba TAC. Any opinions expressed here are solely my own and not necessarily that of Hewlett Packard Enterprise or Aruba Networks.

    In case your problem is solved, please invest the time to post a follow-up with the information on how you solved it. Others can benefit from that.
    ------------------------------



  • 3.  RE: RAP Cluster Upgrade Best Practices

    MVP
    Posted Sep 05, 2022 09:15 AM
    Thanks Herman, we have a public DNS entry that points to all 4 controllers in our cluster. My main goal is to minimize downtime. My thought was that if I time it right, the RAPs will only reboot once for the code upgrade, as opposed to going down because the controllers are unreachable, then coming back up only to require a firmware upgrade and a second reboot. It's probably a long shot, but I'm hoping to keep the total downtime under 30 minutes, so I figured that starting by upgrading just 2 controllers would keep the RAPs up as long as possible.

    I really appreciate the feedback and will go over my documentation / procedure again prior to the upgrade to make sure everything is thought through.

    ------------------------------
    Michael Haring
    ------------------------------



  • 4.  RE: RAP Cluster Upgrade Best Practices

    EMPLOYEE
    Posted Sep 05, 2022 10:59 AM
    In this scenario, you should focus on predictability.  If you upgrade all 4 controllers and then reboot them at the same time, any access point that connects should upgrade, and you should see them reboot and come back.

    Preloading is nice, but it adds yet another thing to keep track of.  Make things predictable by upgrading all controllers and then manually rebooting them at the same time.  You can then type "show ap database-summary" on the MM to monitor your upgrade.  With remote APs, every device has a different link speed, so that is one more variable.  You want to minimize all of your variables.  Making the controllers available all at the same time and monitoring all devices as they reconnect is the best way to do this.  The time you "lose" by not doing a preload will be regained in predictability.
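
    For example, during the reboot window you can just keep repeating something like the following and watch the up/down counts recover (double-check the exact filter keywords on your version):

        show ap database-summary         ! totals of APs up/down per type
        show ap database status down     ! list the APs that have not come back yet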

    EDIT:  A downtime of 30 minutes is very aggressive, given that you do not have LAN speeds at your disposal and you are dealing with varying hop counts, MTUs, and performance across dozens of networks.  I would shoot for 2 hours.  You can then look at the history of any devices that are still down to understand what could be keeping them from coming up (IKE throttling, issues on the customer side, etc.).

    ------------------------------
    Any opinions expressed here are solely my own and not necessarily that of Hewlett Packard Enterprise or Aruba Networks.

    HPE Design and Deploy Guides: https://community.arubanetworks.com/support/migrated-knowledge-base?attachments=&communitykey=dcc83c62-1a3a-4dd8-94dc-92968ea6fff1&pageindex=0&pagesize=12&search=&sort=most_recent&viewtype=card
    ------------------------------



  • 5.  RE: RAP Cluster Upgrade Best Practices

    MVP
    Posted Sep 05, 2022 11:22 AM
    Thanks, Collin, for the feedback as well. I will consider reloading all 4 at once, but preloading still seems like a good idea to reduce downtime, right? I know it's another thing to track, but I don't mind pushing the firmware via preload prior to the outage window, then reloading all 4 controllers when the window begins and monitoring the RAPs as they reconnect. I would hope that keeps all RAPs on a level playing field in terms of bandwidth and latency: since they will already have the image file, it should be a normal reboot cycle, which takes between 5 and 10 minutes in most of our cases.

    Thanks.

    ------------------------------
    Michael Haring
    ------------------------------





  • 6.  RE: RAP Cluster Upgrade Best Practices

    EMPLOYEE
    Posted Sep 05, 2022 11:33 AM
    Again, I am stressing variability.  You would need to:

    1. Upgrade all of the controllers without rebooting first, because you can only preload the boot partition of a controller onto an access point.
    2. Make sure all of the preloads have completed, or account for devices that errored out or only partially finished (see the status-check sketch below).
    3. There is also a cap on the number of preloads you can do simultaneously, so you would need to manage that, and that could push your window.
    4. You only get one shot at preloading per controller reboot, so if it doesn't work, you do not necessarily get to try again. Do you then try to track thousands of devices that didn't complete and attempt to retry them?
    5. Some preloads could take a long time because of a combination of #3 and low bandwidth.
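
    If you do go the preload route anyway, the status commands are what you would live in; roughly this (exact syntax may differ by release, check "show ap image-preload ?"):

        show ap image-preload status summary   ! counts of preloaded / in-progress / failed APs
        ap image-preload cancel                ! back out of a preload that will not finish in time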

    Preloading is a good idea on LAN networks, where robust bandwidth removes a variable.  With remote access networks, you do not have physical access to those devices, so you want to remove as many steps and possibilities for failure as you can.  You cannot depend on the end user to reboot a device, so you want to allow enough time, keep the steps few, and make sure you have enough window for the unforeseen.

    My opinion.

    ------------------------------
    Any opinions expressed here are solely my own and not necessarily that of Hewlett Packard Enterprise or Aruba Networks.

    HPE Design and Deploy Guides: https://community.arubanetworks.com/support/migrated-knowledge-base?attachments=&communitykey=dcc83c62-1a3a-4dd8-94dc-92968ea6fff1&pageindex=0&pagesize=12&search=&sort=most_recent&viewtype=card
    ------------------------------



  • 7.  RE: RAP Cluster Upgrade Best Practices

    MVP
    Posted Sep 05, 2022 11:41 AM
    Excellent points, thank you for the feedback. This is our first AOS 8 upgrade for the RAP environment, so there are quite a few unknowns on our end, but I will take all of your points and recommendations into account for this upgrade.

    Thanks again!

    ------------------------------
    Michael Haring
    ------------------------------





  • 8.  RE: RAP Cluster Upgrade Best Practices

    MVP
    Posted Sep 12, 2022 06:24 PM
    Upgrade completed successfully - some helpful notes for others:

    - Preloading the firmware had some hiccups. Initially I tried doing it via the WebUI, but for some reason it preloaded the same firmware that was already running, even though I selected the correct partition. Preloading from the CLI was much smoother and worked exactly as expected.

    - I reloaded controller 1, confirmed cluster failover, reloaded controller 2, confirmed cluster failover, waited 2 minutes, reloaded controller 3, confirmed failover, and then, once controller 1 started pinging again, reloaded controller 4 (see the failover checks sketched after these notes). Overall, it took about 35 minutes before I had 99%+ of the environment operational again.

    - No issues following the upgrade, though a few users had to reboot their PCs.
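
    For anyone repeating this, confirming cluster failover between reloads comes down to a couple of commands along these lines (they may vary slightly by release):

        show lc-cluster group-membership   ! verify the reloaded node has left and then rejoined the cluster
        show ap database-summary           ! watch the RAP up/down counts recover before the next reload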

    ------------------------------
    Michael Haring
    ------------------------------