Hi guys,
We have been running a 3-node cluster for the last 3-4 years and have had growing issues moving from version to version, and also since activating OnGuard as a posture mechanism.
We have lost access to our publisher several times as the load simply skyrockets and the httpd process is completely hammered by the WebAuths. Here is the configuration we have:
C2000 x 3
1x Publisher
2x Subscribers
Everything is load-balanced across the servers. InsightDB is enabled on only a single subscriber since, again, load is an issue.
Since httpd is used for both the WebUI and OnGuard, is it recommended to have the Publisher node do the lighter lifting (TACACS, management) and leave RADIUS/WebAuth to the subscribers?
I am unsure if the newer versions are more taxing on the servers or if it is simply bugs, but 2-3 times in the last few weeks our WebUI has been almost non-responsive, the server has rejected TACACS because of load, and we have been unable to do anything since we couldn't access the WebUI.
Thinking of the wireless controller approach, we thought it might be a good idea to remove WebAuth from the Publisher completely so that it could never impact the httpd process there, and to move TACACS exclusively to that server for now. Typically, the servers are stable and don't tend to crash, but since this version (6.7.9 now) we have had some load issues that caused short outages. Today we had a 9-hour call with TAC to regain access to the publisher: we lost access to it during the weekend and, after the 24-hour period, the subscribers became Publishers... everything went bad.
Everything is back now, but it feels as though this will happen again unless we modify the current setup. We may look into adding a 4th node to split the load even more and to have a failover publisher and TACACS redundancy, but that is down the road.
All this to ask: is the typical design to have the publisher act like a master controller and leave the grunt work to the subscribers? I never came across this in the past as OnGuard was never in play, so the httpd process was not taxed the way it is now.
We have around 4000 devices, probably 1500-1700 concurrent.
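To make the proposed split concrete, here is a rough sketch in Python of the current layout versus the one we are considering. The node names, the service labels, and the assumption that WebAuth load spreads evenly over the nodes serving it are all just for illustration. None of this is actual ClearPass configuration.

```python
# Rough model only: node names, service labels and the even-spread assumption
# are illustrative, not ClearPass configuration.
CURRENT_LAYOUT = {
    "publisher": {"RADIUS", "WebAuth", "TACACS", "Management"},
    "subscriber-1": {"RADIUS", "WebAuth", "TACACS"},
    "subscriber-2": {"RADIUS", "WebAuth", "TACACS"},
}

PROPOSED_LAYOUT = {
    "publisher": {"TACACS", "Management"},
    "subscriber-1": {"RADIUS", "WebAuth"},
    "subscriber-2": {"RADIUS", "WebAuth"},
}


def nodes_serving(layout, service):
    """Return the sorted names of the nodes that handle the given service."""
    return sorted(node for node, services in layout.items() if service in services)


def share_per_node(layout, service, concurrent_devices):
    """Concurrent devices per node, assuming an even spread across serving nodes."""
    nodes = nodes_serving(layout, service)
    if not nodes:
        raise ValueError(f"no node serves {service}")
    return concurrent_devices / len(nodes)


if __name__ == "__main__":
    for name, layout in (("current", CURRENT_LAYOUT), ("proposed", PROPOSED_LAYOUT)):
        low = share_per_node(layout, "WebAuth", 1500)
        high = share_per_node(layout, "WebAuth", 1700)
        print(f"{name}: WebAuth on {nodes_serving(layout, 'WebAuth')}, "
              f"~{low:.0f}-{high:.0f} concurrent devices per node")
```

On those assumptions, the per-node share of WebAuth goes from roughly 500-567 concurrent devices today to 750-850 on each subscriber, so taking the Publisher out of WebAuth only helps if the subscribers have that headroom. A 4th node added as a third WebAuth subscriber would bring the share back to about 500-567.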
Any insight would be appreciated.