Found a recurrence of a funny issue I had in versions of CPPM before 6.6 (I can't remember exactly which version).
Before I raise a TAC case (the problem is you can't reproduce this to order, so it's hard to track down) I thought I'd run this past people on the list to see if anyone else has seen it.
When creating CPPM services I configure them to proxy RADIUS accounting packets to a third-party server (or servers). This other server (FreeRADIUS 3.0.15) then processes the accounting packets, stores the data in a PostgreSQL DB and sends it off to a Logstash server. Doing this ensures that we have a centralised set of RADIUS accounting data covering our whole RADIUS authentication service, not just CPPM-related authentication.
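In case anyone wants to poke at a similar setup, something like this quick pyrad sketch will fire a test Acct-Start at the FR box to confirm the accounting path end to end. The hostname, shared secret and attribute values are all placeholders, not our actual config:

#!/usr/bin/env python3
# Quick sketch: fire a test Acct-Start at the third-party accounting
# server to confirm it is accepting and processing packets.
# Hostname, shared secret and attribute values are all placeholders.
from pyrad.client import Client, Timeout
from pyrad.dictionary import Dictionary

srv = Client(
    server="acct.example.org",      # placeholder for the FR 3.0.15 box
    acctport=1813,
    secret=b"testing123",           # placeholder shared secret
    dict=Dictionary("dictionary"),  # path to a standard RADIUS dictionary file
)

req = srv.CreateAcctPacket(User_Name="healthcheck@example.org")
req["Acct-Status-Type"] = "Start"
req["Acct-Session-Id"] = "healthcheck-0001"
req["NAS-IP-Address"] = "192.0.2.1"

try:
    reply = srv.SendPacket(req)   # raises Timeout if no Accounting-Response
    print("Accounting-Response received, code", reply.code)
except Timeout:
    print("No Accounting-Response - is the server accepting accounting?")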
At one point in the past I found that if I set up too many services with this accounting proxy config, the RADIUS service process would stop. You then either had to restart it by hand or wait for a watchdog process to do it for you. Either way there was a loss of authentication for a while.
This issue seemed to go away in the CPPM 6.6 releases, and since then I've had a fair number of services proxying accounting data quite happily.
Up till now ...
I have a CPPM 6.6.7 test cluster comprising a 5K hardware appliance and a 25K VM. This cluster services the wired and wireless connectivity in our building ... it's not a big building and has few Wi-Fi connected devices, something the smallest CPPM server should cope with easily.
This 6.6.7 cluster config was copied from the production 6.6.5 cluster.
For a while we've been having issues with the RADIUS service process stopping or failing to successfully authenticate EAP-TLS connections. We perform EAP-based health checks for EAP-PEAP/MSCHAPv2 and EAP-TLS every few minutes/seconds as appropriate, and get SMS messages if there is a failure in the health-check authentication process.
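If anyone wants to run similar checks, a minimal wrapper around wpa_supplicant's eapol_test looks roughly like the sketch below. The server address, shared secret, config paths and the SMS hook are placeholders rather than our actual setup:

#!/usr/bin/env python3
# Sketch of an EAP health check wrapping wpa_supplicant's eapol_test,
# which exits 0 on a successful authentication. Server address, secret
# and config paths are placeholders; the SMS alert is stubbed out.
import subprocess

CHECKS = {
    "EAP-PEAP/MSCHAPv2": "/etc/healthcheck/peap-mschapv2.conf",
    "EAP-TLS": "/etc/healthcheck/eap-tls.conf",
}
RADIUS_SERVER = "192.0.2.10"   # placeholder CPPM node under test
SHARED_SECRET = "testing123"   # placeholder

def send_sms(message):
    # Stub: hook this up to your SMS gateway of choice.
    print("ALERT:", message)

for name, conf in CHECKS.items():
    try:
        result = subprocess.run(
            ["eapol_test", "-c", conf, "-a", RADIUS_SERVER,
             "-p", "1812", "-s", SHARED_SECRET],
            capture_output=True, timeout=30)
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False   # treat a hung authentication as a failure too
    if not ok:
        send_sms(name + " health check failed against " + RADIUS_SERVER)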
Four days ago I removed all the accounting proxying from the 6.6.7 cluster, apart from one service dealing with our Wi-Fi connectivity to eduroam ... and to date we haven't had any service disruption ... not a peap!
Because this is a random issue (the server can run for hours or days before a problem arises) it's hard to track down.
I'm guessing there's a "running out of threads" issue here, but I've never found any logs to show what's happening. I don't particularly want to upgrade our production cluster to 6.6.7 and then have things die on us.
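If anyone fancies watching for the same thing, here's a rough psutil sketch that samples the radius process's thread count every minute, so there's some evidence on disk the next time it dies. The process name is an assumption, and on a locked-down CPPM appliance you'd likely need TAC/support shell access to run anything like this:

#!/usr/bin/env python3
# Sketch of a watcher that logs the RADIUS service's thread count
# once a minute. The process name is an assumption - adjust to
# whatever the CPPM radius process is actually called.
import time
import psutil

PROC_NAME = "radiusd"   # assumption: substitute the real process name

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    found = False
    for p in psutil.process_iter(["name"]):
        if p.info["name"] != PROC_NAME:
            continue
        found = True
        try:
            print(stamp, "pid", p.pid, "threads", p.num_threads())
        except psutil.NoSuchProcess:
            pass   # process died between listing and sampling
    if not found:
        print(stamp, PROC_NAME, "not running!")
    time.sleep(60)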
Anyone seen this sort of thing?
Rgds
Alex