Recently I got a pair of Aruba S2500 switches, configured as a stack with two stack interconnections. Assume no clients are plugged in, since the issue occurs regardless of what's connected. My issue is that after a day or two the SFP+ stack ports between the switches just shut down. The ports are not admin shut down; the LED turns off and on at roughly 5-second intervals and the link drops. The stack does not fail over to the secondary connection; it just hangs. This happens on both stack interfaces in the same way, and it also happens when I drop to just one stack interconnection. The only way to recover is to unseat and reseat the SFP module in the switch. The other 10G ports are not affected when connected to, say, a PC, and none of the copper ports have this issue. The issue also happens when I drop the stack config and just use traditional trunk ports. Lastly, this seems to happen only between 11:00 and 11:30 am Central; it never fails outside of that window.
The ports do show some octet errors, but that's it, and I am not sure whether they are the cause of the failure or a result of it. This looks like a straightforward case of a bad SFP or fiber, but please consider what I have already tried:
- Replaced fiber cable (OM3 fiber)
- Tried a different SFP+ port (even swapped with a port that had no issues with my PC on a standard trunk)
- Tried 5 different brands of SR 10G SFP+ modules, including authentic Aruba optics
- Replaced the affected optics with LR (single-mode) 10G optics on both sides, along with single-mode cables
- Tried a standard trunk, an access port, and stacking
- Issue occurs on both switches, even when split and independent
- Both switches are on stable, UPS-conditioned power
- Firmware is patched to latest (Aruba OS v7.4.0.6), but I also tried an older version (v7.4.0.4)
I have no idea what else to try. Despite everything above, the issue still occurs, usually every 24-48 hours. As noted above, I see octet errors on the affected interface but that's it, and pulling the SFP and reseating it clears the issue.
Here are the errors I get:
Logs and info:
Aruba Operating System Software.
ArubaOS (MODEL: ArubaS2500-24P-US), Version 7.4.0.6
Website: http://www.arubanetworks.com
Copyright (c) 2016 Aruba, a Hewlett Packard Enterprise company.
Compiled on 2018-01-11 at 00:15:44 PST (build 63167) by p4build
ROM: System Bootstrap, Version CPBoot 1.0.34.0 (build 32670)
Built: 2012-03-06 02:43:38
Built by: p4build@re_client_32670
Switch uptime is 1 days 19 hours 51 minutes 4 seconds
Reboot Cause: User reboot (0x86:0x78:0x402b)
Processor XLS 208 (revision A1) with 1023M bytes of memory.
955M bytes of System flash
Activation Key: LZQWURUG
Errorlog Snippet as issue starts (full logs below):
Jun 30 10:14:18 aaa_proxy[1404]: <341312> <ERRS> |aaa_proxy| Unable to connect to MASTER AAA proxy socket:Operation now in progress
Jun 30 10:14:18 nanny[1370]: <399816> <ERRS> |nanny| Terminating process /mswitch/bin/profmgr, pid 4249
Jun 30 10:14:18 nanny[1370]: <399816> <ERRS> |nanny| Terminating process /mswitch/bin/udbserver, pid 4251
Jun 30 10:14:18 nanny[1370]: <399816> <ERRS> |nanny| Terminating process /mswitch/bin/ntpwrap, pid 4252
Jun 30 10:14:18 stackmgr[1467]: <399803> <ERRS> |stackmgr| An internal system error has occurred at file ../ncfg_profmgr.c function ncfg_profmgr_recv_bytes line 65 error recv returned 0, expecting 5.
Jun 30 10:14:18 cfgm[1437]: PAPI_Send: sendto HTTPD Manager failed: Connection refused Message Code 5004 Sequence Num is 208
Jun 30 10:14:18 stackmgr[1467]: <399803> <ERRS> |stackmgr| An internal system error has occurred at file ncfg_gcore.c function ncfg_profmgr_task_based_error_handler line 165 error ncfg_profmgr_task_based_error_handler:Profile manager most probably died, context:0x100006b8. Handling error for task-based app and resyncing with profile-manager.
Jun 30 10:14:18 cfgm[1437]: PAPI_Send: sendto Interface Manager failed: Connection refused Message Code 5004 Sequence Num is 210
Jun 30 10:14:18 cfgm[1437]: PAPI_Send: sendto DHCP Daemon failed: No such file or directory Message Code 5004 Sequence Num is 211
Jun 30 10:14:18 cfgm[1437]: PAPI_Send: sendto Profile Manager failed: Connection refused Message Code 5004 Sequence Num is 212
Jun 30 10:14:18 cfgm[1437]: PAPI_Send: sendto IKE failed: Connection refused Message Code 5004 Sequence Num is 213
Jun 30 10:14:18 cfgm[1437]: PAPI_Send: sendto Authentication failed: Connection refused Message Code 5004 Sequence Num is 214
Jun 30 10:14:18 cfgm[1437]: PAPI_Send: sendto User Database Server failed: Connection refused Message Code 5004 Sequence Num is 215
Jun 30 10:14:18 cfgm[1437]: <307228> <ERRS> |cfgm| Error Accepting a connection to the Master Config socket:Resource temporarily unavailable
Jun 30 10:14:19 im[6606]: <399803> <ERRS> |im| An internal system error has occurred at file ../ncfg_profmgr.c function ncfg_profmgr_setup_prof_socket line 556 error Unable to connect to profmgr: Connection refused.
Jun 30 10:14:19 im[6606]: <330006> <ERRS> |im| Connection failure with profile manager, Unable to connect to profmgr
Jun 30 10:14:19 im[6606]: <399803> <ERRS> |im| An internal system error has occurred at file ncfg_gcore.c function ncfg_gated_profmgr_connect line 86 error ncfg_gated_profmgr_connect:ERROR: ncfg_init failed:14.
Jun 30 10:14:19 cfgm[1437]: PAPI_Send: sendto HTTPD Manager failed: Connection refused Message Code 5004 Sequence Num is 216
Jun 30 10:14:19 cfgm[1437]: PAPI_Send: sendto DHCP Daemon failed: No such file or directory Message Code 5004 Sequence Num is 219
Jun 30 10:14:19 cfgm[1437]: PAPI_Send: sendto Profile Manager failed: Connection refused Message Code 5004 Sequence Num is 220
Jun 30 10:14:19 cfgm[1437]: PAPI_Send: sendto IKE failed: Connection refused Message Code 5004 Sequence Num is 221
Jun 30 10:14:19 cfgm[1437]: PAPI_Send: sendto Authentication failed: Connection refused Message Code 5004 Sequence Num is 222
Jun 30 10:14:19 cfgm[1437]: PAPI_Send: sendto User Database Server failed: Connection refused Message Code 5004 Sequence Num is 223
Jun 30 10:14:21 activate[6641]: PAPI_Send: sendto Profile Manager failed: Connection refused Message Code 0 Sequence Num is 2
Jun 30 10:14:21 authmgr[6610]: <199802> <ERRS> |authmgr| main.c, main:510: Auth started/restarted or switchover happened
Jun 30 10:14:22 qosmgr[6633]: PAPI_Send: sendto Profile Manager failed: Connection refused Message Code 0 Sequence Num is 2
Jun 30 10:14:22 rmon[6636]: PAPI_Send: sendto Profile Manager failed: Connection refused Message Code 0 Sequence Num is 2
Jun 30 10:14:23 cmica[1461]: task_close: close Chassis Agent socket.-1: Bad file descriptor
Jun 30 10:14:23 aaa[6605]: PAPI_Send: sendto User Database Server failed: Connection refused Message Code 0 Sequence Num is 3
Jun 30 10:14:23 mon_ssm[6659]: PAPI_Init: timeout of 0 specified set to default 100 millisec.
Jun 30 10:14:23 mon_ssm[6663]: PAPI_Init: timeout of 0 specified set to default 100 millisec.
Jun 30 10:14:23 mon_ssm[6664]: PAPI_Init: timeout of 0 specified set to default 100 millisec.
Jun 30 10:14:23 mon_ssm[6662]: PAPI_Init: timeout of 0 specified set to default 100 millisec.
Jun 30 10:14:23 mon_ssm[6661]: PAPI_Init: timeout of 0 specified set to default 100 millisec.
Jun 30 10:14:23 stackmgr[1467]: <399803> <ERRS> |stackmgr| An internal system error has occurred at file ../ncfg_profmgr.c function ncfg_profmgr_setup_prof_socket line 556 error Unable to connect to profmgr: Connection refused.
Jun 30 10:14:23 stackmgr[1467]: <330006> <ERRS> |stackmgr| Connection failure with profile manager, Unable to connect to profmgr
Jun 30 10:14:23 stackmgr[1467]: <399803> <ERRS> |stackmgr| An internal system error has occurred at file ncfg_gcore.c function ncfg_task_based_profmgr_reconnect line 131 error ncfg_task_based_profmgr_reconnect:Unable to reconnect with profile-manager. Setting task timer to 5.
Jun 30 10:14:23 ChassisManager[6629]: <335309> <ALRT> |ChassisManager| Power supply detected on slot 0
Jun 30 10:14:23 ChassisManager[6629]: <335308> <ALRT> |ChassisManager| Module 1 2010067 (4-Port) detected on slot 0
Jun 30 10:14:23 ChassisManager[6629]: <335309> <ALRT> |ChassisManager| Power supply detected on slot 1
Jun 30 10:14:23 ChassisManager[6629]: <335308> <ALRT> |ChassisManager| Module 1 2010067 (4-Port) detected on slot 1
Jun 30 10:14:25 cmica[1461]: task_close: close Chassis Agent socket.-1: Bad file descriptor
Jun 30 10:14:25 certmgr[6604]: <118004> <ERRS> |certmgr| Received unknown message
Jun 30 10:14:25 im[6606]: <399803> <ERRS> |im| An internal system error has occurred at file ../ncfg_profmgr.c function ncfg_profmgr_setup_prof_socket line 556 error Unable to connect to profmgr: Connection refused.
Jun 30 10:14:25 im[6606]: <330006> <ERRS> |im| Connection failure with profile manager, Unable to connect to profmgr
Jun 30 10:14:25 im[6606]: <399803> <ERRS> |im| An internal system error has occurred at file ncfg_gcore.c function ncfg_gated_profmgr_connect line 86 error ncfg_gated_profmgr_connect:ERROR: ncfg_init failed:14.
Jun 30 10:14:25 certmgr[6604]: <118004> <ERRS> |certmgr| Received unknown message
Jun 30 10:14:25 publisher[1436]: PAPI_Send: sendto Auth Survival Server failed: Connection refused Message Code 0 Sequence Num is 61
Jun 30 10:14:26 certmgr[6604]: <118004> <ERRS> |certmgr| Received unknown message
Jun 30 10:14:26 profmgr[6609]: <399803> <ERRS> |profmgr| An internal system error has occurred at file profmgr_stk.c function profmgr_nl_sub_cb line 152 error Our role is MASTER, re-syncing with certificate-manager .
Jun 30 10:14:28 authmgr[6610]: PAPI_AddSibyteOpcode: ReRegistering SAME call back function for opcode 0x004b sock = 15
Jun 30 10:14:28 authmgr[6610]: PAPI_AddSibyteOpcode: ReRegistering SAME call back function for opcode 0x009e sock = 16
Jun 30 10:14:28 stackmgr[1467]: <399803> <ERRS> |stackmgr| An internal system error has occurred at file ncfg_gcore.c function ncfg_task_based_profmgr_reconnect line 141 error ncfg_task_based_profmgr_reconnect:Connection to profile-manager established, setting socket..
Jun 30 10:14:30 certmgr[6604]: <118004> <ERRS> |certmgr| Received unknown message
Jun 30 10:14:30 profmgr[6609]: <399803> <ERRS> |profmgr| An internal system error has occurred at file profmgr_stk.c function profmgr_nl_sub_cb line 152 error Our role is MASTER, re-syncing with certificate-manager .
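For anyone digging through the full dump, here is a rough Python sketch for checking the time-of-day pattern. It assumes every line follows the "Mon DD HH:MM:SS process[pid]: message" layout seen in the snippet above. The names parse_line, summarize and in_window are made up for this example and are not part of any Aruba tooling. The snippet above is stamped 10:14 by the switch clock, so the 11:00-11:30 bounds are placeholders to adjust.

```python
import re
from collections import Counter
from datetime import datetime, time

# Assumed line layout, taken from the snippet above, e.g.
# "Jun 30 10:14:18 nanny[1370]: <399816> <ERRS> |nanny| Terminating process ..."
LINE_RE = re.compile(
    r"^(?P<day>[A-Z][a-z]{2}\s+\d{1,2})\s+(?P<tod>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return (day, time_of_day, process, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Only the time of day is parsed; the log carries no year.
    tod = datetime.strptime(m.group("tod"), "%H:%M:%S").time()
    return m.group("day"), tod, m.group("proc"), m.group("msg")


def summarize(lines):
    """Count lines per process and remember when each process first appears."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip banners, blank lines, anything in another format
        day, tod, proc, _msg = parsed
        counts[proc] += 1
        first_seen.setdefault(proc, (day, tod))
    return counts, first_seen


def in_window(tod, start=time(11, 0), end=time(11, 30)):
    """True if a time of day falls inside the suspected failure window (placeholder bounds)."""
    return start <= tod <= end


# Three lines copied from the snippet above, used as sample input.
SAMPLE = [
    "Jun 30 10:14:18 nanny[1370]: <399816> <ERRS> |nanny| Terminating process /mswitch/bin/profmgr, pid 4249",
    "Jun 30 10:14:18 nanny[1370]: <399816> <ERRS> |nanny| Terminating process /mswitch/bin/udbserver, pid 4251",
    "Jun 30 10:14:18 cfgm[1437]: PAPI_Send: sendto HTTPD Manager failed: Connection refused Message Code 5004 Sequence Num is 208",
]

counts, first_seen = summarize(SAMPLE)
for proc, n in counts.most_common():
    day, tod = first_seen[proc]
    print(f"{proc}: {n} lines, first at {day} {tod}, in window: {in_window(tod)}")
```

To run it against the full dump, replace SAMPLE with the lines read from the saved log file. Comparing the first-seen times across several failures should show whether they really line up with one fixed time of day.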
Full log dump and config:
https://1drv.ms/u/s!Ar4F2XsjCm_BhLonTrrg411sLGmyvg?e=YhKyeO
I appreciate any help with this, as I am at my wits' end. Let me know if I can provide any further details.