Today I experienced an odd situation at one of our sites, and I have no clue why it happened.
At this particular site we use a 5406R zl2 switch as the core, fitted with four modules and running KB.16.02.0010. Each module has eight 10 Gbit ports. We have multiple HP 2530-48G-PoE+-2SFP+ (J9853A) switches running YA.16.02.0010, each connected to the core with an LACP trunk spread across two modules: Trk1 consists of SFP+ ports 49 and 50 of the first 2530, which are connected to A1 and B1 of the 5406, and so on. Each SFP+ port in this star topology has an HP J9150A 10 Gbit transceiver and vendor-certified cabling.
This topology had been running for three years without a hassle until this morning, when trunks 1 (A1-B1), 4 (A4-B4), 5 (A5-B5), 8 (A8-B8) and 10 (C2-D2) suddenly went off-line. Local IT support visually inspected the core switch and the impacted switches and saw that only the LEDs of the corresponding trunk ports were off. All other trunks, switches and the core switch were operating normally. We examined the logs on the core switch (show logging -r) but found no explanatory messages to determine the cause.
As the site is remote and we only had remote hands, we decided to power-cycle the impacted 2530 switches. Luckily this resolved the issue, as the trunks became operational after the forced reboot ;) Afterwards I examined the log (show logging -r) of each impacted 2530 switch, but found no messages that helped me understand what happened.
I checked our monitoring solutions (PRTG and Observium) for warnings, errors and excessive traffic, and did a health, trend and performance review, but found no indication of why five trunks suddenly went off-line.
This really bothers me and I don't like it!
Has anyone experienced this behaviour? Am I missing some configuration that would explain why I see nothing in the logs and monitoring? Is it a software bug? Are there any other switch logs I can review?
Were the involved ports' Link LEDs off (Layer 1), or just their Activity LEDs?
Generally the log reports useful information (a module crash too)...did you also check the involved access switches before rebooting (or, better, do you have a syslog server collecting the logs)?
Can you post the sanitized running configuration(s) of the involved switches? A show tech all would also be of help.
KB.16.02.0010 is pretty old...it's time to update to the latest available on the KB.16.02 branch (KB.16.02.0028)...just to stay safe...and do the same on the access side.
Thank you, I appreciate your time and reaction.
The LEDs of the involved ports were off on both the core and the access switches.
In the core switch the log showed the ports and trunks as off-line.
I was unable to check the access switches, as I was located remotely and the trunk of each access switch was off-line. As the event impacted multiple users and instructing local IT would have taken too much time, we decided to reboot the switches as soon as possible.
Switches are currently not configured to log to a syslog server.
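For future incidents it might help to enable remote logging; on ArubaOS-Switch a minimal setup would look roughly like this sketch (the server address 192.0.2.10 is a placeholder, not from this thread):

```
; send events to a remote syslog server so they survive a reboot
logging 192.0.2.10
; forward all events, including informational ones
logging severity debug
```

With a collector in place, the pre-reboot events that were lost here would have been preserved off-box.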
Based on your response, I understand an event should have been logged on the core and access switches, and that there aren't any other on-device logs I could review?
Point taken and noted; I will prepare and execute the updates ;)
Attached you will find the sanitized running configs of the involved switches.
Please let me know your initial findings.
At first sight...the loop-protect mechanism should be enabled only on edge ports (generally used for host access) and not also on uplink physical/logical ports (single-link or aggregated)...as you did, for example, on the core switch (loop-protect E1,E9-E24,F1,F9-F24,Trk1-Trk23) and on the access switches (loop-protect 1-48,Trk1).
IMHO you should start by fixing that with these:
no loop-protect Trk1-Trk23 (on Core)
no loop-protect Trk1 (on each Access Switch where Trk1 uplinks to Core)
I would set spanning tree to RSTP with a priority of 0 (highest) on the core (spanning-tree priority 0 force-version rstp-operation), and leave the default (8) on the access switches uplinked to the core (spanning-tree priority 8 force-version rstp-operation)...plus:
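Put together, the suggested adjustments amount to something like this sketch (based only on the commands above; verify the trunk IDs against your own topology before applying):

```
; on the core 5406R zl2
no loop-protect Trk1-Trk23
spanning-tree priority 0 force-version rstp-operation

; on each access 2530 whose Trk1 uplinks to the core
no loop-protect Trk1
spanning-tree priority 8 force-version rstp-operation
```

This keeps loop-protect on the edge ports only, while spanning tree handles the uplinks.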
Thanks for your input.
I value your suggestions.
However...first of all I want to understand what could have caused 5 out of 15 trunks to go off-line.
I have compared the configs of the involved switches with those of the switches that kept operating, and they are all identical with regard to loop-protect and spanning tree.
So could the settings you recommend changing be the cause of what happened? If so, can you explain why the topology ran for nearly three years without issues? Furthermore, I am interested in your suggestions on whether the config needs additional commands to improve logging in case a similar issue occurs in the future. The fact that I can't find any errors or warnings from just before it happened really bothers me.
The only odd thing I have found so far is the timestamps in the logs of the involved switches. All events recorded before the trunks went off-line carried very outdated timestamps, while those recorded after the forced reboot were current.
I 11/13/19 09:15:54 04611 job: Job Scheduler enabled
I 11/13/19 09:15:37 04910 ntp: All the NTP server associations reset.
I 11/13/19 09:15:37 04909 ntp: The NTP Stratum was changed from 16 to 4.
I 11/13/19 09:15:37 04908 ntp: The system clock time was changed by 1573632353 sec 932458401 nsec. The new time is Wed Nov 13 09:15:36 2019.
I 01/01/90 01:06:55 00076 ports: port 1 is now on-line
I 01/01/90 01:06:53 00435 ports: port 1 is Blocked by STP
I 01/01/90 01:06:06 00076 ports: port 17 is now on-line
I 01/01/90 01:06:03 00435 ports: port 17 is Blocked by STP
That's because - IIRC - the clock resets on reboot (I recall reading a note that this was later fixed so the clock would cache its last value and preserve it until the switch can re-synchronize with the time source it had before the reboot)...hence the time step from 01/01/90 (no clock reference yet) to 11/13/19 (NTP reachable and time synchronized).
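For reference, the NTP client setup that produces those synchronization messages typically looks like this on ArubaOS-Switch 16.x (a sketch; the server address 192.0.2.123 is a placeholder):

```
; use NTP as the time synchronization protocol
timesync ntp
; poll the configured server(s) directly
ntp unicast
ntp server 192.0.2.123
ntp enable
```

Until the first successful sync after a reboot, log entries are stamped with the 01/01/90 epoch seen above.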
I too haven't found an evident reason...apart from suggesting a few configuration adjustments, there is no evident misconfiguration in the posted configs...at least nothing I can recognize looking at them.
Okay, thank you for the review and the recommended config adjustments.
In your opinion, what could cause the fiber ports of the trunks to go off-line with their LED indicators off, presuming the links were not physically disconnected?
As only 5 of the 16 trunks were affected, I am more inclined to suspect the access switches of having an issue rather than the core.