Wired Intelligent Edge

  • 1.  Five out of 15 trunks suddenly off-line

    Posted Nov 13, 2019 04:40 PM

    Hi,

     

    Today I experienced an odd situation at one of our sites, and I have no clue why it happened.

     

    At this particular site we use a 5406R zl2 switch as the core, with four modules, running KB.16.02.0010. Each module has eight 10 Gbit ports. We have multiple HP 2530-48G-PoE+-2SFP+ (J9853A) switches running YA.16.02.0010, connected to the core with LACP trunks spread across two modules: Trunk1 consists of SFP ports 49 and 50 of the first 2530, which are connected to A1 and B1 of the 5406, and so on. Each SFP port in this star topology has an HP J9150A 10 Gbit transceiver and vendor-certified cabling.
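    For reference, an LACP trunk as described above would be configured roughly like this on these switches; the port and trunk identifiers are taken from the description, so treat this as a sketch rather than the actual configs:

    ```
    ; On the first 2530 access switch: bundle the two SFP+ uplinks into Trk1
    trunk 49-50 trk1 lacp

    ; On the 5406R core: the matching pair, spread across modules A and B
    trunk A1,B1 trk1 lacp
    ```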

     

    This topology had been running for three years with no hassle until this morning, when trunks 1 (A1-B1), 4 (A4-B4), 5 (A5-B5), 8 (A8-B8) and 10 (C2-D2) suddenly went off-line. Local IT support visually inspected the core switch and the impacted switches and saw that only the LEDs of the corresponding trunk ports were off. All other trunks, switches and the core switch were operating normally. We examined the logs on the core switch (sh logging -r) but found no explanatory messages to determine the cause.

     

    As the site is located remotely and we only had remote hands, we decided to power-cycle the impacted 2530 switches. This luckily resolved our issue, as the trunks became operational after the forced reboot ;) Afterwards I examined the log (sh logging -r) of each impacted 2530 switch but found no messages that helped me understand what happened.

     

    I checked our monitoring solutions (PRTG and Observium) for warnings, errors and excessive traffic, and did a health, trend and performance review, but found no indication of why five trunks suddenly went off-line.

     

    This really bothers me and I don't like it!

    Has anyone experienced this behaviour? Am I missing some configuration that would explain why I see nothing in the logs and monitoring? Is it a software bug? Are there any other switch logs I can review?

     

    Regards,

    Raymond


    #2530


  • 2.  RE: Five out of 15 trunks suddenly off-line

    Posted Nov 13, 2019 07:04 PM

    Pretty strange.

     

    Were the involved ports' Link LEDs off (Layer 1), or just their Activity LEDs?

     

    Generally the log reports useful information (a module crash too)...did you also check the involved access switches before rebooting (or, better, do you have a syslog server to collect logs)?

     

    Can you post sanitized running configuration(s) of the involved switches? A show tech all would also be of help.

     

    KB.16.02.0010 is pretty old...it's time to update to the latest available on the KB.16.02 branch (KB.16.02.0028)...just to stay safe...and do the same on the access side.
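    If it helps, a firmware update over TFTP typically looks along these lines on this platform (the server address and image filename here are examples, not the actual ones):

    ```
    ; Copy the new image into the primary flash area, then boot from it
    copy tftp flash 10.0.0.5 KB_16_02_0028.swi primary
    boot system flash primary
    ```

    A show flash afterwards confirms which images are installed.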



  • 3.  RE: Five out of 15 trunks suddenly off-line

    Posted Nov 15, 2019 03:24 AM

    Thank you, I appreciate your time and reaction.

     

    The LEDs of the involved ports were off on both the core and the access switches.

    In the core switch the log showed the ports and trunks as off-line.

     

    I was unable to check the access switches as I was working remotely and the trunk of each access switch was off-line. As the event impacted multiple users and instructing local IT would have taken too much time, we decided to reboot the switches as soon as possible.

    Switches are currently not configured to log to a syslog server.
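    A minimal syslog configuration on these switches would look something like this (the collector address is just an example):

    ```
    ; Forward log events to a central syslog collector
    logging 10.0.0.50
    logging facility local0
    ```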

     

    Based on your response, I understand an event should have been logged in the logs of the core and access switches, and that there aren't any other on-device logs I could possibly review?

     

    Point taken and noted, will prepare and execute the updates ;)

    Attached you will find sanitized running config of the involved switches.

     

    Please let me know your initial findings.

    Regards,

    Raymond



  • 4.  RE: Five out of 15 trunks suddenly off-line

    Posted Nov 15, 2019 08:31 AM

    At first sight...the loop-protect protection mechanism should be enabled only on edge ports (generally used for host access) and not also on (single-link or aggregated-link) uplink physical/logical ports...as you did, for example, on the core switch (loop-protect E1,E9-E24,F1,F9-F24,Trk1-Trk23) and on the access switches (loop-protect 1-48,Trk1).

     

    IMHO you should start by fixing that with these commands:

     

    no loop-protect Trk1-Trk23 (on the Core)

    no loop-protect Trk1 (on each Access switch where Trk1 uplinks to the Core)

     

    I would set spanning tree to RSTP with a priority of 0 (highest) on the core (spanning-tree priority 0 force-version rstp-operation), and leave the default (8) on the access switches uplinked to the core (spanning-tree priority 8 force-version rstp-operation)...plus:

     

    • protecting the core's (physical or logical) downlinks to the access switches with root guard (spanning-tree ethernet trk1-trk23 root-guard).
    • setting the downlinks from the core's standpoint (and the uplinks on the access side the same way) as point-to-point links with spanning-tree ethernet trk1-trk23 point-to-point-mac true.
    • fixing access/edge ports (on the core switch and/or on the access switches) with the admin-edge property (portfast) using the command spanning-tree ethernet <access-ports-list> admin-edge-port and, consequently (see the point above), spanning-tree ethernet <access-ports-list> point-to-point-mac false on the very same ports.
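    Pulled together, the adjustments above would look something like this (trunk ranges taken from the posted configs; <access-ports-list> left as a placeholder for your edge ports):

    ```
    ; Core 5406R
    no loop-protect Trk1-Trk23
    spanning-tree priority 0 force-version rstp-operation
    spanning-tree ethernet trk1-trk23 root-guard
    spanning-tree ethernet trk1-trk23 point-to-point-mac true

    ; Each access 2530
    no loop-protect Trk1
    spanning-tree priority 8 force-version rstp-operation
    spanning-tree ethernet Trk1 point-to-point-mac true
    spanning-tree ethernet <access-ports-list> admin-edge-port
    spanning-tree ethernet <access-ports-list> point-to-point-mac false
    ```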


  • 5.  RE: Five out of 15 trunks suddenly off-line

    Posted Nov 15, 2019 10:09 AM

    Thanks for your input.

    I value your suggestions.

     

    However...first of all I want to understand what could have caused 5 out of 15 trunks to go off-line.

     

    I have compared the configs of the involved switches with the configs of the switches that kept operating, and they are all the same with regard to loop-protect and spanning-tree.

     

    So could the settings you recommend changing be the cause of what happened? If so, can you explain why the topology ran for nearly three years without issues? Furthermore, I am interested in your suggestions on whether the config needs additional commands to improve logging, in case a similar issue occurs in the future. The fact that I can't find any errors or warnings from just before it happened really bothers me.

     

    The only odd thing I have been able to find so far is the timestamps in the logs of the involved switches. All events recorded before the trunks went off-line carry badly outdated timestamps, while the events after the forced reboot are current.

     

    I 11/13/19 09:15:54 04611 job: Job Scheduler enabled
    I 11/13/19 09:15:37 04910 ntp: All the NTP server associations reset.
    I 11/13/19 09:15:37 04909 ntp: The NTP Stratum was changed from 16 to 4.
    I 11/13/19 09:15:37 04908 ntp: The system clock time was changed by 1573632353 sec 932458401 nsec. The new time is Wed Nov 13 09:15:36 2019.
    I 01/01/90 01:06:55 00076 ports: port 1 is now on-line
    I 01/01/90 01:06:53 00435 ports: port 1 is Blocked by STP
    I 01/01/90 01:06:06 00076 ports: port 17 is now on-line
    I 01/01/90 01:06:03 00435 ports: port 17 is Blocked by STP

     

    Regards,

    Raymond



  • 6.  RE: Five out of 15 trunks suddenly off-line

    Posted Nov 15, 2019 11:10 AM

    That's because - IIRC - the clock resets due to the reboot (I recall reading a note that this was fixed so that the clock would cache the last value and preserve it until the switch is able to re-synchronize to the time source it had before the reboot)...hence the time step from 01/01/90 (no clock reference yet) to 11/13/19 (NTP reachable and time synchronized).
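    As a side note, time synchronization on these switches is typically configured along these lines (the server address is an example); once NTP is reachable after boot, the log timestamps jump from the 01/01/90 epoch to the real date, exactly as in the posted excerpt:

    ```
    ; Use NTP in unicast mode against a reachable time source
    timesync ntp
    ntp unicast
    ntp server 10.0.0.1 iburst
    ntp enable
    ```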

     

    I too haven't found an evident reason...apart from suggesting a few configuration adjustments...there is no evident misconfiguration in the posted configs...at least nothing I can recognize looking at them.



  • 7.  RE: Five out of 15 trunks suddenly off-line

    Posted Nov 19, 2019 04:00 AM

    Okay, thank you for reviewing and for the recommended config adjustments.

     

    In your opinion, what could cause the fiber ports of trunks to go off-line with their LED indicators off, presuming the links were not physically disconnected?

     

    As only 5 of the 15 trunks were affected, I am more inclined to suspect the access switches of having an issue rather than the core.

     

    Regards,

    Raymond