Switch Self Restarts With No Crash Logs

  • 1.  Switch Self Restarts With No Crash Logs

    Posted Feb 22, 2024 06:38 PM

    Hi Everyone!

    I got an alarm from IMC stating that one of my 5412Rzl2 J9851A switches had a cold start. This was odd: I was under the impression that cold starts can only happen after power loss, and the other switch in the same rack/power circuit did not lose power. Our EC verified there were no power events on that rack (they have power monitors). Digging into the logs, I saw this:

    I 02/01/24 02:10:58 03006 system: AM1: Reason for system reboot: Self reset
    M 02/01/24 02:10:58 00064 system: AM1: Health Monitor: Restr Mem Access
    I 02/01/24 02:10:58 00063 system: AM1: System went down: 02/01/24 02:10:57
    I 02/01/24 02:10:58 00061 system: AM1: -----------------------------------------
    I 02/01/24 02:10:55 03803 chassis: xM1: System Self test completed on Master

    I feel like this switch crashed but it does not show any reason why it self reset. I've been poking around in various forums looking for information but I can't find anything. I'm running KB.16.02.0013 and didn't see any major issues relating to this problem. Has anyone else experienced this?
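For anyone who wants to pull the reboot reason out of a saved copy of the event log, here is a minimal Python sketch. The line layout (severity letter, MM/DD/YY date, time, five-digit event ID, subsystem, message) is an assumption inferred only from the five lines quoted above, and the names `parse_log` and `reboot_summary` are made up for illustration; this is not an official parser for the log format.

```python
import re
from datetime import datetime

# Assumed layout, inferred only from the log lines quoted in this thread:
# <severity> <MM/DD/YY> <HH:MM:SS> <event id> <subsystem>: <message>
LINE_RE = re.compile(
    r"^(?P<sev>[A-Z]) (?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<event_id>\d{5}) (?P<subsystem>\w+): (?P<message>.*)$"
)

SAMPLE = """\
I 02/01/24 02:10:58 03006 system: AM1: Reason for system reboot: Self reset
M 02/01/24 02:10:58 00064 system: AM1: Health Monitor: Restr Mem Access
I 02/01/24 02:10:58 00063 system: AM1: System went down: 02/01/24 02:10:57
I 02/01/24 02:10:58 00061 system: AM1: -----------------------------------------
I 02/01/24 02:10:55 03803 chassis: xM1: System Self test completed on Master
"""


def parse_log(text):
    """Return one dict per line that matches the assumed layout."""
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip anything that does not look like a log line
        entry = match.groupdict()
        # Assumption: dates are MM/DD/YY, as the thread dates suggest.
        entry["when"] = datetime.strptime(
            entry["date"] + " " + entry["time"], "%m/%d/%y %H:%M:%S"
        )
        entries.append(entry)
    return entries


def reboot_summary(entries):
    """Return (reboot reason or None, messages whose severity is not 'I')."""
    marker = "Reason for system reboot:"
    reason = None
    non_info = []
    for entry in entries:
        if marker in entry["message"]:
            reason = entry["message"].split(marker, 1)[1].strip()
        if entry["sev"] != "I":
            non_info.append(entry["message"])
    return reason, non_info


if __name__ == "__main__":
    reason, non_info = reboot_summary(parse_log(SAMPLE))
    print("Reboot reason:", reason)
    for message in non_info:
        print("Non-informational entry:", message)
```

On the sample above, the only entry that is not severity `I` is the Health Monitor line, which is the one worth quoting to support.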

  • 2.  RE: Switch Self Restarts With No Crash Logs
    Best Answer

    Posted Feb 23, 2024 02:41 AM

    Hi, I can confirm this switch will crash but then show messages that suggest it was a power cycle. If you have two power supplies then that rules out a very local issue (plug not fitting right) and you've confirmed the rack had power.

    Two things to look at:

    1) show boot. 

    switch12# show boot

    Mgmt Module 1 -- Saved Crash Information (most recent first):
    ID: 49747532
    Active system went down: 02/19/15 11:54:57 K.15.10.0015m 616
    Operator warm reload.

    No Core-dump Files Present.

    Note how long the crash info is saved for: the entry above dates from 2015.

    2) show modules - This has a column showing whether an individual module (line card) has a core dump. I've seen a crash happen often enough when a 'hot swap' module was inserted. I realize nobody was there at the time in your case, but it might give a hint.

    Also, try copying the crash-logs (copy crash-files mm-active tftp...) to see if there is anything in them.

    If you see nothing in all of the above I would put this down to a hardware failure.

  • 3.  RE: Switch Self Restarts With No Crash Logs

    Posted Feb 28, 2024 01:37 PM

    Thank you so much for replying! Here is my show boot: 

    hostname# sh boot
    Mgmt Module 1 -- Saved Crash Information (most recent first):
    ID: e9d62a4e
    Active system went down: 02/01/24 02:10:57 KB.16.02.0013 528
    Health Monitor:  Restr Mem Access
    HW Addr=0x00000000 IP=0xf679c7c Task='mIpPktRecv' Task ID=0x3fb02640
    sp:0x1e0578b8 lr:0xf4e1424
    msr: 0x02029200 xer: 0x00000000 cr: 0x24000400
    Mgmt Module 2 -- Saved Crash Information (most recent first):
    Slot D -- Saved Crash Information (most recent first):
    ID: 85c85f55
    Slot D subsystem went down: 05/28/22 03:38:42 KB.16.02.0013 562
    Software exception in ISR at interrupts_ahs.c:5247
    -> FR Int ERROR_STATUS =0x18000020
    MM:Management Module, IM: Interface Module
    TimeStamp         Type    Core Dump File Name         Build Version
    ----------------- -------- ---------------------------- -----------------
    02-01-24 02:10:23 MM1     M_SG71G4C0KC.cor              KB.16.02.0013
    04-01-91 14:36:10 Slot D  I_SG7297R01N.cor             KB.16.02.0013

    So if I'm reading this output correctly, there was a memory issue which caused Module D to have an issue, which in turn caused the crash? Sh mod indicates there is a core dump on every line card, but that's true for all my modules onsite, and unfortunately, due to security, I can't tftp my files anywhere.

  • 4.  RE: Switch Self Restarts With No Crash Logs

    Posted Feb 28, 2024 03:39 PM

    You had a crash of the main module at 02-01-24 02:10:23.
    In my experience, the only way to know the cause of the crash is to send the file to support. The answer has always been to upgrade to the version in which the bug is fixed, so upgrading to the latest release when available could short-circuit the process.

    The one thing to add is that I also have a crash indicator on every module on every 5406. I think they stay forever, so they will be historical, especially when PoE is involved.

    If the crash happens again, you might need to raise a support call with the info you posted to see if there is a clue. The version and error code are sometimes enough.
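To make the `show boot` output earlier in the thread easier to read, here is a rough Python sketch that splits the saved crash information into one record per crash, grouped by management module or slot. The structure it assumes (a "-- Saved Crash Information" header per unit, an `ID:` line starting each record, then a "went down:" line followed by detail lines) is inferred only from the output pasted in this thread, and `parse_show_boot` is a hypothetical helper name, not a vendor tool.

```python
import re

# Assumed structure, inferred only from the "show boot" output in this thread.
SECTION_RE = re.compile(r"^(?P<unit>.+?) -- Saved Crash Information")
DOWN_RE = re.compile(
    r"went down:\s+(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<version>\S+)"
)

SAMPLE = """\
Mgmt Module 1 -- Saved Crash Information (most recent first):
ID: e9d62a4e
Active system went down: 02/01/24 02:10:57 KB.16.02.0013 528
Health Monitor:  Restr Mem Access
HW Addr=0x00000000 IP=0xf679c7c Task='mIpPktRecv' Task ID=0x3fb02640
sp:0x1e0578b8 lr:0xf4e1424
msr: 0x02029200 xer: 0x00000000 cr: 0x24000400
Mgmt Module 2 -- Saved Crash Information (most recent first):
Slot D -- Saved Crash Information (most recent first):
ID: 85c85f55
Slot D subsystem went down: 05/28/22 03:38:42 KB.16.02.0013 562
Software exception in ISR at interrupts_ahs.c:5247
-> FR Int ERROR_STATUS =0x18000020
"""


def parse_show_boot(text):
    """Return a list of crash records, each tagged with the unit it belongs to."""
    records = []
    unit = None      # current "Mgmt Module N" or "Slot X" section
    current = None   # record currently being filled in
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        section = SECTION_RE.match(line)
        if section:
            unit = section.group("unit")
            current = None
            continue
        if line.startswith("ID:"):
            current = {
                "unit": unit,
                "id": line.split(":", 1)[1].strip(),
                "down": None,
                "version": None,
                "details": [],
            }
            records.append(current)
            continue
        if current is None:
            continue  # text outside any record, e.g. an empty section
        down = DOWN_RE.search(line)
        if down and current["down"] is None:
            current["down"] = down.group("date") + " " + down.group("time")
            current["version"] = down.group("version")
        else:
            current["details"].append(line)
    return records


if __name__ == "__main__":
    for record in parse_show_boot(SAMPLE):
        print(record["unit"], record["id"], record["down"], record["version"])
        for detail in record["details"]:
            print("   ", detail)
```

Run against the pasted output, this gives two separate records: one for Mgmt Module 1 that went down on 02/01/24 and one for Slot D that went down on 05/28/22. The dates suggest the Slot D exception is an older, historical entry rather than part of the February reboot.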