This morning we found all of our wired ports are no longer working correctly. All but one of our 48 port switches are suddenly showing up as 24 port with all ports as a stack (this isn't true though).
Aruba TAC is saying they will have some ideas tomorrow but we need things working now... Has anyone else seen this or have any ideas of what could be causing this?
What's your case number? Have you already sent them the log files and such?
Case # 1442563
Engineering reference id # 87531
Logs were uploaded to case # 1436804
It looks like there is an issue with the communication between the primary and the other members. Based on the logs, the stack ports look OK.
Looking into the issue...
From the CLI outputs, it doesn't look like the base ports have become stack ports, so that may just be a glitch in the WebUI output, but clearly something has happened to the stack. Have you rebooted the stack? If so, how quickly does it get back into this state? One configuration change I would make with respect to stacking is setting the primary and secondary switch election priorities (member X election-priority 255 under stack-profile), where X is the member ID of the switches that should be primary and secondary.
Also, the logs show a lot of STP-related events (flushing), which could be unrelated, but if they are continual they could be impacting the overall health of the stack. We fixed an STP issue in 184.108.40.206, which is available on the support page. I would also recommend enabling portfast on the access ports.
I rebooted one member, and after the reboot the switch comes back online quickly, but the ports on the switch no longer show a link light (and nothing works on that switch now).
I'm very worried that if we reboot the stack, all of the switches will go into this state, which would leave us completely dead in the water (currently the APs that were on the switches still work, so our students have WiFi).
When you run "show interface brief" normally doesn't that show all of the non-stacking ports in the stack? When I run it I only see it on the switch I'm connected to (the primary master).
I've only worked on these switches for about a month now - so I'm still getting used to them.
Thank you for your help - already you have provided more input than TAC....
Yes, it should be showing all the interfaces in the stack. The log file shows two interesting data points: show inventory is only showing member 2, and show interface is only showing ports on member 2. Essentially that member is looking like an island; however, show stacking members seems to be showing the other members.
When I first read the case, I thought there was no connectivity but I missed the reference to the APs still being active.
Are you consoled to the switch? If this was a healthy stack, you should be able to connect to the console of any switch and then it will re-direct you to the primary. Are you seeing that behavior?
Also, I can make myself available this evening if we could reboot the stack. I do believe the stack will recover, and if not, I can help rebuild it with you and make sure we incorporate some of the best practices I use.
Thank you for that response! Maybe we can talk this afternoon; depending on where things are, I might take you up on this offer.
I'll PM you with my contact info.
Special thank you to Vinod, who worked on the case all day with me! We found that removing the current stack primary, which forced the primary role over to a different switch, and then putting the switch back in resolved the issue. We (well, the smart guys at Aruba while I sit around and watch) are still looking into the root cause for all of this.
I have to say I'm happy with how quickly progress was made once we got the correct engineer on the case, but I'm sad about how much time was wasted leading up to that.
Thank you very much for posting this!
I had the same issue and I disconnected the primary from the stack and reconnected it and it resolved the issue.
I do wish we knew how this happened, because it's bound to happen again.
Could you tell us what version of code you were running as well as share a copy of the configuration?
Here is the version and screenshot of the topology
Aruba Operating System Software.
ArubaOS (MODEL: ArubaS2500-24P-US), Version 220.127.116.11
Website: http://www.arubanetworks.com
Copyright (c) 2002-2013, Aruba Networks, Inc.
Compiled on 2013-06-19 at 07:05:29 PDT (build 38712) by p4build
ROM: System Bootstrap, Version CPBoot 18.104.22.168 (build 36057)
Built: 2012-11-06 23:15:03
Built by: p4build@re_client_36057
Switch uptime is 19 days 20 hours 7 minutes 24 seconds
Reboot Cause: User reboot.
Processor XLS 208 (revision A1) with 1023M bytes of memory.
955M bytes of System flash
Unfortunately there isn't enough to go on based upon the topology or the software release. I had hoped it was an older release but 22.214.171.124 is relatively new. If the issue does occur again, please open a TAC case immediately. We continue to review the data provided by NeumontU but we haven't found a root cause yet.
Attached is the configuration.
Switches #0 and #1 are connected by the stacking cable; switches #2 and #3 are connected by 10Gb LRM fiber.
I provisioned the stack a few weeks ago. I upgraded the firmware to the latest version. I was unable to put switch #2 into the rack until I could get a maintenance window to take an older switch out of the rack. I had switch #2 connected in the closet but not mounted in the rack. Everything was working fine.
I swapped the switches, power cycling switch #2, and when switch #2 came back online none of its ports responded. I checked the web management interface and it showed all the ports as stacking for switch #2.
The fiber connection on the stack showed as down. I swapped the fiber connections of switch #2 and switch #3 to see if maybe I had damaged the fiber in moving switch #2. Both switches' stack interfaces came up, but both switches showed all their ports as stacking ports.
The whole time, the CLI showed the correct topology. Switch #3 was still servicing APs before and after I swapped the fiber. The APs on switches 0, 1, and 3 did not have any issues during the whole event.
I'll pass along the steps and config to Engineering to see if they can replicate it. If something occurs before I get back to you, please open a TAC Case.
I also went over your config and have a few recommendations.
For your access-ports connected to servers or workstations, I recommend enabling portfast. It helps with two things, the first being that the port immediately transitions to forwarding when the port comes up but more importantly, if the port is bouncing up and down, it won't cause the switch to generate any STP topology change notifications which in turn cause MAC addresses to get flushed. Also just to confirm, are the APs in bridging mode hence why all the AP interfaces are configured as trunks?
There are a lot of ports that are configured the same, you could take advantage of interface-groups to simplify the config. A group can have blocks of ports across multiple members.
The stack profile is configured as follows:
stack-profile
   member-id 0 election-priority 200
   member-id 1 election-priority 199
   member-id 2 election-priority 100
   member-id 3 election-priority 100
!
There are two things I see here. The first is that any new member attached to the stack will end up having a higher election-priority than the existing line cards, since the default is 128. I typically recommend leaving line card members at their default of 128 to avoid any unexpected role changes. For example, if the secondary switch were to fail, a line card would be promoted to that role, but if a new switch that is not a replacement for the secondary came in with its default priority (128), it would take over. That isn't detrimental to the stack per se, but again, I just like to reduce any potential role churn.
The second thing is with respect to members 0 and 1. When you have different priorities for the primary and secondary member, this leads to pre-emption in the event of a failure. What I mean is that if member 0 were to fail, member 1 would take over, but if member 0 comes back, it will re-take its role as primary, which again causes churn in the stacking roles. To avoid this, I typically use the same priority value for both, since it really doesn't matter which is primary or secondary. I also set this value to the maximum of 255 to ensure that no misconfiguration can cause a switchover.
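The pre-emption point above can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not Aruba's actual election algorithm: it assumes the highest election-priority wins and that, on a tie, the current primary keeps its role. The `Member` class, `elect_primary`, `primary_after_failover`, and the lowest-member-ID tie-break are all hypothetical names and rules made up for the illustration.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 128  # default election-priority mentioned in the post above


@dataclass(frozen=True)
class Member:
    member_id: int
    priority: int = DEFAULT_PRIORITY


def elect_primary(active, current_primary=None):
    """Toy rule: highest priority wins; on a tie the incumbent keeps the role."""
    top = max(m.priority for m in active)
    candidates = [m for m in active if m.priority == top]
    if current_primary in candidates:
        return current_primary  # equal priority, so no pre-emption
    # Hypothetical tie-break for the illustration: lowest member ID.
    return min(candidates, key=lambda m: m.member_id)


def primary_after_failover(m0, m1):
    """Return the primary's member ID after member 0 fails and later rejoins."""
    primary = elect_primary([m0, m1])            # initial election
    primary = elect_primary([m1], primary)       # member 0 fails
    primary = elect_primary([m0, m1], primary)   # member 0 rejoins
    return primary.member_id


# Priorities from the posted config (200/199): member 0 takes the role back.
print(primary_after_failover(Member(0, 200), Member(1, 199)))
# Equal priorities (255/255): member 1 stays primary after member 0 returns.
print(primary_after_failover(Member(0, 255), Member(1, 255)))
```

In this model the 200/199 configuration produces two role changes for a single failure, while 255/255 produces one, which is the churn the recommendation is trying to avoid.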
I was told I should upgrade to 126.96.36.199, and that the actual bug has been identified. From the sounds of it, if you have this problem I'd try to move to the latest code release (which at this time is 188.8.131.52).
Madani - These switches are at the core and distro layer. Anything that's plugged into it is pretty much on 24/7 so I am not worried about port fast. All the servers are virtual so the ports connect to virtual switches.
The APs are currently tunneled, but I plan on converting them to bridge mode down the road. The 3600 Controller is becoming more of a bottleneck as the density of the wireless network increases every year.
I haven't looked into the group settings yet, but I probably should.
As far as the stack settings, I added the settings for 2 and 3 while troubleshooting the issue. I understand the point about pre-emption, but I really hope that I don't have to worry about the primary failing all the time. I am hoping the only time it ever switches roles in the stack is when I unplug the switch during a maintenance window. I like knowing that 0 should be the primary. If the client calls and says they are experiencing an issue, I can have them read out the displays on the switches over the phone and know there is an issue with the stack when they tell me that 1, 2, or 3 says primary.
I upgraded the stack to 184.108.40.206 just now. I read through the release notes but I did not find any mention of this issue.
Hopefully the firmware takes care of it.
Our development engineers continue to examine the logs that we have from the case NeumontU opened. We are also trying to replicate the issue in house. The recommendation to upgrade to 220.127.116.11 is based upon the fact that we did address some stacking-related issues in that release which could be related to the problem you both encountered. At the very least, upgrading eliminates those issues from the equation, so to speak. Once I have more details from engineering on what we have determined from the logs, I promise to circle back on this thread.
Engineering continues to dig into this issue, and some data in the logs has shifted their focus to the stack link state of the primary member switch. Obviously both your environments are functioning right now (I hope), so this may not give us much data, but could you send me, either here or through private message, the output of the following:
(host) #show stacking interface member all | include "Link flaps"
Here is my output:
(ArubaS2500-48P-US) #show stacking interface member all | include "Link flaps"Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: N/A, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0Speed: N/A, MTU: 9224 bytes, Link flaps: 0
I did learn the other day that the night before we found our network like this our cable installer came in and tested many of our fiber pairs between each floor (which means he unplugged the connection briefly for the certification test).
Thanks for the info. It looks like two of the stack links are down. Is that expected for your topology?
At this time, yes it is - in the coming weeks that will be fixed.
(PTHS-ArubaCore) # show stacking interface member all | include "Link flaps"
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0
Speed: N/A, MTU: 9224 bytes, Link flaps: 0
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 3895
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 0
Speed: N/A, MTU: 9224 bytes, Link flaps: 0
Speed: N/A, MTU: 9224 bytes, Link flaps: 0
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 24
Speed: N/A, MTU: 9224 bytes, Link flaps: 0
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 1
Are these link flap values incrementing?
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 3895
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 24
What type of SFP is this? How long has the stack been up?
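One way to answer the "are they incrementing" question is to capture the filtered output twice and compare the counters. The following is a minimal Python sketch, assuming the output has been saved as text; `link_flaps` and `incrementing` are hypothetical helper names, and because the filtered output carries no interface names, counters can only be matched by their position in the output.

```python
import re

FLAPS_RE = re.compile(r"Link flaps:\s*(\d+)")


def link_flaps(cli_output):
    """Return the 'Link flaps' counters in the order they appear in the output."""
    return [int(n) for n in FLAPS_RE.findall(cli_output)]


def incrementing(before, after):
    """Return (position, delta) for each counter that grew between two snapshots."""
    old, new = link_flaps(before), link_flaps(after)
    return [(i, b - a) for i, (a, b) in enumerate(zip(old, new)) if b > a]


# The first snapshot uses two lines quoted in the thread; the second snapshot
# is made up purely to show what a growing counter would look like.
snapshot_1 = (
    "Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 3895\n"
    "Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 24\n"
)
snapshot_2 = (
    "Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 3901\n"
    "Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 24\n"
)
print(incrementing(snapshot_1, snapshot_2))
```

An empty list means no counter changed between the two captures.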
This is the Connection between 1 and 2. I checked the values again and they are all the same. The stack was rebooted a day ago for the firmware upgrade.
--------
Vendor Name : FINISAR CORP.
Vendor Part Number : FTLX1371D3BCL
Aruba Supported: : YES
Cable Type : 10GBASE-LRM
Connector Type : LC
Wave Length : 1310 nm
Okay, that is a lot of flaps for such a short period of time. It is also interesting to see it only affect one side of the link. Would you mind sending the output of "show stacking interface member all"? I'm interested to see if you're taking errors on that link.
You previously said this: "The switches #0 and #1 are connected by the stacking cable switches #2 and #3 are connected by 10Gb LRM fiber." So #2 is connected to #1 and looking at the WebUI sample you provided, #3 is connected back to #0 correct? Since there is no redundant path for switch #2, have you had any network connectivity complaints today? What can you tell me about the fiber? What type (FDDI/OM1/OM2)? Distances?
stack1/3
--------
stack1/3 is administratively Enabled, link is Up, line protocol is Up
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 3895
Last update of counters: 0d 00:00:02 ago
Last clearing of counters: 1d 16:11:59 ago
Link status last changed: 0d 01:55:59 ago
Statistics:
    Received 8381126 frames, 1988969564 octets
    30 pps, 50.974 Kbps
    466453 broadcasts, 0 runts, 0 giants, 0 throttles
    21426757 error octets, 30492 CRC frames
    67171 multicast, 7847502 unicast
    Transmitted 11807627 frames, 4720863499 octets
    49 pps, 56.207 Kbps
    1011775 broadcasts, 0 throttles
    2013447 multicast, 8782405 unicast
    0 errors octets, 0 deferred
    0 collisions, 0 late collisions
stack1/3
--------
stack1/3 is administratively Enabled, link is Up, line protocol is Up
Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 24
Last update of counters: 0d 00:00:07 ago
Last clearing of counters: 0d 04:39:27 ago
Link status last changed: 0d 01:58:01 ago
Statistics:
    Received 1571003 frames, 900697722 octets
    53 pps, 70.604 Kbps
    122286 broadcasts, 0 runts, 0 giants, 0 throttles
    7768 error octets, 6 CRC frames
    273529 multicast, 1175188 unicast
    Transmitted 1057029 frames, 249259281 octets
    38 pps, 68.444 Kbps
    59138 broadcasts, 0 throttles
    8631 multicast, 989260 unicast
    0 errors octets, 0 deferred
    0 collisions, 0 late collisions
Yeah, you're taking a lot of errors, more so on Member 1. You may want to check that fiber link; it also looks like there was a flap about two hours ago. What can you tell me about the fiber? What type (FDDI/OM1/OM2)? Distances? Was it previously just supporting GE?
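To put the two sides of the link on a comparable footing, the counters in the posted output can be turned into rates. The Python sketch below is only an illustration: `parse_duration_hours` and `link_health` are hypothetical helpers, and the flaps-per-hour figure assumes the flap counter was reset at "Last clearing of counters", which the thread does not confirm.

```python
import re


def parse_duration_hours(text):
    """Convert a counter age such as '1d 16:11:59' to hours."""
    match = re.fullmatch(r"(\d+)d (\d+):(\d+):(\d+)", text)
    days, hours, minutes, seconds = map(int, match.groups())
    return days * 24 + hours + minutes / 60 + seconds / 3600


def link_health(block):
    """Summarise one 'show stacking interface' block as simple rates."""
    flaps = int(re.search(r"Link flaps: (\d+)", block).group(1))
    age = re.search(r"Last clearing of counters: (\S+ \S+) ago", block).group(1)
    rx_frames = int(re.search(r"Received (\d+) frames", block).group(1))
    crc_frames = int(re.search(r"(\d+) CRC frames", block).group(1))
    return {
        "crc_ratio": crc_frames / rx_frames,
        # Assumes flaps are counted since the last counter clear.
        "flaps_per_hour": flaps / parse_duration_hours(age),
    }


# Lines taken from the first block posted in the thread.
first_side = """Speed: 10 Gbps, MTU: 9224 bytes, Link flaps: 3895
Last clearing of counters: 1d 16:11:59 ago
Received 8381126 frames, 1988969564 octets
21426757 error octets, 30492 CRC frames"""

health = link_health(first_side)
print(f"CRC ratio: {health['crc_ratio']:.4%}")
print(f"Flaps per hour: {health['flaps_per_hour']:.1f}")
```

Running the same function on the second block gives a far lower CRC ratio, which matches the observation that the errors are concentrated on one side of the link.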
Correct, there is no redundant link.
We had reports of brief disconnections the day before I did the firmware upgrade. Since the firmware upgrade, I have not heard any more reports. The APs on that switch had rebooted, and the logs matched the time frame when the users reported issues. The APs have not shown a loss of connectivity since the firmware upgrade.
Unfortunately, the fiber is OM1 and we are using mode conditioning patch cables on each side. The run is about 320 ft.
The run from 0 to 3 is just under 400ft and using the same type of media and no issues.
Could be a problem with the SFP. Do you have any spares? At the very least, swap them around to see if the problem follows the SFP to eliminate them from the equation.
I will definitely give that a try. I also want to swap around the mode conditioning cables. I will do this the next time I am onsite, next week.
Could you also issue "tar logs tech-support" and send over the generated logs.tar file. I can send you a file request from our file transfer service via your email. You can private message me your contact details for that.
I tried calling you and left a voicemail. We have a potential fix for this issue in the form of a customer patch. Please private message me your contact details; I will upload the patch for you and reach out.
I ended up switching the stack links between closets from 10 Gbps to 1 Gbps.
After more users started to come back and I saw more traffic from both remote closets, I saw the same link flapping on the other closet.
My overall guess is that the OM1 fiber with the mode conditioning cables was not sufficient to do 10 Gbps with the LRM SFP+. I'm guessing the stacking issue is partly triggered by reforming the stack after a link failure.
Since switching over to the 1 Gbps SFPs, I have not had any issues. I am planning on installing OM3 fiber and trying 10 Gbps again.
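For reference, the two run lengths quoted earlier in the thread (about 320 ft and just under 400 ft) can be converted to meters and compared with a nominal reach. The Python sketch below assumes the 220 m figure commonly cited for 10GBASE-LRM over multimode fiber; that number is an assumption here, not something stated in the thread, and real-world reach also depends on fiber quality, connectors, and patch cords. `reach_margin_m` is a hypothetical helper.

```python
FEET_TO_METERS = 0.3048
# Assumption: nominal 10GBASE-LRM reach commonly cited for multimode fiber.
LRM_NOMINAL_REACH_M = 220


def reach_margin_m(run_feet, reach_m=LRM_NOMINAL_REACH_M):
    """Return how many meters of nominal reach remain for a run given in feet."""
    return reach_m - run_feet * FEET_TO_METERS


# Run lengths as quoted in the thread (approximate).
for label, feet in [("member 1 to 2", 320), ("member 0 to 3", 400)]:
    meters = feet * FEET_TO_METERS
    print(f"{label}: {meters:.1f} m, margin {reach_margin_m(feet):.1f} m")
```

Both runs come out at roughly 98 m and 122 m, so under that assumption distance alone would not rule the links out, which fits the earlier suggestion to also test the SFPs and the mode conditioning cables.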
We run OM3 in our building. It seems like our issue was also triggered by reforming the stack after a link failure/change. We haven't installed the patch supplied yet, but when we do - if I can find some downtime I might failover to a different switch to see if it happens again. But that is hard to make happen...