04-17-2017 05:53 AM - edited 04-17-2017 05:56 AM
Just to iron out a few features I've disable background spectrum-monitoring, set air-time-fairness to default and disabled client-match, but am still having similar problems.
Attached is a pcap with this configuration, where the client successfully negotiate BA, and uses A-PMDU to transfer data, but is only able to push about 40Mbps!
This time I almost don't see any TxRetries, Client_health is 100/100, and bitrate is still 400M. Everything just looks perfect - but it's just slow! So freaking annoying :)
04-17-2017 07:52 AM
I've now been doing some testing on 2.4GHz, and I seem to have the same problem there.
At first everything looked perfect, I connected with a bitrate of 144Mbps, and could transfer 80-90Mbps both up and down, then suddenly half an hour later, I can now only upload about 40Mbps, but can still download 80Mbps.
Confirmed with wireshark, BA is still being used in this case.
04-17-2017 09:33 AM
I just made a very interesting observation.
I've shutdown all clients, rebooted the IAP running the latest code (126.96.36.199), and connected my single test-client. It was running fine for a few minutes, then the problem started and I were only able to upload about 30M, but still download 200M.
But, the interesting part is that every time I execute "show tech", or "show cpu" on the IAP, the throughput increases to normal for a few seconds:
[ 4] 72.01-73.01 sec 3.50 MBytes 29.2 Mbits/sec [ 4] 73.01-74.01 sec 3.25 MBytes 27.2 Mbits/sec [ 4] 74.01-75.01 sec 3.25 MBytes 27.4 Mbits/sec [ 4] 75.01-76.01 sec 3.50 MBytes 29.2 Mbits/sec [ 4] 76.01-77.00 sec 3.25 MBytes 27.5 Mbits/sec [ 4] 77.00-78.01 sec 3.50 MBytes 29.3 Mbits/sec [ 4] 78.01-79.00 sec 3.50 MBytes 29.4 Mbits/sec [ 4] 79.00-80.00 sec 3.25 MBytes 27.3 Mbits/sec [ 4] 80.00-81.00 sec 3.25 MBytes 27.3 Mbits/sec [ 4] 81.00-82.00 sec 3.50 MBytes 29.3 Mbits/sec [ 4] 82.00-83.00 sec 24.0 MBytes 202 Mbits/sec [ 4] 83.00-84.00 sec 26.5 MBytes 222 Mbits/sec [ 4] 84.00-85.01 sec 26.2 MBytes 220 Mbits/sec [ 4] 85.01-86.00 sec 25.6 MBytes 216 Mbits/sec [ 4] 86.00-87.00 sec 27.1 MBytes 228 Mbits/sec [ 4] 87.00-88.01 sec 15.4 MBytes 128 Mbits/sec [ 4] 88.01-89.01 sec 3.50 MBytes 29.3 Mbits/sec [ 4] 89.01-90.00 sec 3.50 MBytes 29.6 Mbits/sec [ 4] 90.00-91.00 sec 3.50 MBytes 29.3 Mbits/sec [ 4] 91.00-92.00 sec 3.25 MBytes 27.3 Mbits/sec [ 4] 92.00-93.00 sec 3.50 MBytes 29.3 Mbits/sec [ 4] 93.00-94.01 sec 3.25 MBytes 27.0 Mbits/sec [ 4] 94.01-95.00 sec 3.50 MBytes 29.6 Mbits/sec [ 4] 95.00-96.00 sec 3.50 MBytes 29.3 Mbits/sec
Above is for a "show tech". If I do a "show cpu" it's only good for 1-2 sec.
If I do something else like a "show clients" or "show log system", I see no difference.
Feels like a CPU / interrupt schduling thing, where a show cpu, somehow releases the queue.
Current load during the upload:
a8:bd:27:c0:44:fc# show cpu details Mem: 172944K used, 309008K free, 0K shrd, 0K buff, 28128K cached Load average: 4.07 3.95 3.81 (Status: S=sleeping R=running, W=waiting) PID USER STATUS RSS PPID %CPU %MEM COMMAND 27590 root SW 0 2 16.4 0.0 kworker/1:2 34 root RW 0 2 7.4 0.0 kworker/0:1 2419 root SW 0 2 1.4 0.0 power_monitor 3443 root S < 12520 3367 0.0 2.5 cli 3451 root S N 5700 3367 0.0 1.1 sapd 3543 root S 3936 3367 0.0 0.8 ble_relay 9934 root S < 3520 9933 0.0 0.7 cli 3519 root S 2496 3367 0.0 0.5 mdns 3469 root S < 2456 3367 0.0 0.5 stm 3498 root S 2264 3367 0.0 0.4 snmpd_sap
04-17-2017 02:22 PM
OMG, I've just concluded that the problem actually starts when I ssh into the IAP!
When I have no SSH CLI sessions open everything works fine, but as soon as I open an ssh session the thoughput drops from 200M to about 80M. If I open a second ssh session it drops to about 40M and the cli process starts using about 50% cpu:
a8:bd:27:c0:44:fc# show cpu details Mem: 174756K used, 307196K free, 0K shrd, 0K buff, 28128K cached Load average: 4.17 2.90 2.60 (Status: S=sleeping R=running, W=waiting) PID USER STATUS RSS PPID %CPU %MEM COMMAND 5973 root R < 3316 5972 48.8 0.6 cli 9 root RW 0 2 2.1 0.0 kworker/1:0 7207 root R < 344 6965 1.0 0.0 top 34 root SW 0 2 1.0 0.0 kworker/0:1 3367 root S 1276 1 0.5 0.2 nanny 2420 root SW 0 2 0.5 0.0 power_monitor 3443 root S < 12504 3367 0.0 2.5 cli 3451 root S N 5708 3367 0.0 1.1 sapd 3543 root S 3936 3367 0.0 0.8 ble_relay 6965 root S < 3304 6964 0.0 0.6 cli
With my IAP running 188.8.131.52-184.108.40.206 I could reproduce the behaviour after a factory reset by login into the WebUI, creating a default SSID for 5GHz, PSK.
Only SSH:ing to the IAP did not trigger the problem on the factory reset IAP, but after I issued:
34:fc:b9:c6:6a:0c# conf t now support CLI commit model, please type "commit apply" for configuration to take effect. 34:fc:b9:c6:6a:0c (config) # loginsession timeout 0 34:fc:b9:c6:6a:0c (config) # exit 34:fc:b9:c6:6a:0c# commit apply committing configuration... configuration committed.
The exact same behavior started.
It just sounds far to easy and I just can't believe I've spent the whole weekend throublshooting this without realizing this relation earlier if there is not more behind it causing it not always to behave like this.
Attaching a show tech from the newly factory reset:ed IAP just after the problem started again.
04-18-2017 01:12 PM
I was informed this *might* be fixed in 220.127.116.11, so I've upgraded one of my IAP-315 to that, and yeah, so far it looks promising...
I can no longer reproduce the "ssh cli" high cpu problem and see no drops in my throughput.
I have had my client enter the "40m" state though where I could only upload 40M, but that time I could see BA/A-MPDU was not beeing used, so that I blame my client NIC drivers. After a client reboot it was back to normal again.
Attached is a show-tech while performing the iperf measurement uploading about 200Mbps.
I'll keep my fingers crossed and give it a few days of testing.
PS: I also tested 18.104.22.168 on an IAP-205 and it suffered from the same "ssh cli" bug. Only difference was that it was even worse :) From 200M -> 2m with a single ssh session opened.
04-18-2017 01:54 PM - edited 04-24-2017 12:09 AM
I can also add that I've tested to disable A-MPDU in the IAP with:
(config) # wlan ssid-profile Test (SSID Profile "Test") # mpdu-agg-disable
and can confirm that the TCP throughput drops from 200M -> 40M with only that change.
So, to conclude, it looks like I've been fighting two bugs at the same time:
1) The client NIC (Intel AC-8260) fails to negotiate A-MPDU after some time when running on driver v22.214.171.124. After upgrading to the latest drivers from Intel (126.96.36.199), the problem is resolved. For some reason my first upgrade attempt were non-successful and caused a lot of retransmissions and even worse A-MPDU negotiations.
2) "SSH CLI - loginsession timeout 0" bug in at least IAP 188.8.131.52-184.108.40.206 causing high CPU load and decreased throughput (200->80M for 1 ssh session). Confirmed by TAC as bug id #156250. When "loginsession timeout 0" is configured it causes high CPU load when SSH is connected. Fixed in 220.127.116.11. Problem observed on IAP-205 and IAP-315.
Work-around: Disable loginsession timeout (which is the default) with:
no loginsession timeout
04-20-2017 12:01 PM
Today I did a new attempt to upgrade the Intel drivers on the client to the latest available (18.104.22.168) and this time it actually seems to have fixed the issue!
Last time I tried to upgrade I ended up in some strange state where the client was never able to negotiate Block Ack (A-MPDU) and there was a lot retransmission etc.
I suspect the new driver never got properly updated (Even though it listed the new version when looking at the NIC properties). This time I first uninstalled the whole Inter PROset package before installing the new one. Perhaps that made the difference.
Anyway, for the last 5 hours everything has worked perfect, so I'm keeping my fingers crossed.