TalkTalk and DSCP and 19 second latency

From AAISP Support Site

Revision as of 20:09, 7 December 2021


This page is about an interesting problem that was reported to us by a customer in December 2021.

TL;DR

For some reason, TalkTalk's kit is reading IP DSCP marks from IPv4 packets inside PPPoE, and then putting them in a funny queuing setup which results in latency quickly increasing to over 19 seconds on the route from the End User to A&A:

1408 bytes from x.x.x.x: icmp_seq=986 ttl=63 time=19084.589 ms

https://www.dropbox.com/s/rm6c5zsmkddxsq0/High%20ping%20when%20QoS%20set.mov?dl=0

In depth:

Home network

The customer's home network is fairly typical, e.g.:

WiFi devices <-> Aruba AP22 <-> FireBrick 2900 <-> Huawei HG612 (bridge mode) 

There are also a number of devices wired in via a switch and the FireBrick.

The Problem

Our customer moved house in early 2021 and we provided a VDSL line and a FireBrick FB2900. The VDSL was supplied over TalkTalk back-haul. At the same time the customer installed a set of new Aruba access points to cover his new house in Wi-Fi.

The Wi-Fi itself works very well, but soon after the install the customer noticed problems with some 'real time' applications such as WebRTC, Google Stadia and Nest Camera video streaming. The problem was latency, which caused delays in the live streaming of video and audio.

The customer put this down to something odd on their network until they finally decided to investigate further.

The cause

pcaps revealed that the Aruba access point was marking some traffic with the DSCP value CS6, and when there was enough of this traffic, latency would steadily build up.
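
As a rough illustration of what to look for in a capture: the DSCP value sits in the top six bits of the second byte of the IPv4 header. The sketch below is not the tooling used in this investigation. The function name and the hand-built sample header (with documentation addresses and a zero checksum) are assumptions made purely for illustration.

```python
import struct

CS6 = 48  # DSCP value for Class Selector 6 (binary 110000)


def dscp_from_ipv4_header(header: bytes) -> int:
    """Return the 6-bit DSCP value from a raw IPv4 header."""
    if len(header) < 20:
        raise ValueError("IPv4 header must be at least 20 bytes")
    if header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    # Byte 1 is the old TOS byte: DSCP in the top 6 bits, ECN in the bottom 2.
    return header[1] >> 2


# Hand-built sample header, illustrative only (checksum left as zero).
# Fields: version/IHL, TOS, total length, id, flags/fragment, TTL,
# protocol (1 = ICMP), checksum, source address, destination address.
sample = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0xC0, 1428, 0, 0, 64, 1, 0,
    bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),
)

print("marked CS6:", dscp_from_ipv4_header(sample) == CS6)
```

A TOS byte of 0xC0 therefore corresponds to CS6, and a TOS byte of 0x00 to unmarked traffic.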

Show me

Here are 100 pings. To keep the page short, I've included only every 10th ping, but you get the idea:

PING 81.187.81.187 (81.187.81.187): 1400 data bytes
1408 bytes from 81.187.81.187: icmp_seq=0 ttl=63 time=10.234 ms
Request timeout for icmp_seq 10
1408 bytes from 81.187.81.187: icmp_seq=10 ttl=63 time=37.160 ms
1408 bytes from 81.187.81.187: icmp_seq=11 ttl=63 time=16.935 ms
1408 bytes from 81.187.81.187: icmp_seq=20 ttl=63 time=278.065 ms
1408 bytes from 81.187.81.187: icmp_seq=30 ttl=63 time=386.798 ms
Request timeout for icmp_seq 70
1408 bytes from 81.187.81.187: icmp_seq=40 ttl=63 time=793.062 ms
Request timeout for icmp_seq 80
1408 bytes from 81.187.81.187: icmp_seq=50 ttl=63 time=836.497 ms
1408 bytes from 81.187.81.187: icmp_seq=60 ttl=63 time=1040.796 ms
1408 bytes from 81.187.81.187: icmp_seq=70 ttl=63 time=1453.866 ms
1408 bytes from 81.187.81.187: icmp_seq=80 ttl=63 time=1461.298 ms
1408 bytes from 81.187.81.187: icmp_seq=90 ttl=63 time=1935.554 ms
1408 bytes from 81.187.81.187: icmp_seq=99 ttl=63 time=1978.546 ms
100 packets transmitted, 100 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 9.734/929.728/2047.155/628.466 ms

This only goes up to just under 2 seconds, but if left, it will peak at 19 seconds:

1408 bytes from 81.187.81.187: icmp_seq=986 ttl=63 time=19084.589 ms

The latency only affects traffic tagged with DSCP CS6. All other traffic is unaffected and has normal (good) latency.
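
The build-up is easier to follow if the reply lines are reduced to sequence number and round-trip time. The following is a minimal sketch, assuming ping output in the format shown above. The regular expression and function name are illustrative, and the sample text is a few lines copied from the output on this page.

```python
import re

# Matches reply lines such as:
# "1408 bytes from 81.187.81.187: icmp_seq=20 ttl=63 time=278.065 ms"
REPLY_RE = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")


def parse_ping_replies(text: str) -> list[tuple[int, float]]:
    """Return (icmp_seq, rtt_ms) pairs for every reply line in ping output."""
    return [(int(seq), float(ms)) for seq, ms in REPLY_RE.findall(text)]


sample_output = """\
1408 bytes from 81.187.81.187: icmp_seq=0 ttl=63 time=10.234 ms
Request timeout for icmp_seq 10
1408 bytes from 81.187.81.187: icmp_seq=50 ttl=63 time=836.497 ms
1408 bytes from 81.187.81.187: icmp_seq=99 ttl=63 time=1978.546 ms
"""

replies = parse_ping_replies(sample_output)
for seq, rtt_ms in replies:
    print(f"seq {seq:3d}: {rtt_ms:9.3f} ms")
```

Lines that are not replies, such as the "Request timeout" messages, are simply skipped.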

pcaps

We were able to take pcaps on the customer's router (an FB2900) and on the LNS, and from these could tell in which direction the latency was being added. The timestamps show the following:

Ping request:

 Leaves CPE:   12:03:37.438478
 Arrives LNS:  12:03:38.495103

Ping reply:

 Leaves LNS:   12:03:38.495106
 Arrives CPE:  12:03:38.505869

In this example, it takes about 1 second for the packet to travel from the CPE to our LNS. The reply (from LNS to CPE) is quick.

What's DSCP and CS6?

DSCP (Differentiated Services Code Point) is a field in the header of IP packets. It is usually left as zero, but a value can be set to classify how the packets should be handled by network equipment that supports QoS (Quality of Service). For example, important packets can be classified as high priority in the hope that they will jump any queues on network routers and get to the destination as fast as possible.

CS6 is one of these classifications. It is described as 'Network control' and is one of the highest classifications available.

In our case, the Aruba is trying to give real-time traffic the highest priority possible.

Note: The classification below CS6 is described as Telephony - which may have been more appropriate, and in our tests, this traffic is unaffected by TalkTalk's network.
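
For reference, here is a small sketch of how these code points map onto the bits on the wire. The numeric values are the standard ones (Class Selector n is n shifted left by 3, and EF is 46). It assumes that the 'Telephony' class mentioned above is EF, as in the usual DiffServ service class guidelines. The dictionary and function names are illustrative.

```python
# Standard DSCP values. Class Selector n is n << 3; EF is 46.
DSCP = {
    "CS0": 0,   # default / unmarked
    "CS5": 40,
    "EF": 46,   # Expedited Forwarding (assumed here to be 'Telephony')
    "CS6": 48,  # 'Network control'
    "CS7": 56,
}


def tos_byte(dscp: int, ecn: int = 0) -> int:
    """Pack a 6-bit DSCP and a 2-bit ECN value into the 8-bit TOS byte."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("DSCP must be 0..63 and ECN must be 0..3")
    return (dscp << 2) | ecn


for name, value in DSCP.items():
    print(f"{name}: dscp={value:2d} bits={value:06b} tos=0x{tos_byte(value):02X}")
```

In a packet capture, CS6 traffic shows a TOS byte of 0xC0 and EF traffic shows 0xB8.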

Things that were tried

...that didn't make a difference

  • Disabling the "QoS" setting on the HG612 in bridge mode (still observe high latency)
  • Reducing the "speed" of the PPPoE connection from the FB2900 to 85% of sync speed, hoping to avoid buffer-bloat anywhere in the me-to-A&A direction (still observe high latency)
  • Using other wireless devices (I can repro the problem with the "live view" of some Nest Cameras and with web-based Stadia on a Chromebook)
  • Dumping packets on the WAN interface of the FB2900 (I've confirmed that the FB2900 itself isn't introducing the extra latency)

...that did make a difference

  • Connecting the phone, running Stadia, via wired Ethernet (high latency goes away because the problematic QoS marking has gone, DSCP field = 0)
  • Setting a special feature on AAISP and the FireBrick - 'IP over LCP' - this sends the IP traffic as control frames. (high latency goes away)

Things that were not tried

  • Migrating the line to BT back-haul - this would have fixed the problem for the customer, but would not have fixed the problem in the TalkTalk network or the Aruba access point. Being engineers - we like to fix problems!

Further tests

With A&A having a lively IRC channel, we asked customers to try our ping test to see how widespread the problem was. We found that:

  • All AAISP TalkTalk VDSL lines tested showed latency
  • All AAISP TalkTalk ADSL lines tested showed latency
  • A TalkTalk business VDSL line provided by TalkTalk direct showed latency
  • A non-AAISP TalkTalk wholesale VDSL showed latency
  • No AAISP BT lines tested showed latency
  • No AAISP Ethernet lines showed latency

So it seems this is a problem within TalkTalk's UK network, probably affecting all TalkTalk ADSL and VDSL lines in the UK.
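
For anyone wanting to generate marked test traffic from a host, the sketch below shows one way to set the DSCP value on outgoing packets. It is not the ping test A&A asked customers to run. It only opens a UDP socket with the IPv4 TOS byte set, and the function names are illustrative. `IP_TOS` is available on Linux and macOS, but some platforms or local policies may ignore or reject it.

```python
import socket


def dscp_to_tos(dscp: int) -> int:
    """Shift a 6-bit DSCP value into the upper bits of the IPv4 TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Create a UDP socket whose outgoing IPv4 packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Ask the IP stack to put this TOS byte on every packet we send.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    except OSError:
        sock.close()
        raise
    return sock
```

Calling `open_marked_udp_socket(48)` gives a socket that marks its traffic as CS6. To measure latency with it, something at the far end has to answer the probes, and a capture on the WAN interface is the best way to confirm that the marking survives the local network.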

The fix

Fault raised with TalkTalk

December 7th

A&A got in touch with TalkTalk directly by emailing TalkTalk's escalations department and our Service Manager. (There was no point in reporting an individual line fault via the normal channels for broadband faults.)