This is the support site for Andrews & Arnold Ltd, a UK Internet provider. Information on these pages is generally for our customers but may be useful to others, enjoy!

TalkTalk and DSCP and 19 second latency

Revision as of 13:05, 2 February 2022


Updates:
2021-12-07 Page created
2021-12-08 Updated as FireBrick has software update to change/remove DSCP field
2021-12-09 Updated with reply from TalkTalk
2021-12-13 Updated with reply from Aruba
Further updates from TalkTalk expected middle of the week
Further updates from TalkTalk expected in the new year
2022-02-01 Update from TalkTalk (below) - saying they have fixed it, but our pings still show latency...


This page is about an interesting problem that was reported to us by a customer in December 2021. We have written it up so that this page can be found by other people seeing a similar problem.

We've masked the IP address we're pinging on this page, as it doesn't matter which IP you ping. In our tests we were mostly pinging the IP of our LNS, as that is the 'next hop' on the customer's route to the Internet.

TL;DR

For some reason, TalkTalk's kit is reading IP DSCP marks from IPv4 packets inside PPPoE, and then putting them in a funny queuing setup which results in latency quickly increasing to over 19 seconds:

1408 bytes from x.x.x.x: icmp_seq=986 ttl=63 time=19084.589 ms


Thanks

Thank you to the customers (and non-customers) who ran pings over their lines to help identify that this is a TalkTalk-only issue.

In depth:

Home network

The customer's home network is fairly typical, eg:

WiFi devices <-> Aruba AP22 <-> FireBrick 2900 <-> Huawei HG612 (bridge mode) 

There are also a number of devices wired in via a switch and the FireBrick.

The Problem

Our customer moved house in early 2021 and we provided a VDSL line and a FireBrick FB2900 Ethernet router. The VDSL was supplied over TalkTalk back-haul. At the same time the customer installed a set of new Aruba AP22 'Instant-On' access points to cover his new house in Wi-Fi.

The Wi-Fi itself works very well, but soon after the install the customer noticed problems with some 'real time' applications such as WebRTC, Google Stadia and Nest Camera video streaming:

  • Nest video camera video streaming was laggy and lossy
  • With Google Stadia, the client software sometimes gives up completely and says the connection isn't good enough to play
  • Video conferencing applications struggled but turning down the bit rate enough helped

The customer put this down to something odd on their network until they finally decided to investigate further.

The Scope

Although A&A only have a single customer who has reported this, we expect anyone in the UK using a TalkTalk ADSL or VDSL line with an Aruba Instant-On access point to be affected by lag on some streaming applications.

The cause

pcaps revealed that the Aruba access point was marking some traffic with the DSCP value CS6, and when there was enough of this traffic, latency would steadily build up.

Show me

Here are 100 pings. To keep the page short, I've included only every 10th ping, but you get the idea:

ping  -i 0.02 -z 192 -s 1400  -c 100 x.x.x.x
PING x.x.x.x (x.x.x.x): 1400 data bytes
1408 bytes from x.x.x.x: icmp_seq=0 ttl=63 time=10.234 ms
Request timeout for icmp_seq 10
1408 bytes from x.x.x.x: icmp_seq=10 ttl=63 time=37.160 ms
1408 bytes from x.x.x.x: icmp_seq=20 ttl=63 time=278.065 ms
1408 bytes from x.x.x.x: icmp_seq=30 ttl=63 time=386.798 ms
Request timeout for icmp_seq 70
1408 bytes from x.x.x.x: icmp_seq=40 ttl=63 time=793.062 ms
Request timeout for icmp_seq 80
1408 bytes from x.x.x.x: icmp_seq=50 ttl=63 time=836.497 ms
1408 bytes from x.x.x.x: icmp_seq=60 ttl=63 time=1040.796 ms
1408 bytes from x.x.x.x: icmp_seq=70 ttl=63 time=1453.866 ms
1408 bytes from x.x.x.x: icmp_seq=80 ttl=63 time=1461.298 ms
1408 bytes from x.x.x.x: icmp_seq=90 ttl=63 time=1935.554 ms
1408 bytes from x.x.x.x: icmp_seq=99 ttl=63 time=1978.546 ms
100 packets transmitted, 100 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 9.734/929.728/2047.155/628.466 ms

This only goes up to about 2 seconds, but if left running, the latency will peak at, and then stay at, around 19 seconds:

1408 bytes from x.x.x.x: icmp_seq=986 ttl=63 time=19084.589 ms

The latency only applies to traffic tagged with DSCP CS6. All other traffic is unaffected and has normal (good) latency.

Understanding the ping options:

  • -i 0.02 Setting the interval to 1 ping every 20 milliseconds. If the interval is increased (ie fewer pings per second) then the rate of traffic is low enough not to be caught by TalkTalk's traffic shaping policy
  • -z 192 Setting the IP ToS byte to 192 (decimal), which puts DSCP CS6 in its top six bits. This is the macOS/BSD option; Linux ping uses -Q instead
  • -s 1400 Setting the packet size to 1400 bytes - quite large. If we reduce this, eg to 700 bytes, then it takes longer for the latency to rise. At 600 bytes the traffic seems to be low enough not to be caught by TalkTalk's traffic shaping policy
  • -c 100 Just do 100 pings this time
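
As a rough sanity check on those thresholds, the offered load of each ping variant can be worked out from the payload size and the interval. This is only a sketch: it assumes 28 bytes of IPv4 and ICMP header on top of the payload, ignores PPPoE and lower-layer overhead, and the function name is our own.

```python
# Offered load of a ping stream: (payload + headers) * 8 bits / interval.
IP_ICMP_HEADER_BYTES = 28  # 20-byte IPv4 header + 8-byte ICMP header


def offered_load_kbps(payload_bytes: int, interval_s: float) -> float:
    """IP-level bit rate, in kbit/s, of one ping every interval_s seconds."""
    return (payload_bytes + IP_ICMP_HEADER_BYTES) * 8 / interval_s / 1000


# The three payload sizes mentioned above, all at -i 0.02
for size in (1400, 700, 600):
    print(f"-s {size}: {offered_load_kbps(size, 0.02):.1f} kbit/s")
```

On these assumptions the 1400 byte test offers about 571 kbit/s, the 700 byte test about 291 kbit/s and the 600 byte test about 251 kbit/s. That would suggest the affected queue drains at somewhere between roughly 250 and 290 kbit/s, though this is an inference from our pings, not something TalkTalk have told us.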

19 second latency

It's amazing to see 19 second latency - this means that a device on TalkTalk's network is storing packets for this amount of time before passing them on. You would usually expect the packets to be dropped (packet loss) - but packet loss has been very low in our tests.
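
To get a feel for how much is being stored, here is a back-of-envelope estimate using the numbers from the ping output above. It assumes a single first-in-first-out queue draining at a constant rate, an idle round trip of about 10 ms, and 1428 byte IP packets. We don't know how TalkTalk's kit actually queues, so treat the result as an order of magnitude only.

```python
# Simple FIFO model: packets arrive at rate_in and leave at rate_out.
# A packet sent t seconds into the test waits d = t * (rate_in / rate_out - 1),
# which rearranges to rate_out = rate_in * t / (t + d).

PACKET_BYTES = 1400 + 28   # ICMP payload + IPv4/ICMP headers (assumed)
INTERVAL_S = 0.02          # ping -i 0.02
IDLE_RTT_S = 0.010         # assumed round trip with an empty queue

rate_in_bps = PACKET_BYTES * 8 / INTERVAL_S

# From the ping output above: icmp_seq=99 came back after 1978.546 ms.
t_sent = 99 * INTERVAL_S
queue_delay = 1.978546 - IDLE_RTT_S

rate_out_bps = rate_in_bps * t_sent / (t_sent + queue_delay)

# If the same queue later holds packets for ~19 s, estimate how full it is.
plateau_delay = 19.084589 - IDLE_RTT_S
queued_bytes = rate_out_bps * plateau_delay / 8
queued_packets = queued_bytes / PACKET_BYTES

print(f"estimated drain rate: {rate_out_bps / 1000:.0f} kbit/s")
print(f"estimated queue depth: {queued_bytes / 1000:.0f} kB, "
      f"about {queued_packets:.0f} packets")
```

Under this model the queue drains at somewhere under 300 kbit/s and is holding several hundred full-size packets (several hundred kilobytes) when the latency sits at 19 seconds.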

pcaps

We were able to take pcaps on the customer's router (a FB2900) and on the LNS, and from these we could tell in which direction the latency applied. The timestamps show the following:

Ping request:

 Leaves CPE:   12:03:37.438478
 Arrives LNS:  12:03:38.495103

Ping reply:

 Leaves LNS:   12:03:38.495106
 Arrives CPE:  12:03:38.505869

In this example, it takes just over 1 second for the packet to travel from the CPE to our LNS. The reply (from LNS to CPE) is quick, at around 11 ms.
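
The per-direction delays can be worked out directly from those four timestamps. A small sketch, assuming the clocks on the CPE and the LNS are closely synchronised (the helper name is ours):

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two same-day pcap timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


upstream = seconds_between("12:03:37.438478", "12:03:38.495103")    # CPE -> LNS
turnaround = seconds_between("12:03:38.495103", "12:03:38.495106")  # inside the LNS
downstream = seconds_between("12:03:38.495106", "12:03:38.505869")  # LNS -> CPE

print(f"upstream:   {upstream * 1000:.3f} ms")
print(f"turnaround: {turnaround * 1000:.3f} ms")
print(f"downstream: {downstream * 1000:.3f} ms")
```

This gives about 1056.6 ms upstream against about 10.8 ms downstream, so practically all of the round trip time is being added in the CPE to LNS direction.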

What's DSCP and CS6?

DSCP (Differentiated Services Code Point) is a field in the header of IP packets. It is usually left empty, but a value can be set to classify how the packets should be handled by network equipment that supports QoS (Quality of Service). eg, important packets can be classified as high priority in the hope that they will be able to jump any queues on network routers and get to the destination as fast as possible.

CS6 is one of these classifications. It is described as 'Network control' and is one of the highest classifications available. It is usually meant for packets that contain important router information - such as BGP etc. These are packets that an ISP really doesn't want to drop, and really wants to get to the destination as quickly as possible.

RFC791 describes the Network Control flag as having:

  • Low Delay
  • High Throughput
  • High Reliability

It seems that we're seeing two out of three of these being applied in our case!

The Aruba is trying to give real-time traffic the highest priority possible.

Note: The classification below CS6 is described as Telephony - which may have been more appropriate, and in our tests, this traffic is unaffected by TalkTalk's network.

Note: DSCP classifications are only guidelines and different manufacturers do seem to use them differently.
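
The value 192 given to ping is the whole 8-bit ToS byte, not the DSCP itself. The DSCP is the top six bits of that byte and the bottom two bits are used for ECN. A short sketch of the conversion (the function name is ours):

```python
from typing import Tuple

CS6 = 6 << 3  # class selector 6 is DSCP 48 (binary 110000)


def split_tos(tos: int) -> Tuple[int, int]:
    """Split the 8-bit ToS / Traffic Class byte into (DSCP, ECN)."""
    return tos >> 2, tos & 0b11


# Walk the ToS values either side of 192 and mark the ones that are CS6
for tos in range(190, 198):
    dscp, ecn = split_tos(tos)
    marker = "CS6" if dscp == CS6 else ""
    print(f"ToS {tos:3d} (0x{tos:02x}) -> DSCP {dscp:2d}, ECN {ecn} {marker}")
```

ToS values 192, 193, 194 and 195 all carry DSCP 48 and differ only in the ECN bits, so all four count as CS6, while 191 and 196 fall into neighbouring code points.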

Things that have been tried...

...that didn't make a difference

  • Disabling the "QoS" setting on the HG612 in bridge mode (still observe high latency)
  • Reducing the "speed" of the PPPoE connection from the FB2900 to 85% of sync speed, hoping to avoid buffer-bloat anywhere in the me-to-A&A direction (still observe high latency)
  • Using other wireless devices (I can repro the problem with the "live view" of some Nest Cameras and with web-based Stadia on a Chromebook)
  • Dumping packets on the WAN interface of the FB2900 (I've confirmed that the FB2900 itself isn't introducing the extra latency)
  • Swapping the modem from a HG612 to a Technicolor in bridge mode

...that did make a difference

  • Connecting the phone, running Stadia, via wired Ethernet (high latency goes away because the problematic QoS marking has gone, DSCP field = 0)
  • Setting a special feature on AAISP and the FireBrick - 'IP over LCP' - this sends the IP traffic as control frames. (high latency goes away). This IPoLCP is a niche feature and has been used in the past to help diagnose problems in back-haul networks: eg: https://www.revk.uk/2015/02/congestion-case-study.html
  • Changing the DSCP value - only packets marked CS6 (ToS byte values 192 to 195) are affected. With higher or lower values the latency goes away
  • Getting the FireBrick 2900 to set the DSCP field to 0 - this feature was added to the alpha release of FireBrick software on 8th December 2021
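
To illustrate what 'setting the DSCP field to 0' means at the packet level, here is a minimal sketch that clears the DSCP bits in an IPv4 header and recalculates the header checksum. This is not how the FireBrick implements it (we are not showing FireBrick code here); the function names and the example header, which uses documentation addresses, are made up for illustration.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, inverted."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def clear_dscp(header: bytes) -> bytes:
    """Return a copy of an IPv4 header with DSCP set to 0 and checksum fixed."""
    hdr = bytearray(header)
    hdr[1] &= 0b00000011      # keep the 2 ECN bits, zero the 6 DSCP bits
    hdr[10:12] = b"\x00\x00"  # checksum field must be zero while summing
    hdr[10:12] = struct.pack("!H", ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


# Hypothetical 20-byte header: ToS 192 (CS6), ICMP, 192.0.2.1 -> 192.0.2.2
example = struct.pack("!BBHHHBBH4s4s", 0x45, 0xC0, 1428, 0, 0, 64, 1, 0,
                      bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
cleaned = clear_dscp(example)

print(f"ToS before: {example[1]}, after: {cleaned[1]}")
# A header with a correct checksum sums to zero when checked this way
print("checksum valid:", ipv4_checksum(cleaned) == 0)
```

Only the ToS byte and the checksum change. The rest of the packet, including the payload, is untouched.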

Things that were not tried

  • Migrating the line to BT back-haul - this would have fixed the problem for the customer, but would not have fixed the problem in the TalkTalk network or the Aruba access point. Being engineers - we like to fix problems!
  • Changing the DSCP setting on the Aruba Instant-On Access Points - there is no setting, the DSCP field is being set automatically.


Further tests

With A&A having a lively IRC channel, we asked customers to try our ping test to see how widespread the problem was. We found that:

  • All AAISP TalkTalk VDSL lines tested showed latency
  • All AAISP TalkTalk ADSL lines tested showed latency
  • No AAISP BT lines tested showed latency
  • No AAISP Ethernet lines showed latency

Further, we were also able to test on non-AAISP TalkTalk lines:

Another TalkTalk partner, like us:

400 packets transmitted, 387 packets received, 3.2% packet loss
round-trip min/avg/max/stddev = 20.704/4031.208/8321.078/2455.798 ms

A TT business line sold by TT direct:

1000 packets transmitted, 968 received, 3% packet loss, time 24439ms
rtt min/avg/max/mdev = 16.017/9689.727/19303.816/5771.600 ms, pipe 449

Another TalkTalk partner, like us, actually sees 48 seconds!

round-trip min/avg/max/std-dev = 17.830/33807.255/48113.126/15886.310 ms

So, it seems this is a problem within TalkTalk's network, probably affecting all TalkTalk ADSL and VDSL lines in the UK.

Deep Packet Inspection Concerns

TalkTalk provide AAISP the routing of PPP packets from our customer's router (CPE) to our routers (LNS).

Our ping tests strongly suggest that TalkTalk are inspecting the DSCP field in the IP packet within the PPP frame, and then applying that DSCP classification when they pass the PPP frame through their network - applying some sort of queuing rule to it.

One concern that this issue raises is that TalkTalk are inspecting further into the packet than we'd like or need them to. This may well be by mistake (a misconfigured router within TalkTalk's network), but this is something we're keen to understand and get to the bottom of.
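
To show how far into the frame a device has to look to find the DSCP, here is a sketch that pulls it out of a PPPoE session frame. It assumes an untagged Ethernet frame and no PPP protocol field compression. It is purely illustrative: we don't know how TalkTalk's equipment classifies traffic, or in what encapsulation it sees the PPP frames. The function names and the test frames are our own.

```python
import struct
from typing import Optional

ETH_HEADER = 14    # dst MAC, src MAC, EtherType (no VLAN tag assumed)
PPPOE_HEADER = 6   # version/type, code, session id, length
PPP_PROTOCOL = 2   # PPP protocol field (uncompressed)

ETHERTYPE_PPPOE_SESSION = 0x8864
PPP_IPV4 = 0x0021
PPP_LCP = 0xC021


def inner_dscp(frame: bytes) -> Optional[int]:
    """DSCP of the IPv4 packet inside a PPPoE session frame, else None."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_PPPOE_SESSION:
        return None
    (ppp_proto,) = struct.unpack_from("!H", frame, ETH_HEADER + PPPOE_HEADER)
    if ppp_proto != PPP_IPV4:
        return None
    # ToS is the second byte of the IPv4 header, 23 bytes into the frame
    tos = frame[ETH_HEADER + PPPOE_HEADER + PPP_PROTOCOL + 1]
    return tos >> 2


def build_frame(ppp_proto: int) -> bytes:
    """Made-up test frame carrying a zeroed IPv4 header with ToS 192."""
    ip_header = bytes([0x45, 0xC0]) + bytes(18)
    eth = bytes(12) + struct.pack("!H", ETHERTYPE_PPPOE_SESSION)
    pppoe = struct.pack("!BBHH", 0x11, 0x00, 1, PPP_PROTOCOL + len(ip_header))
    return eth + pppoe + struct.pack("!H", ppp_proto) + ip_header


print(inner_dscp(build_frame(PPP_IPV4)))  # IPv4 inside PPP: DSCP is visible
print(inner_dscp(build_frame(PPP_LCP)))   # same bytes sent as an LCP frame
```

A classifier written like this only finds a DSCP when the PPP protocol field says IPv4. That would fit with the latency going away when the same traffic is sent as LCP control frames, although we can't confirm that this is the reason.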

Theories

We and others have come up with a few theories as to what could be happening in TalkTalk:

  • TalkTalk probably want to give special handling to their own CS6-tagged traffic, and probably can't differentiate between their own and customer traffic on some of the devices within their core network.

The fix

There are seemingly two faults here:

  1. Aruba setting the DSCP field - whilst this is trying to be helpful, it is seemingly not configurable, so in this case it's an unhelpful feature. If it used the DSCP value intended for 'voice' then we'd not have this problem. The customer has opened a support query with Aruba regarding this - Aruba's reply is below.
  2. (in our opinion) TalkTalk should not be looking at the DSCP field. AAISP are taking this up with TalkTalk.

Fault raised with TalkTalk

December 7th

A&A got in touch with TalkTalk directly by emailing TalkTalk's escalations department and our Service Manager. (There was no point in reporting an individual line fault via the normal channels for broadband faults.)

December 9th

TalkTalk are still investigating and are hoping to get back to us next week. They have assured us on the point about packet inspection: their policy remains that they are not inspecting traffic in any way, nor do they have the means to do so.


February 1st 2022

Update from TalkTalk (below) - saying they have fixed it, but our pings still show latency...


The core engineering team have investigated and found that the config for the CoS classifier required adjustment - after discussing and planning the steps, a planned engineering work was scheduled, due to our change freeze it took a little longer to put in place, this has now completed and they request if you can kindly re-test and keep us updated.

sudo ping  -i 0.02 -Q 192 -s 1400  -c 200 X.X.X.X
PING X.X.X.X (X.X.X.X) 1400(1428) bytes of data.
1408 bytes from X.X.X.X: icmp_seq=1 ttl=63 time=8.25 ms
...
1408 bytes from X.X.X.X: icmp_seq=200 ttl=63 time=3084 ms
--- X.X.X.X ping statistics ---
200 packets transmitted, 200 received, 0% packet loss, time 675ms
rtt min/avg/max/mdev = 8.001/1466.857/3145.106/924.943 ms, pipe 70


To be continued....

Ticket open with Aruba

Apparently these Aruba Access Points do have the ability to open a CLI on the device and disable DSCP. However, it's not so simple, and Aruba advise against it, because the CLI requires an interactively-generated token from their support staff, and changing the setting back would require another support call.

Current Solution: Customer is currently using the FireBrick feature to set the DSCP field to 0.