Corruption of TCP connections by malformed acknowledgements
===========================================================

Wietse Venema, IBM T.J. Watson Research Center, Hawthorne, NY, USA.

Summary
=======

Recently, several people reported mail delivery failures because
control characters (for example, ^A^A^H) were inserted into their
SMTP connections, resulting in SMTP protocol errors and non-delivery
of email.

These data corruption problems are not host specific: they are
observed with both Linux and BSD/OS systems, and with mail sent to
and/or received from systems running Postfix, Sendmail and qmail.

Over the weekend of March 18, 2000, several people left tcpdump
running on their machines, in order to record corrupted SMTP
sessions.  This report is based on an analysis of that data.

The problem in a nutshell
=========================

In short, the problem is with "extra" ACK packets generated by some
intermediate system. Under some conditions involving retransmission
and/or out-of-order packet arrival, the intermediate system makes
a copy of a real ACK packet, turns the copied ACK around by swapping
its source and destination fields etc., and sends it off.

The problem happens when, by mistake, TCP option bytes from the
original ACK packet are sent as DATA bytes in the copied ACK packet.
This corrupts the TCP data stream, because the bogus data is sent
in a packet with correct IP and TCP header checksums. The fact that
the next TCP data will overlay the bogus data does not prevent the
bogus data from being passed up to the application.

Note that the ACK with bogus data is sent towards the host that
sent the original ACK with TCP option bytes.  Turning off TCP
options would prevent this corruption from happening. However,
turning off TCP options in the local system would solve only half
the problem.  When a remote system connects to the local system,
and the remote system has TCP options turned on, the connection
could still suffer the type of data corruption discussed here.

On the basis of timing characteristics of the "extra" ACK packets,
it is possible to narrow down one specific problem instance to two
CISCO routers, one running IOS 11.3.11a and one running IOS 12.08.
At least we know the corruption happens at a particular side of
the Atlantic ocean :-(

Addendum
========

Apparently, one instance of this data corruption problem is caused
by a bandwidth management system.  It runs as a bridge, thus does
not show up in traceroute etc. output.  But the timing information
told us where to look, as discussed below.

This bandwidth management system apparently transforms TCP options
into data, including NOP options (code 01). The NOP option was
defined by RFC 793, way back in 1981. There is no excuse for
transforming NOP options into data (Control-A).

So far, the TCP data corruption problem has been observed only when
one of the connection endpoints runs a recent Linux version.

Only recent Linux versions request the use of timestamp options.
These produce the tell-tale pattern "01 01 08 0a..." in TCP packets,
which ends up being regurgitated as ^A^A^H^J data.
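
To make the "01 01 08 0a..." pattern concrete, the following is a
minimal Python sketch (not one of the tools used for this analysis)
that walks the 12 option bytes of the example packet shown later in
this report, and renders the first four bytes in caret notation. The
function names and the option name table are assumptions made for
illustration only.

```python
# Sketch only: decode a TCP option block and show it in caret notation.
OPTION_NAMES = {2: "mss", 3: "wscale", 4: "sack-permitted", 8: "timestamp"}

def parse_tcp_options(raw: bytes):
    """Return a list of (name, value_bytes), one entry per option."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                     # end of option list
            options.append(("eol", b""))
            break
        if kind == 1:                     # NOP: a single byte, no length field
            options.append(("nop", b""))
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]               # length covers kind + length + value
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((OPTION_NAMES.get(kind, f"kind-{kind}"),
                        raw[i + 2:i + length]))
        i += length
    return options

def caret(raw: bytes) -> str:
    """Render control characters as ^X, printable ASCII as itself."""
    out = []
    for b in raw:
        if b < 0x20:
            out.append("^" + chr(b + 0x40))
        elif b < 0x7F:
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02x}")
    return "".join(out)

# The 12 option bytes of the example ACK packet.
opt = bytes.fromhex("0101080a0010ddf52db37b43")
parsed = parse_tcp_options(opt)
ts_value = parsed[2][1]
tsval = int.from_bytes(ts_value[:4], "big")
tsecr = int.from_bytes(ts_value[4:], "big")

print([name for name, _ in parsed], tsval, tsecr)
print(caret(opt[:4]))
```

The decoded values agree with the tcpdump line for that packet
(nop,nop,timestamp 1105397 766737219), and the first four bytes come
out as ^A^A^H^J.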

Example of data corruption
==========================

What follows is a fragment of a corrupted SMTP session, one of
several dozen sessions that were recorded at both endpoints of the
connections. The recordings are available via FTP (see pointers at
the end).

The first figure shows an ACK packet sent by the SMTP server.  The
figure shows one line of tcpdump output (folded for readability),
followed by an annotated version of the packet. The annotation
identifies 20 bytes of IP header fields, 20 bytes of TCP header
fields, and 12 bytes of TCP header options.

    12:28:37.051883 195.52.11.4.25 > 194.25.134.80.1730: . ack 86 win
    32120 <nop,nop,timestamp 1105397 766737219> (DF)

	IP_HDR=20 IP_OPT=0 TCP_HDR=20 TCP_OPT=12 DATA=0 FLAGS=ACK 

	IP_HDR   45  00  00  34  52  2f  40  00  40  06 
		vhl tos len len id  id  off off ttl pro 
	IP_HDR   d1  f2  c3  34  0b  04  c2  19  86  50 
		sum sum src src src src dst dst dst dst 
	TCP_HDR  00  19  06  c2  f5  22  60  dd  f4  ce 
		src src dst dst seq seq seq seq ack ack 
	TCP_HDR  fc  e1  80  10  7d  78  0d  1a  00  00 
		ack ack off flg win win sum sum urp urp 
	TCP_OPT  01  01  08  0a  00  10  dd  f5  2d  b3 
		opt opt opt opt opt opt opt opt opt opt 
	TCP_OPT  7b  43 
		opt opt 
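
As a cross-check of the byte counts in the annotation above, the
following Python sketch derives them from the IP header length and
the TCP data offset fields. It is an illustration only, not the
tcpdumpx tool; the function name layout is made up here. The second
half lowers the data offset by hand to show that this one field is
what separates "options" from "data"; no checksum is recomputed in
that step.

```python
# Real ACK from the figure above: 20 + 20 + 12 = 52 bytes.
REAL_ACK = bytes.fromhex(
    "45000034522f40004006" "d1f2c3340b04c2198650"
    "001906c2f52260ddf4ce" "fce180107d780d1a0000"
    "0101080a0010ddf52db3" "7b43"
)

def layout(pkt: bytes) -> dict:
    """Split an IPv4/TCP packet into the byte counts used in the figure."""
    ihl = (pkt[0] & 0x0F) * 4                    # IP header length, incl. IP options
    total_len = int.from_bytes(pkt[2:4], "big")  # IP total length
    tcp_off = (pkt[ihl + 12] >> 4) * 4           # TCP header length, incl. TCP options
    return {
        "IP_HDR": 20, "IP_OPT": ihl - 20,
        "TCP_HDR": 20, "TCP_OPT": tcp_off - 20,
        "DATA": total_len - ihl - tcp_off,
    }

print(layout(REAL_ACK))

# Illustration only: lower the data offset from 8 words to 5 and leave
# every other byte alone.  The same 12 trailing bytes now count as payload.
mangled = bytearray(REAL_ACK)
mangled[32] = 0x50 | (mangled[32] & 0x0F)
print(layout(bytes(mangled)))
```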

The second figure shows an "extra ACK" packet that was generated
by an intermediate router, not by an end system (it shows up only
in the tcpdump recording of the receiving system).  Note that the
extra packet has the same 0x522f IDENT field in the IP header as
the preceding packet.  The "extra ACK" has the same 12 bytes of
TCP options as the preceding packet.  However, due to an error,
the TCP options are sent as data, so they are read by the application
as ^A^A^H...

    12:28:37.056438 194.25.134.80.1730 > 195.52.11.4.25: . 86:98(12)
    ack 112 win 2920 (DF)

	IP_HDR=20 IP_OPT=0 TCP_HDR=20 TCP_OPT=0 DATA=12 FLAGS=ACK 

	IP_HDR   45  00  00  34  52  2f  40  00  3c  06 
		vhl tos len len id  id  off off ttl pro 
	IP_HDR   d5  f2  c2  19  86  50  c3  34  0b  04 
		sum sum src src src src dst dst dst dst 
	TCP_HDR  06  c2  00  19  f4  ce  fc  e1  f5  22 
		src src dst dst seq seq seq seq ack ack 
	TCP_HDR  60  d5  50  10  0b  68  af  32  00  00 
		ack ack off flg win win sum sum urp urp 
	DATA     01  01  08  0a  00  10  dd  f5  2d  b3 
		 ^A  ^A  ^H  ^J  ^@  ^P  dd  f5  -   b3 
	DATA     7b  43 
		 {   C  
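
The claim that the bogus data travels in a packet with correct IP and
TCP checksums can be checked against the bytes in the figure above.
The following Python sketch applies the standard Internet checksum
(16-bit one's complement sum) to the "extra ACK"; the helper names
are made up for this illustration. A header or segment with a valid
checksum sums to 0xffff when the checksum field itself is included.

```python
# The "extra ACK" from the figure above, 52 bytes.
BOGUS_ACK = bytes.fromhex(
    "45000034522f40003c06" "d5f2c2198650c3340b04"
    "06c20019f4cefce1f522" "60d550100b68af320000"
    "0101080a0010ddf52db3" "7b43"
)

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"                   # pad odd-length input
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def ip_checksum_ok(pkt: bytes) -> bool:
    ihl = (pkt[0] & 0x0F) * 4
    return ones_complement_sum(pkt[:ihl]) == 0xFFFF

def tcp_checksum_ok(pkt: bytes) -> bool:
    ihl = (pkt[0] & 0x0F) * 4
    total_len = int.from_bytes(pkt[2:4], "big")
    segment = pkt[ihl:total_len]
    # Pseudo-header: source, destination, zero, protocol, TCP length.
    pseudo = pkt[12:20] + bytes([0, pkt[9]]) + len(segment).to_bytes(2, "big")
    return ones_complement_sum(pseudo + segment) == 0xFFFF

print("IP checksum ok: ", ip_checksum_ok(BOGUS_ACK))
print("TCP checksum ok:", tcp_checksum_ok(BOGUS_ACK))
```

Both checks succeed for this packet, so the receiving TCP has no
reason to discard the 12 bytes of bogus data.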

Pairwise comparison of client-side and server-side tcpdump recordings
=====================================================================

Gerald Richter, one of the people who noticed this data corruption
problem, was kind enough to engage in a measurement where his system
(domain ecos.de in Europe) repeatedly sent mail to my system (domain
porcupine.org in the USA) over a period of about 24 hours, while
tcpdump was recording the SMTP sessions on both end systems.

    rt-h1.ecos.de ... intermediate router .......... spike.porcupine.org
        <---extra ACK---     ---extra ACK--->

The command used was:

    tcpdump -i xxx -s 2000 -w filename host yyy and port 25

    xxx = name of interface if using a non-default one (example: ppp0)
    yyy = name of remote host

Pairwise comparison of these tcpdump recordings shows that "extra"
ACK packets are generated by an intermediate node, not by the end
systems.
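
The comparison can be sketched as a set difference: any packet that
the receiver saw arriving "from" the peer, but that the peer's own
recording does not contain, must have been injected along the path.
The Python fragment below is an illustration under stated
assumptions, not the procedure actually used: the Pkt record, the
choice of key fields, and the "genuine" sample packet are made up.
The window field is left out of the key on purpose, because this
report notes that window sizes are rewritten in transit.

```python
from typing import NamedTuple

class Pkt(NamedTuple):
    """Hypothetical per-packet record extracted from a tcpdump recording."""
    src: str
    dst: str
    ip_id: int
    seq: int          # relative sequence number
    ack: int          # relative acknowledgement number
    payload_len: int

def extra_packets(seen_at_receiver, seen_at_sender, sender_addr):
    """Packets the receiver saw from sender_addr that the sender never sent."""
    sent = {p for p in seen_at_sender if p.src == sender_addr}
    return [p for p in seen_at_receiver
            if p.src == sender_addr and p not in sent]

CLIENT = "194.25.134.80"
SERVER = "195.52.11.4"

# Made-up genuine data packet, present in both recordings.
genuine = Pkt(CLIENT, SERVER, 0x1A2B, 52, 112, 34)
# The "extra ACK" from the example: present at the server side only.
bogus = Pkt(CLIENT, SERVER, 0x522F, 86, 112, 12)

at_client = [genuine]
at_server = [genuine, bogus]

extras = extra_packets(at_server, at_client, CLIENT)
print(extras)
```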

In particular, recordings made at spike.porcupine.org show 9 "extra"
ACKs being sent to spike.porcupine.org, while recordings made at
rt-h1.ecos.de show 62 "extra" ACKs being sent to rt-h1.ecos.de.
The actual recordings are provided in an appendix below.

Pairwise comparison of the recordings also reveals that an intermediate
router is changing the TCP window size fields of packets in transit.
Whether this transformation is done by the same router that sends
the "extra ACK" packets is unknown. [With more information available,
the answer now is "yes".]

Packet arrival time analysis
============================

As discussed above, some intermediate systems generate an "extra
ACK" by cloning a real ACK packet. They modify the cloned ACK by
swapping source and destination fields etc., then send it off.  In
particular, the "extra ACK" carries the same IDENT field in the IP
header as the real ACK, and carries the same TCP option bytes as
the real ACK. However, by mistake the TCP option bytes are sent as
data.

By measuring the time differences between sending the real ACK and
receiving the cloned ACK we can narrow down the router that is
responsible for the data corruption. By playing games with tools
such as traceroute, ping and mtr (http://www.bitwizard.nl/mtr/) it
is possible to further identify the source of a problem.
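
As a rough sketch of the measurement, the Python fragment below pairs
each outgoing ACK with an incoming packet that carries the same IP
IDENT value, and reports the time difference in microseconds. The
function names and the (timestamp, ident) record layout are
assumptions made for illustration; the timestamps are those of the
two example packets shown earlier. The sketch assumes that both
timestamps fall on the same day and that an IDENT value occurs only
once among the outgoing ACKs.

```python
from datetime import datetime, timedelta

def parse_ts(ts: str) -> datetime:
    """Parse a tcpdump time-of-day stamp such as 12:28:37.051883."""
    return datetime.strptime(ts, "%H:%M:%S.%f")

def clone_delays_us(sent_acks, received):
    """Delay in microseconds from each real ACK to its clone.

    sent_acks and received are lists of (timestamp, ip_ident) tuples,
    taken from a recording made at the host that sent the real ACKs.
    """
    sent_at = {ident: parse_ts(ts) for ts, ident in sent_acks}
    delays = []
    for ts, ident in received:
        if ident in sent_at:
            delta = parse_ts(ts) - sent_at[ident]
            delays.append(delta // timedelta(microseconds=1))
    return sorted(delays)

sent_acks = [("12:28:37.051883", 0x522F)]   # real ACK leaves the host
received = [("12:28:37.056438", 0x522F)]    # cloned ACK comes back

delays = clone_delays_us(sent_acks, received)
print(delays)
```

For the example packets this gives 4555 microseconds, or about
4.6 ms.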

The figure below gives the 12 smallest time differences between
sending the real ACK and receiving the cloned ACK, as observed by
host rt-h1.ecos.de. Each time delay is identified by the initial
TCP sequence number of the corresponding TCP session.

The same sessions were also recorded by a second machine on the
ecos.de network. The "true ack to bogus ack" delays observed by
that second machine are about 0.1 ms smaller, so the distortion
caused by running tcpdump on the sending machine itself is negligible.

    Delay(s)	Initial TCP sequence number
    ===========================================
    0.004441	796371957:796371957
    0.004449	114768670:114768670
    0.004449	2626353781:2626353781
    0.004454	4029706621:4029706621
    0.004496	1142009544:1142009544
    0.004497	3360837077:3360837077
    0.004503	538029698:538029698
    0.004512	2594745479:2594745479
    0.004514	3297579444:3297579444
    0.004514	498610614:498610614
    0.004521	380728081:380728081
    0.004526	306389737:306389737
    [50 more entries omitted]

Thus, the round-trip time from rt-h1.ecos.de to the "nasty" box is
about 4.5 ms. On the basis of mtr and ping information (appendix),
this limits the number of candidate "nasty" intermediate systems
to two routers, router gw4-mz.tap.de with CISCO IOS 11.3.11a and
hs1-0.mz0.nacamar.net (gw1-mz.tap.de) with CISCO IOS 12.08.

As was found out later, the problem was not with either of these
routers.  The bandwidth management system responsible for the
problems sits in-between these two routers. So, the packet timing
measurements proved to be useful after all. They just need to be
used carefully.

Appendix 1: mtr timings from ecos.de to porcupine.org
=====================================================

HOST                                    LOSS  RCVD SENT    BEST     AVG  WORST
gw-nacamar.ecos.de                        0%   100  100    1.38    2.28  74.22
195.52.15.45                              0%   100  100    4.84    5.53  56.51
gw1-mz.tap.de                             0%   100  100    5.07    8.08 125.18
hs6-1.ffm10.nacamar.net                   0%   100  100    6.70   10.44 217.32
fe4-0-0.ffm4.nacamar.net                  0%   100  100    7.28   13.39 128.38
atm3-0.sprintnap.nacamar.net              0%   100  100   88.09  100.31 321.34
sprint-nap.att.net                        0%   100  100   88.09   91.46 207.35
gbr2-p02.n54ny.ip.att.net                 0%   100  100   91.62   93.61 137.66
ar10-a300s1.n54ny.ip.att.net              0%   100  100   92.22   97.67 294.02
12.125.56.194                             0%   100  100   88.97   91.41 210.89
border11.ge2-0-bbnet1.nyc.pnap.net        0%   100  100   88.69   90.70 171.19
cloud9-3.border11.nyc.pnap.net            0%   100  100   90.39   97.65 201.04
ra8.cloud9.net                            0%   100  100   91.14   93.99 137.79
spike.porcupine.org                       1%    99  100  210.85 1101.87 12535.04

Appendix 2: ping timings from ecos.de to the nearest routers
============================================================

195.52.11.1	round-trip min/avg/max = 1.512/1.673/2.416 ms
195.52.15.45	round-trip min/avg/max = 5.142/5.297/5.986 ms
194.162.200.65	round-trip min/avg/max = 5.595/7.151/12.366 ms
194.112.20.241	round-trip min/avg/max = 7.597/35.392/155.918 ms

Appendix 3: mtr timings from porcupine.org to ecos.de
=====================================================

Hostname                                %Loss  Rcv  Snt  Last Best  Avg  Worst
 1. darjeeling-E3-3.wpn.cloud9.net         0%  227  227     3    0    6    227
 2. border11.s7-8.cloud9-3.nyc.pnap.net    0%  227  227     7    1    4     25
 3. core2.fe0-0-fenet1.nyc.pnap.net        0%  227  227     2    2    8    266
 4. POS2-2.GW12.NYC4.ALTER.NET             0%  227  227     9    2    6     40
 5. 505.ATM3-0.XR1.NYC4.ALTER.NET          0%  227  227     2    2    6     52
 6. 189.ATM9-0-0.GW1.NYC4.ALTER.NET        0%  226  226     3    2    7    178
 7. nacamar-gw.customer.ALTER.NET          0%  226  226     8    6   13    239
 8. atm5-0.ffm10.nacamar.net               0%  226  226    86   84   96    335
 9. hs1-0.mz0.nacamar.net                  0%  226  226    89   85   91    264
10. gw4-mz.tap.de                          0%  226  226    89   86   91    118
11. ???
12. rt-h1.ecos.de                          0%  226  226   100   90   95    195

Appendix 4: tcpdump recordings and tools
========================================

The tcpdump recordings of all corrupted SMTP sessions, and the
tools used for their analysis, are made available separately:

    ftp://ftp.porcupine.org/pub/debugging/ack-corruption.tar.gz
    ftp://ftp.porcupine.org/pub/debugging/ack-corruption.tar.gz.sig

This file contains the following information:

1 - An updated copy of this document.

2 - Two directories with tcpdump recordings, one for each host that
    received the bogus ACK packets:

	ack-to-ecos.de		bogus ACK sent to rt-h1.ecos.de
	ack-to-porcupine.org	bogus ACK sent to spike.porcupine.org

    Each ack-to-<location> directory contains files named
    <isn>-<location>, named after the initial TCP sequence number
    of the corresponding connection, and after the location where
    the recording was made.

    Each ack-to-<location> directory contains a file delay-ack-<location>
    listing all observed delays between sending the real ACK and receiving
    the bogus ACK.

    Each tcpdump recording contains tcpdump output that is annotated
    with IP and TCP header field names. The files were produced with

	tcpdump -xnr filename port foo | tcpdumpx

3 - One directory named tools, with source code for tcpdumpx and
    for other tools used for this analysis.
