Explicit Congestion Notification (ECN) and Congestion Feedback Using the Network Service Header (NSH) and IPFIX

Independent

2386 Panoramic Circle Apopka Florida 32703 USA +1-508-333-2270 d3e3e3@gmail.com

Independent

UK ietf@bobbriscoe.net http://bobbriscoe.net/

Huawei Technologies

101 Software Avenue Nanjing Jiangsu 210012 China +86-25-56624584 zhuangshunwan@huawei.com

Malis Consulting

USA agmalis@gmail.com

Huawei Technologies

Beiqing Rd. Z-park No.156, Haidian District Beijing 100095 China weixinpeng@huawei.com

Routing SFC Working Group NSH ECN SFC congestion Explicit Congestion Notification (ECN) allows a forwarding element to notify downstream devices of the onset of congestion without having to drop packets. Coupled with a means to feed information about congestion back to upstream nodes, this can improve network efficiency through better congestion control, frequently without packet drops. This document specifies ECN and congestion feedback support within a Service Function Chaining (SFC) enabled domain through use of the Network Service Header (NSH, RFC 8300) and IP Flow Information Export (IPFIX, RFC 7011) protocol.

Introduction Explicit Congestion Notification (ECN) allows a forwarding element to notify downstream nodes of the onset of congestion without having to drop packets. Coupled with a means to feed information about congestion back to upstream nodes, this can improve network efficiency through better congestion control, frequently without packet drops. This document specifies ECN and congestion feedback support within a Service Function Chaining (SFC) enabled domain through use of the Network Service Header (NSH) and IP Flow Information Export (IPFIX) protocol. This document requires that all ingress and egress nodes of the SFC domain, for the flows to which these techniques are applied, implement ECN and that ingress and egress nodes are coordinated in that they implement the ingress and egress procedures specified herein, including IPFIX between the ingress and egress nodes. While congestion management will be most effective if all interior nodes of the SFC enabled domain transited by those flows implement ECN, some benefit is obtained even if some of those nodes do not implement ECN. Congestion at any interior bottleneck where ECN marking is not implemented will be unmanaged. The solution specified in this document is not suitable for portions of a network within which there are paths passing through areas under different administrative control or where the ingress and egress nodes of that network portion are not coordinated. The following subsections provide background information on NSH, ECN, congestion feedback through IPFIX, and terminology used in this document.

NSH Background The Service Function Chaining (SFC ) architecture calls for the encapsulation of traffic within a service function chaining domain with a Network Service Header (NSH ) added by a "Classifier" (ingress node) on entry to the domain with the NSH being removed on egress from the domain at the egress node. The NSH is used to control the path of a packet in the SFC domain.

[Figure: Example SFC Forwarding Nodes Path]

The figure shows an SFC enabled domain for the purpose of illustrating the use of the NSH. Traffic passes through a sequence of Service Function Forwarders (SFFs), each of which sends the traffic to one or more Service Functions (SFs). Each SF performs some operation on the traffic, for example firewalling, Network Address Translation (NAT), or load balancing, and then returns the traffic to the SFF from which it was received. Logically, during the transit of each SFF, the outer transport header that got the packet to the SFF is stripped (see ), and the SFF decides on the next forwarding step, either adding a new outer transport header or, if the SFF is the egress/end, removing the NSH. The outer transport headers added may be different in different regions of the SFC enabled domain. For example, IP could be used for some SFF-to-SFF communication and MPLS used for other SFF-to-SFF communication.

ECN Background Explicit Congestion Notification (ECN) allows a forwarding element (such as a router or a Service Function Forwarder (SFF) or Service Function (SF)) to notify downstream nodes of the onset of congestion without having to drop packets. This can be used as an element in active queue management (AQM) to improve network efficiency through better traffic control without packet drops. The forwarding element can explicitly mark some packets in an ECN field instead of dropping the packet. For example, a two-bit field is available for ECN marking in IP headers.

Tunnel Congestion Feedback Background Tunnels are widely deployed in various networks including data center networks, enterprise networks, and the public Internet. A tunnel consists of ingress, egress, and a set of intermediate nodes including routers. Tunnel Congestion Feedback () is a building block for congestion mitigation methods. It supports feedback of congestion information from an egress node to an ingress node. This document treats paths in the SFC enabled domain as tunnels with the initial Classifier node being the ingress; however, the tunnel congestion feedback facilities specified in this document MAY be used in contexts other than SFC. Any action by a tunnel ingress to reduce congestion needs to allow sufficient time for the end-to-end congestion control loop to respond first, for instance by the ingress taking a smoothed average of the level of congestion signaled by feedback from the tunnel egress or delaying any action for at least the worst case end-to-end round-trip time (for example, 200 milliseconds). Otherwise, the system could become unstable. Examples of actions that can be taken by an ingress node when it has knowledge of downstream congestion include those listed below. Details of implementing these traffic control methods, beyond those given here, are outside the scope of this document.

(1): Traffic throttling (policing), where the downstream traffic flowing out of the ingress node is limited to reduce or eliminate congestion.
(2): Upstream congestion feedback, where the ingress node sends messages indicating congestion upstream to or towards the ultimate traffic source, a function that can throttle traffic generation/transmission.
(3): Traffic re-direction, where the ingress node configures the NSH of some future traffic so that it avoids congested paths. Great care must be taken with this option to avoid (a) significant re-ordering of traffic in flows that it is desirable to keep in order due to end-to-end requirements or due to a stateful SF and (b) oscillation/instability in traffic paths due to alternate congestion of previously idle paths and the idling of previously congested paths. For example, it is preferable to classify traffic into flows of a sufficiently coarse granularity that the flows are long lived and to use a stable path per flow, sending only newly appearing flows on apparently uncongested paths rather than changing the path for any already existing flow.
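
The smoothing mentioned above, where the ingress acts on a smoothed average of the fed-back congestion level rather than on any single report, can be sketched as an exponentially weighted moving average. This is a minimal illustration only: the class name, the weight alpha, and the action threshold are assumptions chosen for the example and are not specified by this document.

```python
from dataclasses import dataclass


@dataclass
class SmoothedCongestion:
    """Exponentially weighted moving average of fed-back congestion levels.

    alpha and threshold are illustrative values, not taken from this document.
    """
    alpha: float = 0.125      # weight given to each new feedback sample
    threshold: float = 0.05   # smoothed level above which the ingress acts
    level: float = 0.0        # current smoothed congestion level

    def update(self, sample: float) -> float:
        """Fold one feedback sample (for example a CE-marked ratio) into the average."""
        self.level = (1.0 - self.alpha) * self.level + self.alpha * sample
        return self.level

    def should_act(self) -> bool:
        """True once the smoothed level, not a single sample, exceeds the threshold."""
        return self.level > self.threshold
```

With these example values, a single report of 0.3 does not trigger action, while a second consecutive report of 0.3 does, which gives the end-to-end control loop a chance to respond first.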

The following figure shows an example path from an original sender to a final receiver passing through a chain of service functions between the ingress and egress of an SFC enabled domain. The path is likely to pass through other network nodes outside the SFC enabled domain (not shown) before entering that domain and after leaving that domain. The figure shows the typical congestion feedback that would be expected from the final receiver to the origin sender, which controls the load the origin sender directs to elements on the path. The figure also shows the congestion feedback from the egress to the ingress of the SFC enabled domain that is described in this document, to control or balance load within that domain.

[Figure: Congestion Feedback across an SFC enabled Domain. The diagram shows the end-to-end congestion feedback from the final receiver back to the origin sender, the tunnel congestion feedback from the SFC domain egress back to the SFC domain ingress, and the NSH carried from ingress to egress through a series of SFFs and their SFs over outer transport headers OT1 through OTn.]

SFC enabled domain congestion feedback in the figure is shown within the context of an end-to-end congestion feedback loop. Also shown is the encapsulated layering of NSH headers within a series of outer transport headers (OT1, OT2, ... OTn). The figure is simplified as there might be multiple egress nodes and some of them may be final receivers for particular packets. (See .)

Conventions Used in This Document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here. Acronyms:

AQM - Active Queue Management
CE - Congestion Experienced
DDoS - Distributed Denial of Service
downstream - The direction from ingress to egress
ECN - Explicit Congestion Notification
ECT - ECN-Capable Transport
IPFIX - IP Flow Information Export
Not-ECT - Not ECN-Capable Transport
NSH - Network Service Header
SF - Service Function
SFC - Service Function Chaining
SFF - Service Function Forwarder, a type of node that forwards based on the NSH
SPI - Service Path Identifier
TLV - Type Length Value
upstream - The direction from egress to ingress

The NSH ECN Field The NSH is used to encapsulate traffic and control its subsequent path (see Section 2 of ). The NSH also provides for optional metadata inclusion, as shown in .

[Figure: Data Encapsulation with the NSH]

This document assigns two currently unused bits (indicated by "U") in the NSH Base Header (Section 2.2 of ) for the purpose of ECN indication as shown in the Updated NSH Base Header figure.

[Figure: Updated NSH Base Header]

RFC Editor NOTE: The above figure should be adjusted based on the bits actually assigned by IANA (see ) and this note deleted.

The following table shows the meaning of the code points in the NSH ECN field. These have the same meaning as the ECN field code points in the IPv4 or IPv6 header as defined in Section 23.1 of .

ECN Field Code Points

Binary	Name	Meaning
00	Not-ECT	Not ECN-Capable Transport
01	ECT(1)	ECN-Capable Transport
10	ECT(0)	ECN-Capable Transport
11	CE	Congestion Experienced
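
The code points in the table above can be represented as a small enumeration. This is an illustrative sketch: the names Ecn, is_ect, and parse_ecn are assumptions for the example, and the position of the two bits within the NSH Base Header is deliberately not modeled because it depends on the IANA assignment.

```python
from enum import IntEnum


class Ecn(IntEnum):
    """Two-bit ECN code points, as listed in the table above."""
    NOT_ECT = 0b00  # Not ECN-Capable Transport
    ECT_1 = 0b01    # ECN-Capable Transport, ECT(1)
    ECT_0 = 0b10    # ECN-Capable Transport, ECT(0)
    CE = 0b11       # Congestion Experienced


def is_ect(code_point: Ecn) -> bool:
    """True for either ECN-capable code point, ECT(0) or ECT(1)."""
    return code_point in (Ecn.ECT_0, Ecn.ECT_1)


def parse_ecn(two_bits: int) -> Ecn:
    """Interpret the low two bits of an integer as an ECN code point."""
    return Ecn(two_bits & 0b11)
```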

ECN Support in the NSH This section describes the required behavior to support ECN using the NSH. There are two aspects to ECN support:

ECN propagation during ingress or egress;
ECN marking during congestion at bottlenecks.

While this section covers all combinations of ECN-aware and ECN-unaware, it is expected that in most cases the NSH domain will be uniform so that, if this document is applicable, all SFFs will support ECN; however, some SFs might not support ECN.

ECN Propagation: The specification of ECN tunneling explains that an ingress must not propagate ECN support into an encapsulating header unless the egress supports correct onward propagation of the ECN field during decapsulation. We define Compliant ECN Decapsulation here as decapsulation compliant with either or an earlier compatible equivalent (, or the full functionality mode of ). The procedures in ensure that each ingress of the transport links within the SFC enabled domain does not propagate ECN support into the encapsulating outer transport header unless the corresponding egress of that link supports Compliant ECN Decapsulation. requires that all the egress nodes of the SFC enabled domain that continue to propagate a packet support Compliant ECN Decapsulation in conjunction with tunnel congestion feedback; otherwise the scheme in this document will not work. (An SFC domain may have nodes that terminate packets and thus are logically "egress" nodes but for which further propagation of ECN is meaningless.)

ECN Marking: At transit nodes the marking behavior specified in is recommended and, if it is not implemented at such transit nodes, there may be unmanaged congestion. Detection of congestion will be most effective if ECN marking is supported by all potential bottlenecks inside the domain in which NSH is being used to route traffic as well as at the ingress and egress. Nodes that do not support ECN marking, or that support AQM but not ECN, will naturally use drop to relieve congestion. The gap in the end-to-end packet sequence will be detected as congestion by the final receiving endpoint, but not by the NSH egress (see ).

At The Ingress When the ingress/Classifier encapsulates an incoming packet with an NSH, it MUST set the NSH ECN field using the "Normal mode" specified in (e.g., copied from the incoming IP header). Then, if the resulting NSH ECN field is Not-ECT, the ingress SHOULD set it to ECT(0). This indicates that, even though the end-to-end transport is not ECN-capable, the egress and ingress of the SFC enabled domain are acting as an ECN-capable transport. This approach supports and is interoperable with all known variants of ECN, including the experimental L4S capability . This "faked ECT" marking at the ingress is necessary for ECN to measure congestion within the SFC domain. It only affects marking within the SFC domain and is undone for packets that pass through an SFC domain egress. Packets arriving at the ingress might not use IP. If the protocol of arriving packets supports an ECN field similar to IP, for example MPLS , the procedures for IP packets can be used. If arriving packets do not support an ECN field similar to IP, they MUST be treated as if they are Not-ECT IP packets. Then, as the NSH encapsulated packet is further encapsulated with a transport header, if ECN marking is available for that transport (as it is for IP and MPLS ), the ECN field of the transport header MUST be set using the "Normal mode" specified in (i.e., copied from the NSH ECN field). A summary of these normative steps is given in . Setting of ECN fields by an Ingress/Classifier

Incoming Header (also equal to departing Inner Header)	Departing NSH and Outer Headers
Not-ECT	ECT(0)
ECT(0)	ECT(0)
ECT(1)	ECT(1)
CE	CE

The requirements in this section apply to all ingress nodes for the domain in which an NSH is being used to steer traffic.
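
The ingress behavior summarized in the table above can be sketched as a single function. This is a minimal illustration under stated assumptions: the names Ecn and ingress_nsh_ecn are hypothetical, an arriving protocol without an ECN field is represented by None, and the "faked ECT" step (a SHOULD in the text) is always applied.

```python
from enum import IntEnum
from typing import Optional


class Ecn(IntEnum):
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def ingress_nsh_ecn(incoming: Optional[Ecn]) -> Ecn:
    """ECN value the ingress writes into the NSH and copies to the outer header.

    incoming is the ECN field of the arriving packet, or None when the
    arriving protocol has no ECN field and is treated as Not-ECT.
    """
    ecn = Ecn.NOT_ECT if incoming is None else incoming  # Normal mode: copy
    if ecn == Ecn.NOT_ECT:
        ecn = Ecn.ECT_0  # "faked ECT" so the domain can mark instead of drop
    return ecn
```

The inner header is left untouched by this step, which is what later allows the egress to undo the faked ECT marking.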

At Transit Nodes This section describes the behavior at nodes that forward based on the NSH such as SFF and other forwarding nodes such as IP routers. shows a packet on the wire between forwarding nodes.

Packet in Transit There can be nodes implementing firewall, DDoS, or similar functions that conditionally discard packets. When they do discard a packet, they are an egress node (see ), not a transit node.

At NSH Transit Nodes When a packet is received at an NSH based forwarding node such as an SFF, say N1, the outer transport encapsulation is removed and its ECN marking SHOULD be combined into the NSH ECN marking as specified in . If this is not done, any indication of congestion encountered at non-NSH transit nodes between N1 and the previous upstream NSH based forwarding node will be lost and not transmitted downstream. The NSH forwarding node SHOULD use a recognized AQM algorithm to detect congestion. If the NSH ECN field indicates ECT, the node will probabilistically set the NSH ECN field to the Congestion Experienced (CE) value or, in cases of extreme congestion, drop the packet. When the NSH encapsulated packet is further encapsulated for transmission to the next SFF or SF, ECN marking behavior depends on whether or not the node that will decapsulate the outer header supports Compliant ECN Decapsulation (see ). If it does, then the encapsulating node propagates the NSH ECN field to this outer encapsulation using the "Normal Mode" of ECN encapsulation (the ECN field is copied). If it does not, then the encapsulating node MUST clear ECN in the outer encapsulation to Not-ECT (the "Compatibility Mode" of ).
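
The marking and re-encapsulation steps at an NSH transit node can be sketched as follows. This is an illustration, not a definitive implementation: the function names are hypothetical, mark_probability stands in for the output of whatever recognized AQM algorithm is in use, dropping a Not-ECT packet when the AQM selects it is an assumption based on ordinary AQM behavior, and the "extreme congestion" drop of ECT packets is not modeled.

```python
import random
from enum import IntEnum
from typing import Optional


class Ecn(IntEnum):
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def aqm_mark(nsh_ecn: Ecn, mark_probability: float,
             rng: random.Random) -> Optional[Ecn]:
    """Apply a simplistic AQM decision; None means the packet is dropped.

    mark_probability is a placeholder for a real AQM algorithm's output.
    """
    if rng.random() >= mark_probability:
        return nsh_ecn          # this packet carries no new congestion signal
    if nsh_ecn == Ecn.NOT_ECT:
        return None             # cannot be marked, so congestion is signaled by drop
    return Ecn.CE               # ECT(0), ECT(1), and CE all leave as CE


def outer_ecn(nsh_ecn: Ecn, decapsulator_is_compliant: bool) -> Ecn:
    """ECN field for the new outer transport header."""
    if decapsulator_is_compliant:
        return nsh_ecn          # Normal Mode: copy the NSH ECN field
    return Ecn.NOT_ECT          # Compatibility Mode: clear to Not-ECT
```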

At an SF/Proxy If the SF is NSH-aware and ECN-aware, the processing is essentially the same at the SF as at an SFF as discussed in (except in the case where the SF terminates the packet's path). If the SF is NSH-aware but ECN-unaware, then the SFF transmitting the packet to the SF will use Compatibility Mode. Congestion encountered in the SFF to SF and SF to SFF paths or internal to the SF will be unmanaged. If the SF is not NSH-aware, then an NSH proxy will be between the SFF and the SF to avoid exposure of the NSH-ignorant SF to NSHs as shown in . This is described in Section 4.6 of . The SF and proxy together look to the SFF like an NSH-aware SF. The behavior at the proxy and SF in this case is as follows. If such a proxy is not ECN-aware, then congestion in the entire path from SFF to proxy to SF back to proxy to SFF will be unmanaged.

[Figure: Proxy for NSH Un-aware SF. The diagram shows an SFF (Service Function Forwarder) connected in both directions to an NSH-aware proxy, which is in turn connected in both directions to an NSH-unaware SF (Service Function).]

If the proxy is ECN-aware, the proxy uses an AQM to indicate congestion within the proxy in the NSH that it returns to the SFF. The outer header used for the proxy-to-SF path uses Normal Mode. The outer header used for the proxy-to-SFF path uses Normal Mode based copying of the NSH ECN field to the outer header. Thus congestion in the proxy will be managed. Congestion in the SF will be managed only if the SF is ECN-aware and implements an AQM.

At Other Forwarding Nodes Other forwarding nodes, that is non-NSH forwarding nodes between NSH forwarding nodes, such as IP or label switched routers, bridges, or other devices, might also contain potential bottlenecks. If so, they SHOULD implement an AQM algorithm to update the ECN marking in the outer transport header as specified in .

At Egress/End At an SFC enabled domain egress node, first any actions are taken based on Congestion Experienced or other values of ECN marking, such as accumulating statistics to send back to the ingress (see ) or for other uses. There can be nodes implementing firewall, DDoS, or similar functions that then discard the packet. If the packet is so discarded, no further actions are needed. If the packet is to be propagated and is carried inside the NSH as encapsulated IP, then when the NSH is removed the NSH ECN field MUST be combined with the IP ECN field as specified in that was extracted from Section 3.2 of . This requirement applies to all egress nodes for the domain in which an NSH is being used to route traffic. Egress ECN Fields Merger (Source )

Arriving Inner Header	Outer Not-ECT	Outer ECT(0)	Outer ECT(1)	Outer CE
Not-ECT	Not-ECT	Not-ECT	Not-ECT	<drop>
ECT(0)	ECT(0)	ECT(0)	ECT(1)	CE
ECT(1)	ECT(1)	ECT(1)	ECT(1)	CE
CE	CE	CE	CE	CE

All the egress nodes of the SFC enabled domain that can propagate NSH encapsulated packets MUST support Compliant ECN Decapsulation as specified in this section. If this is not the case, the scheme described in this document will not work.

Congestion Statistics and More Complex Cases The SFC specification permits an SF to absorb packets and to generate new packets as well as simply processing and returning the packets it receives to an SFF. Such actions might appear to be packet loss due to congestion or might mask the loss of packets by generating additional packets. The closer a particular application of SFC is to a simple tunnel with a single ingress and egress, the simpler it is to accurately use the techniques in this document. Where there is a single ingress but multiple egress nodes (where a node that discards a packet counts as an egress) these techniques can still work well if all egress nodes feed back congestion information to that ingress. Multiple ingress nodes are a substantial complication, but similar techniques may still work in some cases if multiple physical ingress nodes can coordinate to act as one logical ingress node; methods for such coordination are beyond the scope of this document. Use of the techniques in this document for a flow with multiple egress and uncoordinated ingress nodes is NOT RECOMMENDED, although there might be some cases where these techniques could be elements in some sort of beneficial scheme; such schemes are beyond the scope of this document. The tunnel congestion feedback approach () can detect congestion in several ways. One way detects traffic loss by counting payload packets and bytes in at the ingress and counting them out at the egress. This does not work unless nodes conserve the number of payload packets and/or bytes. Therefore, it will not be possible to accurately detect packet loss using this technique if traffic volume, as measured by the metric in use (packets or bytes), is not conserved by the service function chain processing that traffic.
Nonetheless, if a bottleneck supports ECN marking, it will be possible to detect the high level of CE markings that are associated with congestion at that bottleneck by looking at the ratio of CE-marked to non-CE-marked packets. However, it will not be possible to detect any congestion based on ECN marking, whether slight or severe, if it occurs at a bottleneck that does not support ECN marking.
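
The count-in/count-out loss detection described above, and its dependence on conservation of traffic volume, can be sketched as follows. The function name is hypothetical, and returning None when more traffic is counted out than in is an illustrative choice for flagging a chain that evidently does not conserve the metric in use.

```python
from typing import Optional


def loss_ratio(counted_in: int, counted_out: int) -> Optional[float]:
    """Fraction of traffic counted in at the ingress but not out at the egress.

    Both arguments use the same metric (packets or bytes) over the same span
    of traffic. None is returned when no estimate can be made: nothing was
    sent, or more was counted out than in, which means some SF generated
    traffic and the metric is not conserved.
    """
    if counted_in <= 0 or counted_out > counted_in:
        return None
    return (counted_in - counted_out) / counted_in
```

A chain in which an SF both absorbs and generates packets can still produce a plausible-looking ratio, so this check catches only the obvious cases of non-conservation.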

Tunnel Congestion Feedback Support The collection and storage of congestion information at an egress can be useful for later analysis and MAY be used without the feedback mechanisms specified in this Section. However, if congestion information is not fed back to a point which can act to reduce congestion, it will not be useful in real time. Such congestion feedback to the ingress enables the ingress to take actions such as those listed in . IP Flow Information Export (IPFIX ) provides a standard for communicating traffic flow statistics. As extended by this document, IPFIX messages from the egress to the ingress are used to communicate the extent of congestion between an ingress and egress based on ECN marking in the NSH and traffic statistics. Each egress MUST be able to identify the relevant ingress for a packet based on information in the packet such as the SPI or the Ingress Network Node Information Context Header .

Congestion Level Measurements The congestion level measurements are based on ECN marking in the NSH and packet drop detection. In particular, congestion information includes at least one of the following:

cumulative byte counts of packets with each type of outer/inner header ECN marking combination,
the ratio of CE-marked packets to all packets, and
the ratio of dropped packets to all packets.

All IPFIX messages are time stamped. So, for example, it is possible to compute rates of packets, or of packets with various ECN markings, from two IPFIX messages that have cumulative counts and time stamps. An earlier count and time can be subtracted from a later count and time to give the time interval and the count during that interval. If the congestion level is low enough, the packets are marked as CE instead of being dropped, and then the congestion level can be calculated according to the ratio of CE-marked packets. If the congestion level is so high that ECT packets will be dropped, then the packet loss ratio can be calculated by comparing total packets entering ingress and total packets arriving at egress over the same span of packets. Note that a node that discards packets for firewall, DDoS, or similar reasons counts as an egress. If packet loss, other than such deliberate discard, is detected, then it can be assumed that severe congestion has occurred. Faked ECN-Capable Transport (ECT) is used at the ingress to defer packet loss to the egress. The basic idea of faked ECT is that, when encapsulating packets, the ingress first marks the tunnel outer header according to , and then re-marks the outer header of Not-ECT packets as ECT. (ECT(0) and ECT(1) are treated as the same.) In this case, the NSH is treated as the tunnel outer header because it will be present for the entire SFC enabled domain transit while transport headers may change. Thus, as transmitted by the ingress node, there will be one of three combinations of outer header ECN field and inner header ECN field as follows: CE|CE, ECT|N-ECT, and ECT|ECT (in the format of outer-ECN|inner-ECN); when decapsulating packets at the egress, defined decapsulation behavior is used, and according to , the packets marked as CE|N-ECT will be dropped. Faked-ECT is used to shift some drops to the egress in order to allow the egress to calculate the CE-marked packet counts and ratio more precisely.
The ingress encapsulates packets and marks their outer header according to faked ECT as described above. The ingress cumulatively counts packet bytes for three types of ECN combination (CE|CE, ECT|N-ECT, and ECT|ECT) and then the ingress regularly sends a message with the cumulative byte counts of each type of ECN combination to the egress. When each message arrives at the egress, the following two steps occur: (1) the egress calculates the ratio of CE-marked packets; (2) the egress cumulatively counts packet bytes coming from the ingress and adds its own byte counts of each type of ECN combination (CE|CE, ECT|N-ECT, CE|N-ECT, CE|ECT, and ECT|ECT) to the message for the ingress to calculate packet loss. The egress feeds back the CE-marked packet ratio, packet loss ratio, byte counts information, and the like to the ingress as requested for evaluating congestion level in the tunnel. The egress calculates the CE-marked packet ratio by counting packets with different ECN markings. The CE-marked packet ratio can be used as an indication of tunnel load level. For example, the tunnelEcnCEMarkedRatio field (specified below) indicates the fraction of traffic that has been marked in the ECN field of the NSH as Congestion Experienced (CE). It is assumed that nodes between the ingress and egress will not drop packets biased towards certain ECN codepoints, so the calculation of the CE-marked packet ratio is not affected by packet drop. The fraction of packets dropped is calculated by comparing the traffic volumes between ingress and egress. In the case of multiple egresses, the ingress can combine their reports. Statistics of number of packets or bytes can simply be added. Statistics of percentage or ratio of particular ECN marking can be averaged with reports from different egresses weighted by the number of packets processed by that egress.
The statistics can be at the granularity of all traffic from the ingress to the egress to learn about the overall congestion status of the path between the ingress and the egress or at the granularity of individual customer's traffic or a specific set of flows to learn about their congestion contribution.
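
The arithmetic described above (differencing two time-stamped cumulative reports, and combining ratios from several egresses by a traffic-weighted average) can be sketched as follows. The Snapshot type and all field and function names are illustrative assumptions, and the CE-marked ratio is taken here as CE-marked traffic divided by all traffic in the interval.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Cumulative counters from one time-stamped report (names are illustrative)."""
    time_s: float     # export time of the message, in seconds
    ce_bytes: int     # cumulative bytes that arrived CE-marked in the NSH
    total_bytes: int  # cumulative bytes of all arriving packets


def interval_rate(earlier: Snapshot, later: Snapshot) -> float:
    """Average byte rate between two reports, in bytes per second."""
    dt = later.time_s - earlier.time_s
    if dt <= 0:
        raise ValueError("reports must be in increasing time order")
    return (later.total_bytes - earlier.total_bytes) / dt


def interval_ce_ratio(earlier: Snapshot, later: Snapshot) -> float:
    """CE-marked fraction of the traffic seen between two reports."""
    total = later.total_bytes - earlier.total_bytes
    if total <= 0:
        return 0.0
    return (later.ce_bytes - earlier.ce_bytes) / total


def combine_ratios(reports: Iterable[Tuple[float, int]]) -> float:
    """Weighted average of (ratio, packets processed) pairs from several egresses."""
    pairs = list(reports)
    weight = sum(packets for _, packets in pairs)
    if weight <= 0:
        return 0.0
    return sum(ratio * packets for ratio, packets in pairs) / weight
```

Because the counters are cumulative, a lost report only lengthens the interval over which the next difference is taken; no information is permanently lost.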

Congestion Information Delivery As described above, the tunnel ingress sends a message containing cumulative byte counts of packets of each type of ECN marking to the tunnel egress, and the tunnel egress feeds back messages to the ingress with at least one of the following: cumulative byte counts of packets of each type of ECN combination, the ratio of CE-marked packets to all packets, and/or the ratio of dropped packets to all packets. It is possible for these messages to contribute to congestion. This section specifies how the messages are conveyed. IPFIX recommends, but does not require, use of SCTP in partial reliability mode for the transport of its messages. This mode allows loss of some packets, which is tolerable because IPFIX communicates cumulative statistics. IPFIX over SCTP over IP SHOULD be used directly where there is IP connectivity between the ingress and egress; however, there might be different transport protocols or address spaces used in different regions of an SFC enabled domain that block such direct IP connectivity. The NSH provides the general method of routing traffic within an SFC enabled domain so the encapsulation of the required IPFIX traffic in NSH MUST be implemented and, when IP connectivity is not available, IPFIX over NSH, as specified in , SHOULD be used along with configuration of appropriate SFC paths for the IPFIX over NSH traffic. Other methods MAY be used in particular SFC domains which support them, such as IPFIX over MPLS. IPFIX messages could travel along the same path as network data traffic. In any case, an IPFIX message packet may get lost in case of network congestion. Even though the missing information could be recovered because of the use of cumulative counts, IPFIX messages SHOULD be transmitted at a higher priority than users' traffic flows to improve the promptness of congestion information feedback. 
The ingress node can do congestion management at different granularities, which means that both the overall aggregate congestion level and the congestion level contributed by certain traffic flows could be measured for different congestion management purposes. For example, if the ingress only wants to limit congestion volume caused by certain traffic flows, such as UDP-based traffic, then congestion volume for that traffic can be fed back; or if the ingress is doing overall congestion management, the aggregated congestion volume can be fed back. When sending IPFIX messages from ingress to egress, the ingress acts as IPFIX exporter and the egress acts as IPFIX collector. When feeding back congestion level information from egress to ingress, the egress acts as IPFIX exporter and ingress acts as IPFIX collector. The combination of congestion level measurement and congestion information delivery procedures is as follows:

The ingress node determines the IPFIX template record to be used. The template record can be pre-configured or determined at runtime. The content of the template record will be determined according to the granularity of congestion management; if the ingress wants to limit congestion volume contributed by specific traffic flows, then elements such as source IP address, destination IP address, flow ID, and CE-marked packet volume of the flows will be included in the template record.
Metering at the ingress measures traffic volume according to the template record chosen and then the measurement records are sent to the egress.
Metering on the egress measures congestion level information according to the template record which, in simple cases, SHOULD be the same as the template record sent by the ingress (see ).
The egress sends its measurement records together with the measurement records of the ingress back to the ingress.
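
The four steps above can be sketched end to end with in-memory records. This is an illustration only: real implementations would carry these values in IPFIX data records, the names Meter, egress_feedback, and ingress_loss_ratio are hypothetical, and the CE-marked ratio is approximated here from byte counts rather than packet counts.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# ECN combinations in the format outer-ECN|inner-ECN, as used in the text.
INGRESS_COMBOS = ("CE|CE", "ECT|N-ECT", "ECT|ECT")
EGRESS_COMBOS = ("CE|CE", "ECT|N-ECT", "CE|N-ECT", "CE|ECT", "ECT|ECT")


@dataclass
class Meter:
    """Cumulative byte counters per ECN combination at one observation point."""
    combos: Tuple[str, ...]
    byte_counts: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for combo in self.combos:
            self.byte_counts.setdefault(combo, 0)

    def observe(self, combo: str, nbytes: int) -> None:
        if combo not in self.byte_counts:
            raise ValueError(f"unexpected ECN combination: {combo}")
        self.byte_counts[combo] += nbytes

    def record(self) -> Dict[str, int]:
        """Measurement record: a copy of the cumulative counters."""
        return dict(self.byte_counts)


def egress_feedback(ingress_record: Dict[str, int], egress: Meter) -> dict:
    """Egress step: add own counters and CE ratio to the ingress record."""
    egress_record = egress.record()
    total = sum(egress_record.values())
    ce = sum(v for k, v in egress_record.items() if k.startswith("CE|"))
    return {"ingress": ingress_record, "egress": egress_record,
            "ce_ratio": ce / total if total else 0.0}


def ingress_loss_ratio(feedback: dict) -> float:
    """Ingress step: compare bytes sent with bytes that reached the egress."""
    sent = sum(feedback["ingress"].values())
    received = sum(feedback["egress"].values())
    return max(sent - received, 0) / sent if sent else 0.0
```

Returning the ingress record inside the feedback lets the ingress compare counts taken over the same span of traffic without keeping per-message state of its own.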

IPFIX Extensions This section specifies the new IPFIX Information Elements needed. It conforms to .

nshServicePathID
In order to identify SFC flows, so that congestion can be measured and reported at that granularity, it is necessary for IPFIX to be able to classify traffic based on the Service Path Identifier (SPI) field of the NSH. Thus, an NSH Service Path Identifier (nshServicePathID) IPFIX Information Element is specified.
Name: nshServicePathID
Description: Network Service Header Service Path Identifier. This is a 24-bit value which is left justified in the Information Element. The low order byte MUST be sent as zero and ignored on receipt.
Abstract Data Type: unsigned32
Data Type Semantics: identifier
ElementId: TBD0
Status: current
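
The left-justified encoding of the 24-bit SPI in an unsigned32 value can be sketched as follows. The function names are hypothetical; the sketch assumes the usual network byte order for the encoded value.

```python
def encode_nsh_service_path_id(spi: int) -> bytes:
    """Encode a 24-bit SPI left justified in 4 bytes, low-order byte zero."""
    if not 0 <= spi < (1 << 24):
        raise ValueError("SPI must fit in 24 bits")
    return (spi << 8).to_bytes(4, "big")


def decode_nsh_service_path_id(value: bytes) -> int:
    """Recover the SPI, ignoring the low-order byte on receipt."""
    if len(value) != 4:
        raise ValueError("expected a 4-byte unsigned32 value")
    return int.from_bytes(value, "big") >> 8
```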

tunnelEcnCeCeByteTotalCount Description: The total number of bytes of incoming packets with the CE|CE ECN marking combination at the Observation Point since the Metering Process (re-)initialization for this Observation Point. Abstract Data Type: unsigned64 Data Type Semantics: totalCounter ElementId: TBD1 Status: current Units: bytes

tunnelEcnEctNectBytetTotalCount Description: The total number of bytes of incoming packets with the ECT|N-ECT ECN marking combination (ECT(0) and ECT(1) are treated the same as each other) at the Observation Point since the Metering Process (re-)initialization for this Observation Point. Abstract Data Type: unsigned64 Data Type Semantics: totalCounter ElementId: TBD2 Status: current Units: bytes

tunnelEcnCeNectByteTotalCount Description: The total number of bytes of incoming packets with the CE|N-ECT ECN marking combination at the Observation Point since the Metering Process (re-)initialization for this Observation Point. Abstract Data Type: unsigned64 Data Type Semantics: totalCounter ElementId: TBD3 Status: current Units: bytes

tunnelEcnCeEctByteTotalCount Description: The total number of bytes of incoming packets with the CE|ECT ECN marking combination (ECT(0) and ECT(1) are treated the same as each other) at the Observation Point since the Metering Process (re-)initialization for this Observation Point. Abstract Data Type: unsigned64 Data Type Semantics: totalCounter ElementId: TBD4 Status: current Units: bytes

tunnelEcnEctEctByteTotalCount Description: The total number of bytes of incoming packets with the ECT|ECT ECN marking combination (ECT(0) and ECT(1) are treated the same as each other) at the Observation Point since the Metering Process (re-)initialization for this Observation Point. Abstract Data Type: unsigned64 Data Type Semantics: totalCounter ElementId: TBD5 Status: current Units: bytes

tunnelEcnCEMarkedRatio Description: The ratio of packets that are CE-marked to packets that are not CE-marked at the Observation Point. Abstract Data Type: float32 ElementId: TBD6 Status: current
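The byte counters defined above could be maintained by a Metering Process along the following lines. This is a sketch under stated assumptions, not a definitive implementation: the class and function names are hypothetical, and the code assumes that each "X|Y" combination is written as outer marking followed by inner marking. The two-bit ECN codepoints are those of RFC 3168.

```python
from collections import Counter

# RFC 3168 ECN codepoints
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def _label(codepoint):
    """Map a codepoint to its label; ECT(0) and ECT(1) are treated alike."""
    if codepoint == CE:
        return "CE"
    if codepoint in (ECT_0, ECT_1):
        return "ECT"
    return "N-ECT"


class EcnByteMeter:
    """Accumulates byte totals per ECN marking combination."""

    # The five combinations that have an Information Element above.
    COUNTED = {"CE|CE", "ECT|N-ECT", "CE|N-ECT", "CE|ECT", "ECT|ECT"}

    def __init__(self):
        self.byte_totals = Counter()

    def observe(self, outer_ecn, inner_ecn, length):
        # Assumption: the combination is written outer|inner.
        combo = f"{_label(outer_ecn)}|{_label(inner_ecn)}"
        if combo in self.COUNTED:
            self.byte_totals[combo] += length
        return combo
```

Re-initializing the Metering Process corresponds to creating a fresh `EcnByteMeter`, which resets all totals to zero.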

IPFIX over NSH Encapsulating IPFIX messages with an NSH can be an effective method for transporting such messages within an SFC enabled domain. This is particularly the case if different outer transport protocols are used in different parts of such a domain, for example IP in one part and MPLS in another part. This is accomplished by setting the Next Protocol field in the NSH Base Header to the value TBD7 and placing the IPFIX message immediately after the NSH (including after any NSH Metadata). See .

IPFIX over NSH
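The encapsulation can be sketched as below, assuming an NSH with MD Type 2 and no metadata so that the IPFIX message follows the Service Path Header directly. The function names are hypothetical. Because TBD7 is not yet assigned, the code uses the RFC 8300 "Experiment 1" Next Protocol value (0xFE) purely as a stand-in. The header layouts follow RFC 8300 (NSH) and RFC 7011 (IPFIX message header).

```python
import struct

IPFIX_VERSION = 10  # RFC 7011 message header version number

# Placeholder only: TBD7 is unassigned. 0xFE is the RFC 8300 "Experiment 1"
# Next Protocol value, used here as a stand-in.
NEXT_PROTOCOL_IPFIX_PLACEHOLDER = 0xFE


def build_nsh_md2(spi, si, next_protocol, ttl=63):
    """Build an NSH (MD Type 2, no metadata): Base + Service Path Header."""
    length_words = 2   # total NSH length in 4-octet words
    md_type = 0x2
    # Base Header bit layout: Ver(2) O(1) U(1) TTL(6) Length(6) U(4)
    # MD Type(4) Next Protocol(8). Ver, O, and unassigned bits stay zero.
    base = (ttl << 22) | (length_words << 16) | (md_type << 8) | next_protocol
    service_path = (spi << 8) | si
    return struct.pack("!II", base, service_path)


def build_ipfix_message(sets, export_time, sequence, domain_id):
    """Prepend the 16-octet IPFIX Message Header to already-encoded sets."""
    length = 16 + len(sets)
    header = struct.pack("!HHIII", IPFIX_VERSION, length,
                         export_time, sequence, domain_id)
    return header + sets


def ipfix_over_nsh(spi, si, ipfix_message):
    """Place the IPFIX message immediately after the NSH."""
    nsh = build_nsh_md2(spi, si, NEXT_PROTOCOL_IPFIX_PLACEHOLDER)
    return nsh + ipfix_message
```

The outer transport encapsulation (IP, MPLS, or other) is applied around the result and is outside the scope of this sketch.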

Example of Use This section provides an example of the solution described in this document. First, IPFIX template records are exchanged between ingress and egress to negotiate the format of the data records to be exchanged. The example here measures the congestion level for the overall tunnel caused by all the traffic. After the negotiation is finished, the ingress sends in-band messages to the egress containing the count of each kind of ECN-marked packet (i.e., CE|CE, ECT|N-ECT, and ECT|ECT) received before it sent the IPFIX message. After the egress receives the IPFIX message, the egress calculates the CE-marked packet ratio and counts the number of each kind of ECN-marked packet received before it received that message. Then the egress sends a feedback IPFIX message containing these counts, together with the information in the ingress's message, back to the ingress. The figures below illustrate the procedure between ingress and egress.

Template Record Sent from Egress to Ingress

Template Record Sent from Ingress to Egress

Traffic Flow Between Ingress and Egress

[Figure: the ingress and egress nodes exchange IPFIX messages, with a return arrow from the egress to the ingress. Legend: M = IPFIX Message Packet; P = User Data Packet.]

Traffic Flow Between Ingress and Egress

[Figure: the egress sends an IPFIX data set to the ingress with SetID=256 and Length=72, carrying the fields A1, B1, C1, A2, B2, C2, D, E, and R, in that order.]

The following provides an example of how the tunnel congestion level can be calculated (see ): The congestion level can be divided into two categories: (1) slight congestion (no packets dropped); (2) serious congestion (packets are being dropped).

For slight congestion, the congestion level is indicated by the ratio of CE-marked packets:

   R = ce_marked_ratio = ce-marked / total_egress

For serious congestion, the congestion level is indicated as the volume of traffic loss:

   total_ingress = (A1 + B1 + C1)
   total_egress = (A2 + B2 + C2 + D + E)
   volume_loss = (total_ingress - total_egress)
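The calculation above can be sketched in code. The layout used here is an inference rather than something the text states: a 4-octet Set Header followed by eight unsigned64 counters (A1 through E) and one float32 (R) gives 4 + 64 + 4 = 72 octets, which matches the Length=72 shown in the figure. The class and function names are hypothetical, and the ratio R is taken as reported by the egress rather than recomputed.

```python
import struct
from typing import NamedTuple


class FeedbackRecord(NamedTuple):
    a1: int   # ingress counters
    b1: int
    c1: int
    a2: int   # egress counters
    b2: int
    c2: int
    d: int
    e: int
    r: float  # CE-marked ratio reported by the egress


# Assumed layout: Set Header (2 x uint16), 8 x unsigned64, 1 x float32.
_FORMAT = "!HH8Qf"


def pack_feedback(record, set_id=256):
    """Encode the feedback data set; the length field comes out as 72."""
    return struct.pack(_FORMAT, set_id, struct.calcsize(_FORMAT), *record)


def unpack_feedback(data):
    set_id, length, *fields = struct.unpack(_FORMAT, data)
    if length != len(data):
        raise ValueError("set length does not match received data")
    return set_id, FeedbackRecord(*fields)


def congestion_level(rec):
    """Return ("serious", volume_loss) or ("slight", R)."""
    total_ingress = rec.a1 + rec.b1 + rec.c1
    total_egress = rec.a2 + rec.b2 + rec.c2 + rec.d + rec.e
    volume_loss = total_ingress - total_egress
    if volume_loss > 0:
        return ("serious", volume_loss)  # traffic was lost in the tunnel
    return ("slight", rec.r)             # no loss: report the CE ratio
```

For example, ingress counters of 100, 200, and 300 against egress counters that sum to 580 yield a volume loss of 20 and therefore the "serious" category. Packets still in flight between the two measurement points are not accounted for in this sketch.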

IANA Considerations The following subsections provide IANA assignment considerations.

SFC NSH Header ECN Bits IANA is requested to assign two contiguous bits in the NSH Base Header Bits registry for ECN (bits 16 and 17 suggested) and note this assignment as follows:

Bit	Description	Reference
tbd(16-17)	NSH ECN	[this document]
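If the suggested bits 16 and 17 are assigned, the NSH ECN field could be read and written as sketched below. This is illustrative only: the constant and function names are hypothetical, the bit positions are the suggested ones rather than a final assignment, and the code assumes the field carries the same two-bit codepoints as RFC 3168. Bits are numbered from the most significant bit of the 32-bit Base Header, so bits 16-17 correspond to a shift of 14 from the least significant end.

```python
NSH_ECN_SHIFT = 14                    # suggested bits 16-17, counted from the MSB
NSH_ECN_MASK = 0b11 << NSH_ECN_SHIFT  # 0x0000C000


def get_nsh_ecn(base_word):
    """Extract the 2-bit ECN field from a 32-bit NSH Base Header word."""
    return (base_word & NSH_ECN_MASK) >> NSH_ECN_SHIFT


def set_nsh_ecn(base_word, ecn):
    """Return the Base Header word with its ECN field replaced."""
    if not 0 <= ecn <= 0b11:
        raise ValueError("ECN field is 2 bits")
    cleared = base_word & ~NSH_ECN_MASK & 0xFFFFFFFF
    return cleared | (ecn << NSH_ECN_SHIFT)
```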

SFC NSH Next Protocol Value IANA is requested to assign a next protocol value in the NSH Next Protocol Registry, as follows:

Next Protocol	Description	Reference
TBD7	IPFIX	[this document]

IPFIX Information Element IDs IANA is requested to assign seven IPFIX Information Element IDs as follows:

ElementID:: TBD0
Name:: nshServicePathID
Data Type:: unsigned32
Data Type Semantics:: identifier
Status:: current
Description:: The Network Service Header Service Path Identifier.

ElementID:: TBD1
Name:: tunnelEcnCeCeByteTotalCount
Data Type:: unsigned64
Data Type Semantics:: totalCounter
Status:: current
Description:: The total number of bytes of incoming packets with the CE|CE ECN marking combination at the Observation Point since the Metering Process (re-)initialization for this Observation Point.
Units:: octets

ElementID:: TBD2
Name:: tunnelEcnEctNectByteTotalCount
Data Type:: unsigned64
Data Type Semantics:: totalCounter
Status:: current
Description:: The total number of bytes of incoming packets with the ECT|N-ECT ECN marking combination at the Observation Point since the Metering Process (re-)initialization for this Observation Point.
Units:: octets

ElementID:: TBD3
Name:: tunnelEcnCeNectByteTotalCount
Data Type:: unsigned64
Data Type Semantics:: totalCounter
Status:: current
Description:: The total number of bytes of incoming packets with the CE|N-ECT ECN marking combination at the Observation Point since the Metering Process (re-)initialization for this Observation Point.
Units:: octets

ElementID:: TBD4
Name:: tunnelEcnCeEctByteTotalCount
Data Type:: unsigned64
Data Type Semantics:: totalCounter
Status:: current
Description:: The total number of bytes of incoming packets with the CE|ECT ECN marking combination at the Observation Point since the Metering Process (re-)initialization for this Observation Point.
Units:: octets

ElementID:: TBD5
Name:: tunnelEcnEctEctByteTotalCount
Data Type:: unsigned64
Data Type Semantics:: totalCounter
Status:: current
Description:: The total number of bytes of incoming packets with the ECT|ECT ECN marking combination at the Observation Point since the Metering Process (re-)initialization for this Observation Point.
Units:: octets

ElementID:: TBD6
Name:: tunnelEcnCEMarkedRatio
Data Type:: float32
Status:: current
Description:: The ratio of CE-marked packets to non-CE-marked packets at the Observation Point.

Security Considerations For general NSH security considerations, see . For security considerations concerning ECN signaling tampering, see . For security considerations concerning ECN and encapsulation, see . For general IPFIX security considerations, see . False congestion feedback could cause throttling or rerouting. If deployed in an untrusted environment, the signaling traffic between ingress and egress can be protected utilizing the security mechanisms provided by IPFIX (see Section 11 in ). The tunnel endpoints (the ingress and egress for an SFC enabled domain) are assumed to be in the same administrative domain, so they will trust each other.

Privacy Considerations It is important to assure unified administrative control and protection against external observation of the SFC domain in which the solution presented in this draft is deployed. The NSH Service Path Identifier (SPI) and associated metadata can be stable identifiers for classes of traffic. When combined with IPFIX-communicated congestion and traffic statistics, these identifiers can enable correlation of traffic characteristics over time. This could allow an observer to infer tenant traffic patterns, service usage, or behavior, especially if metadata includes tenant- or flow-specific identifiers. In aggregate, it could reveal capacity limits, bottleneck locations, peak load times, traffic engineering policies, or the like. Implementations and deployments SHOULD limit the inclusion of identifying metadata to what is operationally necessary.

Normative References Informative References Identifying Modified Explicit Congestion Notification (ECN) Semantics for Ultra-Low Queuing Delay (L4S)

Acknowledgements Most of the material on Tunnel Congestion Feedback was originally in draft-ietf-tsvwg-tunnel-congestion-feedback. After discussion with the authors of that draft, the authors of this draft, and the Chairs of the TSVWG and SFC Working Groups, the Tunnel Congestion Feedback draft was merged into this draft. The authors wish to thank the following for their comments, suggestions, and reviews: David Black, Mohamed Boucadair, Sami Boutros, Anthony Chan, Lingli Deng, Liang Geng, Joel Halpern, Jake Holland, John Kaippallimalil, Tal Mizrahi, Vincent Roca, Lei Zhu.