<?xml version='1.0' encoding='utf-8'?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->
<!-- <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?> -->
<!-- This third-party XSLT can be enabled for direct transformations in XML processors, including most browsers -->

<rfc
      xmlns:xi="http://www.w3.org/2001/XInclude"
      category="std"
      docName="draft-zhang-idr-portid-ec-02"
      ipr="trust200902"
      obsoletes=""
      updates=""
      submissionType="IETF"
      xml:lang="en"
      tocInclude="true"
      tocDepth="4"
      symRefs="true"
      sortRefs="true"
      version="3">
  <!-- xml2rfc v2v3 conversion 2.38.1 -->
  <!-- category values: std, bcp, info, exp, and historic
    ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
       or pre5378Trust200902
    you can add the attributes updates="NNNN" and obsoletes="NNNN" 
    they will automatically be output with "(if approved)" -->

 <!-- ***** FRONT MATTER ***** -->

 <front>
    <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->

   <title>BGP PORT EC for AIDC</title>
    <seriesInfo name="Internet-Draft" value="draft-zhang-idr-portid-ec-02"/>
    <!-- add 'role="editor"' below for the editors if appropriate -->

   <!-- Another author who claims to be an editor -->

   <author fullname="Junye Zhang" initials="J" surname="Zhang">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>zhangjunye@chinamobile.com</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
	
	<author fullname="Rui Zhuang" initials="R" surname="Zhuang">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>zhuangruiyjy@chinamobile.com</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
   
   <author fullname="Zheng Zhang" initials="Z" role="editor" surname="Zhang">
      <organization>ZTE Corporation</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>zhang.zheng@zte.com.cn</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
	
	<author fullname="Dongyu Yuan" initials="D" surname="Yuan">
      <organization>ZTE Corporation</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>yuan.dongyu@zte.com.cn</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>

    <date year="2026"/>
    <!-- If the month and year are both specified and are the current ones, xml2rfc will fill 
        in the current day for you. If only the current year is specified, xml2rfc will fill 
     in the current day and month for you. If the year is not the current one, it is 
     necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the 
     purpose of calculating the expiry date).  With drafts it is normally sufficient to 
     specify just the year. -->

   <!-- Meta-data Declarations -->

   <area>Routing</area>
    <workgroup>IDR</workgroup>
    <!-- WG name at the upperleft corner of the doc,
        IETF is fine for individual submissions.  
     If this element is not present, the default is "Network Working Group",
        which is used by the RFC Editor as a nod to the history of the IETF. -->

   <keyword>BGP PORT AIDC</keyword>
    <!-- Keywords will be incorporated into HTML output
        files in a meta tag but they have no effect on text or nroff
        output. If you submit your draft to the RFC Editor, the
        keywords will be used for the search engine. -->

   <abstract>
      <t>This document defines a new BGP extended community attribute for AI computing scenarios.
         The attribute carries the port ID when a switch advertises routes before AI tasks are launched, 
		 so that bandwidth can be negotiated before large-scale traffic is sent.</t>
    </abstract>
  </front>
  <middle>
    <section numbered="true" toc="default">
      <name>Introduction</name>
      <t>With the rapid development of Artificial Intelligence (AI) and Machine Learning (ML), 
	  AI tasks often generate large volumes of traffic because of the characteristics of large language model (LLM) computation. 
	  If the link bandwidth is insufficient, packet loss may occur. 
	  AI computation has very high reliability requirements and extremely low tolerance for packet loss and latency. 
	  Link congestion that leads to packet loss or excessive latency 
	  significantly reduces the computational efficiency of AI tasks.</t>
	  
	  <t>In data centers used for AI and machine learning, BGP is often used as the routing protocol <xref target="RFC7938" format="default"/>. 
	  In some implementations, sufficient bandwidth between the destination server and its connected leaf switches 
	  must be ensured before sending traffic for AI tasks.
      On the network side, specifically the area comprised of the Leaf and Spine switches, 
	  there are numerous ECMP links. 
	  Techniques such as packet spraying can be used to minimize congestion and packet loss. 
	  However, on the computing side, specifically the last hop between the Leaf switches and the server, 
	  congestion can easily lead to packet loss, significantly reducing the efficiency of AI tasks. 
	  To minimize or eliminate packet loss on the last hop, 
	  BGP needs to be extended to include port information on the destination leaf switch. 
	  This allows the sender to negotiate based on this information before sending traffic, 
	  ensuring sufficient bandwidth is available in the last hop and preventing congestion and packet loss due to insufficient bandwidth.
	  To reduce the load caused by full-mesh connections, Leaf switches do not establish BGP sessions with each other.</t>
	  
      <t><xref target="I-D.zhuang-rtgwg-aidc-gse-architecture"/> describes two common deployment scenarios in AIDC. 
	  In both scenarios, the corresponding ports need to be advertised along with the routes.</t>

      <section numbered="true" toc="default">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
       document are to be interpreted as described in <xref target="RFC2119" format="default"/>.</t>
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>Format</name>
      <t>When announcing a route to the connected server or to other PoDs, BGP on the Leaf switch 
	or on the SSpine switch carries the switch's address and the port ID information.</t>
	
	<section numbered="true" toc="default">
      <name>PORT EC format</name>
	  
	<t>The Transitive IPv4-Address-Specific Extended Community defined in <xref target= "RFC7153"/> 
	and <xref target="I-D.ietf-idr-rfc4360-bis"/>, 
	with the new sub-type "Route Port ID", is used to carry the IPv4 address of the switch 
	and the related port ID to the destination.</t>
	
	<t>The Transitive IPv6-Address-Specific Extended Community defined in <xref target= "RFC5701"/>, 
	with the new sub-type "Route Port ID", is used to carry the IPv6 address of the switch 
	and the related port ID to the destination.</t>
	
	  <figure anchor="Fig1">
        <artwork align="left" name="Figure 1" type="" alt=""><![CDATA[
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | 0x01 or 0x41  |   Sub-Type    |    Global Administrator       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Global Administrator (cont.)  |    Local Administrator        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           ]]></artwork>
      </figure>
	  
	  <t>Figure 1 shows the format of the IPv4-Address-Specific Extended Community, where:</t>
	  <ul spacing="normal">
        <li>Sub-Type: TBD. This indicates that this is the Route Port ID extended community;</li>
        <li>Global Administrator: 4 octets, set to the IPv4 address of the switch that advertises the route. 
		This address can be the loopback address for establishing the BGP connection;</li>
		<li>Local Administrator: 2 octets, set to the ID of the switch port that leads to this route, 
		with a value range of 1 to 65535.</li>
      </ul>
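	  <t>As an illustration only, the 8-octet layout of Figure 1 could be packed and unpacked as in the following sketch. 
	  The Sub-Type is still TBD, so the value 0xFF used here is a placeholder and not an assigned code point; 
	  the function names are likewise hypothetical.</t>
	  <sourcecode type="python"><![CDATA[
```python
import ipaddress
import struct

# Type octet from Figure 1 (0x01 is the transitive variant).
TYPE_IPV4_TRANSITIVE = 0x01

# The Route Port ID sub-type is TBD. This placeholder exists only so
# that the sketch runs; it is NOT an assigned code point.
PLACEHOLDER_SUBTYPE = 0xFF


def encode_port_ec_v4(switch_addr: str, port_id: int,
                      subtype: int = PLACEHOLDER_SUBTYPE) -> bytes:
    """Pack a PORT EC into the 8-octet layout of Figure 1."""
    if not 1 <= port_id <= 65535:
        raise ValueError("port ID must be in the range 1..65535")
    # Global Administrator: 4-octet IPv4 address of the switch.
    global_admin = ipaddress.IPv4Address(switch_addr).packed
    # Network byte order: type, sub-type, address, Local Administrator.
    return struct.pack("!BB4sH", TYPE_IPV4_TRANSITIVE, subtype,
                       global_admin, port_id)


def decode_port_ec_v4(data: bytes) -> tuple:
    """Unpack 8 octets into (type, sub-type, address, port ID)."""
    if len(data) != 8:
        raise ValueError("IPv4-Address-Specific EC must be 8 octets")
    ec_type, subtype, global_admin, port_id = struct.unpack("!BB4sH", data)
    return ec_type, subtype, str(ipaddress.IPv4Address(global_admin)), port_id
```
]]></sourcecode>
	  <t>For example, under these assumptions, a switch with loopback address 192.0.2.1 advertising a route reached through port 20 
	  would call encode_port_ec_v4("192.0.2.1", 20).</t>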
	  
		  <figure anchor="Fig2">
        <artwork align="left" name="Figure 2" type="" alt=""><![CDATA[
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | 0x00 or 0x40  |    Sub-Type   |    Global Administrator       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Global Administrator (cont.)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Global Administrator (cont.)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Global Administrator (cont.)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Global Administrator (cont.)  |    Local Administrator        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           ]]></artwork>
      </figure>
	  
	  <t>Figure 2 shows the format of the IPv6-Address-Specific Extended Community, where:</t>
	  <ul spacing="normal">
        <li>Sub-Type: TBD. This indicates that this is the Route Port ID extended community;</li>
        <li>Global Administrator: 16 octets, set to the IPv6 address of the switch that advertises the route. 
		This address can be the loopback address for establishing the BGP connection;</li>
		<li>Local Administrator: 2 octets, set to the ID of the switch port that leads to this route, 
		with a value range of 1 to 65535.</li>
      </ul>
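	  <t>A corresponding sketch for the 20-octet layout of Figure 2 is given below, again for illustration only. 
	  The placeholder Sub-Type 0xFF and the function names are assumptions made for this example and are not defined by this document.</t>
	  <sourcecode type="python"><![CDATA[
```python
import ipaddress
import struct

# Type octet from Figure 2 (0x00 is the transitive variant).
TYPE_IPV6_TRANSITIVE = 0x00

# The Route Port ID sub-type is TBD. This placeholder exists only so
# that the sketch runs; it is NOT an assigned code point.
PLACEHOLDER_SUBTYPE = 0xFF


def encode_port_ec_v6(switch_addr: str, port_id: int,
                      subtype: int = PLACEHOLDER_SUBTYPE) -> bytes:
    """Pack a PORT EC into the 20-octet layout of Figure 2."""
    if not 1 <= port_id <= 65535:
        raise ValueError("port ID must be in the range 1..65535")
    # Global Administrator: 16-octet IPv6 address of the switch.
    global_admin = ipaddress.IPv6Address(switch_addr).packed
    return struct.pack("!BB16sH", TYPE_IPV6_TRANSITIVE, subtype,
                       global_admin, port_id)


def decode_port_ec_v6(data: bytes) -> tuple:
    """Unpack 20 octets into (type, sub-type, address, port ID)."""
    if len(data) != 20:
        raise ValueError("IPv6-Address-Specific EC must be 20 octets")
    ec_type, subtype, global_admin, port_id = struct.unpack("!BB16sH", data)
    return ec_type, subtype, str(ipaddress.IPv6Address(global_admin)), port_id
```
]]></sourcecode>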
	  
	  </section>
	  
	  <section numbered="true" toc="default">
      <name>Aggregated PORT EC format</name>
	  <t>TBD.</t>
	  </section>
	  
    </section>

    <section numbered="true" toc="default">
      <name>Specification</name>
      <t>When advertising a route to a server or a route learned from the core, 
	  the Leaf switch or PoD edge switch attaches the extended community attribute defined in this document.
	  In the rest of this document, the "Route Port ID extended community" 
	  and the "Aggregate PORT Route Port ID extended community" are abbreviated as PORT EC and AGG PORT EC, 
	  respectively.</t>
      
      <section numbered="true" toc="default">
      <name>PORT EC advertisement within PoD</name>
	  
	  <t>Scenario 1 in <xref target="I-D.zhuang-rtgwg-aidc-gse-architecture"/> describes a scenario 
	  within a single Point of Delivery (PoD). 
	  Before sending traffic, each Leaf, in addition to announcing its own loopback route, 
	  also announces routes to connected GPUs/NICs, carrying the extended community attributes defined in this document. 
	  The Global Administrator in the PORT EC is set to the Leaf's own loopback address, 
	  and the Local Administrator in the PORT EC is set to the port number connected to the GPU/NIC. 
	  Routes are propagated through the Spine device, allowing each Leaf to learn the routes to the other GPUs within its PoD, 
	  together with the Leaf devices and ports to which those GPUs are connected.</t>
	 
 	  <t>As illustrated in the example, when GPU1 wants to send traffic to GPU20, 
	  it needs to ensure that the interface between Leaf3 and NIC20 does not experience incasting, 
	  meaning the link between Leaf3 and NIC20 has sufficient bandwidth to receive the traffic. 
	  According to the authorization mechanism mentioned in <xref target="I-D.zhuang-rtgwg-aidc-gse-architecture"/>, 
	  before GPU1 sends traffic, it needs to request authorization from GPU20. 
	  This authorization includes a bandwidth request. 
	  If GPU20 determines that the link bandwidth with Leaf3 is sufficient, 
	  it will send a successful authorization message back to GPU1. 
	  GPU1 then begins sending traffic.</t>
      
      <t>Upon receiving the route carrying the PORT EC, 
	  the leaf switch checks if the address carried in the PORT EC is reachable. 
	  If unreachable, the extended community is ignored. 
	  If reachable, the address and port information are stored locally or sent to the server. 
	  This storing or sending process is outside the scope of this draft.</t>
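	  <t>The receive rule above can be sketched as follows. This is a minimal illustration, not a normative procedure: 
	  the is_reachable callback and the dictionary used as local storage are hypothetical, 
	  since how reachability is determined and how the information is stored are outside the scope of this document.</t>
	  <sourcecode type="python"><![CDATA[
```python
from typing import Callable, Dict, Set, Tuple

# (switch address, port ID) pair carried in a PORT EC.
PortEC = Tuple[str, int]


def process_port_ec(prefix: str, port_ec: PortEC,
                    is_reachable: Callable[[str], bool],
                    store: Dict[str, Set[PortEC]]) -> bool:
    """Apply the receive rule to one PORT EC attached to a route.

    Returns True when the EC was stored and False when it was ignored.
    """
    switch_addr, _port_id = port_ec
    if not is_reachable(switch_addr):
        # Address in the EC is unreachable: ignore the extended
        # community. The route itself is not modified by this sketch.
        return False
    # Address is reachable: keep the address and port for this prefix.
    store.setdefault(prefix, set()).add(port_ec)
    return True
```
]]></sourcecode>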
	  
	  <t>When the GPU or NIC does not support the authorization mechanism, this task can also be performed by the Leaf device. 
	  Leaf1 initiates an authorization request to Leaf3, specifying the required bandwidth and the link between Leaf3 and GPU20/NIC20. 
	  If Leaf3 determines that the link bandwidth with GPU20/NIC20 is sufficient, 
	  it will send a successful authorization message back to Leaf1, and only then will Leaf1 begin forwarding traffic from GPU1. 
	  This scenario also requires negotiation of traffic transmission request information between GPU1/NIC1 and Leaf1, 
	  which is out of scope for this document.</t>
	  
	  <t>After traffic enters Leaf1, it undergoes the GSE header encapsulation described in <xref target="I-D.zhuang-rtgwg-aidc-gse-architecture"/>
	  as it travels through the Spine device to Leaf3. 
	  During this process, the header is used for packet spraying across the ECMP paths. 
	  Upon reaching Leaf3, Leaf3 then forwards the packet to NIC20. 
	  When the NIC supports both authorization mechanisms and GSE header encapsulation, 
	  the Leaf device only needs to forward packets based on the GSE header. 
	  When the NIC does not support authorization mechanisms or GSE encapsulation, 
	  the Leaf device performs the GSE header encapsulation, and upon reaching Leaf3, 
	  Leaf3 decapsulates the GSE header and rearranges any out-of-order packets before sending them to NIC20.</t>	  
	  </section>
	  
	  
	  <section numbered="true" toc="default">
      <name>AGG PORT EC advertisement</name>

      <t>Scenario 2 in <xref target="I-D.zhuang-rtgwg-aidc-gse-architecture"/> describes a traffic interconnection scenario 
	  between multiple PoDs, where traffic needs to be sent to the GPUs of other PoDs. 
	  The Core device in the diagram is used to interconnect the SSpine devices of different PoDs; 
	  the SSpine device is a device within a PoD used to connect to the Core device for cross-PoD interconnection. 
	  In some deployment scenarios, the SSpine layer device may be omitted, and the Spine device directly interconnects 
	  with the Core device.</t>
	  
	  <t>As seen in Scenario 2 in <xref target="I-D.zhuang-rtgwg-aidc-gse-architecture"/>, 
	  there may be multiple ECMP paths from Spine to SSpine. 
	  When the same prefix is advertised with multiple different PORT EC attributes,
      only one route is selected as the best route according to the EBGP route selection rules within the data center,
      and only that route is advertised to other nodes.
      This does not make good use of resources. 
	  For example, in Scenario 2, Spine1 selects the best route to other PoDs from the routes received from the SSpine switches. 
	  Although SSpine1 is not the only possible next hop, and the other SSpine switches can also serve as ECMP next hops, 
	  a Spine switch that has selected the best route for a prefix 
	  may advertise only the one corresponding PORT EC attribute to the Leaf switches. 
	  In this case, the Leaf switches can negotiate only for the link between Spine1 and SSpine1, 
	  and cannot use the ECMP links between Spine1 and the other SSpine switches, which wastes resources.</t>
	  
	  <t>To avoid attribute loss after BGP route selection, 
	  one approach is to enable the ADD-PATH function defined in <xref target= "RFC7911"/> in the PoD. 
	  However, this function may cause a route advertisement storm, 
	  severely impacting the efficiency of route advertisement and affecting normal forwarding traffic. 
	  Another approach is to use the AGG PORT EC attributes defined in this document. 
	  This helps establish more comprehensive ECMP entries.</t>
	  
	  <t>Suppose GPU1 needs to send traffic to GPU5000 in another PoD, such as PoD2. 
	  Within PoD1, where GPU1 resides, a similar approach to Scenario 1 can be used to ensure that traffic is not out of order 
	  or lost before being sent to the Core device. 
	  According to BGP route advertisements, routes from other PoDs, such as the route for GPU5000, will be advertised to 
	  the SSpine device through multiple Core devices, such as Core1 through Core8. 
	  The SSpine device will then advertise these routes to the Spine device, which in turn advertises them to the Leaf device. 
	  During this advertisement process, the SSpine device, acting as the interface between its PoD and the Core layer devices, 
	  will include its own address and ports when advertising routes obtained from other PoDs. 
	  In this example, SSpine1 receives routes to GPU5000 from Core1 through Core8. 
	  When advertising these routes to the Spine device, it will include the AGG PORT EC attributes defined in this document. 
	  Because there are multiple ports connected to the Core, they need to be carried in an aggregated list format.</t>
	  
	  <t>When Spine advertises routes received from SSpine to Leaf devices, it needs to further aggregate different AGG PORT ECs 
	  for the same route. 
	  For example, Spine1 might receive routes from SSpine1 to SSpine8 advertising GPU5000. 
	  Spine1 will then use AGG PORT ECs to further aggregate these advertising devices and ports before advertising them to Leaf devices. 
	  In this way, the GPU5000 routes learned by Leaf devices will include multiple devices from SSpine1 to SSpine8, 
	  and each SSpine will also have a set of ports.</t>
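	  <t>The aggregation step at the Spine can be pictured with the following sketch. 
	  It only models the in-memory grouping of devices and ports per route; the wire format of the AGG PORT EC is TBD, 
	  and the function name and the tuple representation of an advertisement are assumptions made for this example.</t>
	  <sourcecode type="python"><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One received advertisement: (prefix, advertising device, port ID).
Advertisement = Tuple[str, str, int]


def aggregate_port_ecs(
        advertisements: Iterable[Advertisement]
) -> Dict[str, Dict[str, List[int]]]:
    """Group the ports of every advertising device under each prefix."""
    grouped = defaultdict(lambda: defaultdict(set))
    for prefix, device, port_id in advertisements:
        grouped[prefix][device].add(port_id)
    # Convert to plain dicts with sorted port lists for stable output.
    return {
        prefix: {device: sorted(ports) for device, ports in devices.items()}
        for prefix, devices in grouped.items()
    }
```
]]></sourcecode>
	  <t>With this representation, the entry that a Leaf holds for the GPU5000 route maps each SSpine address to its set of Core-facing ports.</t>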

      <t>To avoid incasting between the SSpine device and the Core device, a similar approach to Scenario 1 is adopted: 
	  before Leaf1 forwards traffic, it negotiates authorization with the SSpine device. 
	  Leaf1 has multiple selectable SSpine devices, each with multiple interfaces connected to the Core. 
	  Leaf1 can send authorization requests based on the received SSpine device and port information. 
	  For example, in this case, Leaf1 can send an authorization request to SSpine1, specifying the interfaces between SSpine1 and Core1. 
	  If the authorization request fails, it can then send an authorization request to the interfaces between SSpine1 and Core2. 
	  When all interfaces on SSpine1 connected to the Core cannot meet the requirements, it sends an authorization request to SSpine2, 
	  specifying the interfaces between SSpine2 and Core1, and so on, until an SSpine device and port capable of handling 
	  the bandwidth are found. 
	  Of course, in the implementation, the method of finding the target SSpine device and its port can be optimized, 
	  not necessarily starting from the first port of the first device each time, to improve efficiency.</t>
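	  <t>The sequential search described above can be sketched as follows, under stated assumptions. 
	  The request_authorization callback stands in for the authorization exchange, which is not defined in this document, 
	  and the bandwidth unit is arbitrary. The sketch implements only the simple first-device, first-port order; 
	  an implementation may choose a different order.</t>
	  <sourcecode type="python"><![CDATA[
```python
from typing import Callable, Dict, List, Optional, Tuple


def find_authorized_port(
        candidates: Dict[str, List[int]],
        required_bw: int,
        request_authorization: Callable[[str, int, int], bool],
) -> Optional[Tuple[str, int]]:
    """Try each SSpine device and port in order until one is authorized.

    `candidates` maps an SSpine address to its Core-facing port IDs, as
    learned from the advertised device and port information.
    Returns (device, port ID), or None when every request fails.
    """
    for device, ports in candidates.items():
        for port_id in ports:
            # Hypothetical exchange: True means the request succeeded.
            if request_authorization(device, port_id, required_bw):
                return device, port_id
    return None
```
]]></sourcecode>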
	  
	 <t>During the entire forwarding process from Leaf to SSpine, the traffic also needs to be encapsulated with 
	 the GSE header described in <xref target="I-D.zhuang-rtgwg-aidc-gse-architecture"/>. 
	 The fields in the header will guide the traffic packets to be sprayed onto the ECMP path. 
	 When the packets reach the selected SSpine device, the SSpine device restores their order 
	 and then sends them to the Core device to ensure reliable forwarding of the traffic within this PoD.</t>
     
	 </section>
	  
    </section>
    
    <section anchor="IANA" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>IANA is requested to allocate two new code points, one from each of 
	  the "Transitive IPv4-Address-Specific Extended Community Sub-Types" 
	  and the "Transitive IPv6-Address-Specific Extended Community Sub-Types" registries.</t>
	    <table anchor="table_1" align="center">
          <name>Requested Sub-Type Allocation</name>
          <thead>
            <tr>
              <th align="center">Type</th>
              <th align="center">Description</th>
			  <th align="center">Reference</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td align="center">TBD</td>
              <td align="center">Route Port ID</td>
			  <td align="center">This Document</td>
            </tr>
          </tbody>
        </table>
		
      <t>Aggregated PORT EC registry: TBD.</t>
    </section>
	
    <section anchor="Security" numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>This extension to BGP has security implications similar to those of BGP Extended Communities <xref target= "RFC7153"/>,
	  <xref target= "RFC5701"/> and <xref target="I-D.ietf-idr-rfc4360-bis"/>.</t>
    </section>
  </middle>
  <!--  *****BACK MATTER ***** -->

 <back>

   <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
		<xi:include href="http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
		<xi:include href="http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5701.xml"/>
		<xi:include href="http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7153.xml"/>
		<xi:include href="https://datatracker.ietf.org/doc/bibxml3/draft-ietf-idr-rfc4360-bis.xml"/>
      </references>
      <references>
        <name>Informative References</name>
		<xi:include href="http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7911.xml"/>
		<xi:include href="http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7938.xml"/>
		<xi:include href="https://datatracker.ietf.org/doc/bibxml3/draft-zhuang-rtgwg-aidc-gse-architecture.xml"/>
    </references>
    </references>
 </back>
</rfc>
