WO2006004780A1 - Advanced switching peer-to-peer protocol - Google Patents
- Publication number
- WO2006004780A1 (PCT/US2005/022975)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- endpoint
- peer
- connection
- fabric
- requesting
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/28—Routing or path finding of packets in data switching networks using route fault recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0811—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/34—Source routing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/25—Routing or path finding in a switch fabric
- H04L49/253—Routing or path finding in a switch fabric using establishment or release of connections between ports
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/55—Prevention, detection or correction of errors
- H04L49/552—Prevention, detection or correction of errors by ensuring the integrity of packets received through redundant connections
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/60—Software-defined switches
- H04L49/602—Multilayer or multiprotocol switching, e.g. IP switching
Definitions
- the field of invention relates generally to computer and processor-based systems and, more specifically but not exclusively, relates to techniques for managing peer-to-peer communication links facilitated by serial-based interconnect fabrics.
- PCI Peripheral Component Interconnect
- communications equipment has historically incorporated many board-level and system-level interconnects, some proprietary, while others are based on standards such as PCI.
- an abundance of interconnect technologies creates complexity in interoperability, coding, and physical design, all of which drive up cost.
- the use of fewer, common interconnects will simplify the convergence process and benefit infrastructure equipment developers.
- the original scheme employed a hierarchy of busses, with "bridges" used to perform interface operations between bus hierarchies.
- the original PCI standard was augmented by the PCI-X standard, which was targeted towards PCI implementations using higher bus speeds.
- serial interconnects reduce pin count, simplify board layout, and offer speed, scalability, reliability and flexibility not possible with parallel busses, such as employed by PCI and PCI-X.
- Current versions of these interconnect technologies rely on high-speed serial (HSS) technologies that have advanced as silicon speeds have increased.
- HSS high-speed serial
- One such standardized serial technology is the PCI Express architecture.
- the PCI Express architecture is targeted as the next-generation chip-to-chip interconnect for computing.
- the PCI Express architecture was developed by a consortium of companies, and is managed by the PCI SIG (special interest group).
- the PCI Express architecture supports functionalities defined in the earlier PCI and PCI-X bus-based architectures.
- PCI and PCI-X compatible drivers and software are likewise compatible with PCI Express devices.
- the enormous investment in PCI software over the last decade will not be lost when transitioning to the new PCI Express architecture.
- While the PCI inheritance aspect of PCI Express is a significant benefit, it also results in some limitations due to the continued support of "legacy" devices employing personal computer (PC) architectural concepts developed in the early 1980s.
- PC personal computer
- AS Advanced Switching
- AS enhances the capabilities of PCI Express by defining compatible extensions, including extensions that address the deficiencies in legacy monolithic processing architectures.
- AS further includes inherent features targeted toward the communications markets, including data-plane functions, flexible protocol encapsulation, and more.
- Figure 1 is a block diagram of a layered architecture illustrating primary layers in the PCI Express and the Advanced Switching (AS) standards;
- Figure 2 is a schematic block diagram showing an exemplary AS use model corresponding to a communications implementation;
- Figure 3 is a block schematic diagram illustrating software components in a distributed software architecture, according to one embodiment of the invention.
- Figure 4 is a block schematic diagram illustrating details of software sub-components and interfaces corresponding to the Primary Fabric Management (PFM) component of Figure 3;
- PFM Primary Fabric Management
- Figure 5 is a block schematic diagram illustrating details of software sub-components and interfaces corresponding to the Endpoint (EP) component of Figure 3;
- Figure 6 is a block schematic diagram illustrating details of software sub-components and interfaces corresponding to the Secondary Fabric Management (SFM) component of Figure 3;
- SFM Secondary Fabric Management
- Figure 7 is a block schematic diagram illustrating details of software sub-components and interfaces corresponding to the AS driver component of Figure 3;
- Figure 8 is a schematic diagram illustrating an exemplary implementation architecture that includes a PFM system, SFM system and two EP systems, each running on a respective platform comprising an AS device coupled to an AS Fabric;
- Figure 9 is a flowchart illustrating operations performed to establish a peer-to-peer connection between a requesting endpoint and a target endpoint, according to one embodiment of the invention;
- Figure 10 is a message flow diagram illustrating details of messages passed between a requesting endpoint, fabric manager, and target endpoint in accordance with the peer-to-peer connection establishment process of Figure 9;
- Figure 11 is a schematic diagram illustrating a scheme for storing attribute information from which characteristics and capabilities of a PCI Express device can be specified;
- Figure 12 is a flowchart illustrating operations performed to close a peer-to-peer connection, according to one embodiment of the invention;
- Figure 13a is a frontal isometric view of an exemplary blade server chassis in which a plurality of server blades are installed;
- Figure 13b is a rear isometric view of the blade server chassis of Figure 13a;
- Figure 14a is a frontal isometric view of an exemplary ATCA chassis;
- Figure 14b is an isometric view of an exemplary ATCA board;
- Figure 15 is a schematic diagram of an exemplary AS communications ecosystem.
- a fundamental goal of PCI Express is to provide an easy migration strategy to expand from the legacy PCI technology into the new serial-based link technology.
- PCI Express accomplishes this by being fully compatible with the existing PCI hardware and software architectures. As a result, PCI Express also inherits the limitations of a global memory address-based and tree topology architecture. This limits the ability of PCI Express to be effectively utilized in peer-to-peer communications between multiple hosts in various topologies, such as star, dual-star, and meshes. These topologies are typically used in blade servers, clusters, storage arrays, and telecom routers and switches.
- the PCI Express architecture is based upon a single host processor or root complex that controls the global memory address space of the entire system. During the power-up and enumeration process, the root complex interrogates the entire system by traversing the hierarchical tree topology and locates all endpoint devices that are connected in the system. A space is allocated for each endpoint device in the global memory in order for the host processor to communicate with it.
- PCI Express extends the inherent transparent bridging concept of PCI to non-transparent bridges. This technique is typically used in applications where there are one or more sub-processing systems or intelligent endpoints that require their own isolated memory space.
- both sides of the bridge are logically treated as endpoints from each local processor's perspective.
- a mirror memory space of equal size is independently allocated on each side of the bridge during each processor's enumeration process.
- the non-transparent bridge is programmed to provide the address translation function in each direction between the two processors' memory maps.
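The address translation performed by a non-transparent bridge can be pictured as a pair of windows, one per direction. The following is a minimal sketch in Python, assuming a single window of equal size on each side; the class name `NtbWindow` and all addresses are hypothetical and are not taken from the PCI Express specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtbWindow:
    """One direction of a non-transparent bridge window (hypothetical model)."""
    local_base: int   # base of the window in the sending processor's memory map
    remote_base: int  # base of the mirrored space in the peer's memory map
    size: int         # both sides allocate a space of equal size

    def translate(self, addr: int) -> int:
        """Map an address inside the local window to the peer's memory map."""
        offset = addr - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError(f"address {addr:#x} is outside the bridge window")
        return self.remote_base + offset


# Illustrative addresses only: one window per direction.
a_to_b = NtbWindow(local_base=0x8000_0000, remote_base=0x2000_0000, size=0x10_0000)
b_to_a = NtbWindow(local_base=0x2000_0000, remote_base=0x8000_0000, size=0x10_0000)
```

Translating an address with `a_to_b` and then with `b_to_a` returns the original address, which reflects the mirrored, equal-size allocation on both sides of the bridge.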
- Non-transparent bridges do not provide adequate congestion management for highly utilized peer-to-peer communications. In peer-to-peer environments where many highly utilized host processors are pushing and pulling data independently and simultaneously, there needs to be a more sophisticated level of congestion management to control the behavior and communications between the interconnected processors. Non-transparent bridges also require an extensive amount of software provisions and reconfiguration to implement fail-over mechanisms in high availability systems. This results in additional design complexity, resource utilization and response times that may not be tolerable for certain applications.
- Advanced Switching architecture was designed to provide a native interconnect solution for multi-host, peer-to-peer communications without additional bridges or media access control.
- AS employs a packet-based transaction layer protocol that operates over the PCI Express physical and data link layers (e.g., physical layer 100 and data link layer 102 in Figure 1).
- Advanced Switching provides enhanced features such as sophisticated packet routing, congestion management, multicast traffic support, as well as fabric redundancy and fail-over mechanism to support high performance, highly utilized, and high availability system environments.
- An exemplary Advanced Switching use model is shown in Figure 2. This particular use model is targeted toward telecommunications usage; however, AS may be applicable to many types of communications and computer environments.
- the use model includes media access elements 200 and 202, each cross-connected to AS fabric elements 204 and 206. Each of the AS fabric elements 204 and 206 is, in turn, cross-connected to CPU (central processing unit) sub-systems 208 and 210, as well as network processor units (NPUs) 212 and 214.
- CPU central processing unit
- NPUs network processor units
- AS is media and switching fabric agnostic, meaning the AS protocol functions the same regardless of the underlying media and switching fabric implementation.
- AS can support underlying communication protocols, via protocol encapsulation.
- AS includes internal protocol interfaces that can be used to tunnel various protocols such as Ethernet, Fibre Channel, and InfiniBand.
- the architecture includes four major components. These include a Primary Fabric Manager (PFM) component 300, an endpoint (EP) component 302, a Secondary Fabric Manager (SFM) component 304, and an AS driver component 306.
- PFM Primary Fabric Manager
- EP endpoint
- SFM Secondary Fabric Manager
- FIG. 3 includes three types of such devices: a PFM device 308, an SFM device 310, and an EP device 312. As described below in further detail, under some configurations a single device may serve as both a PFM or SFM device and an EP device.
- the sub-components include a fabric discovery/configuration sub-component 400, a unicast sub-component 402, a multicast sub-component 404, a High Availability (HA) sub-component 406, an event management sub-component 408, a third-party vendor (TPV) secure interface 410, a local resource management sub-component 412, a hardware (HW) interface 414, a mass storage interface 416, and a user interface 418.
- the fabric discovery/configuration sub-component 400 is responsible for discovery and configuration of the fabric by the initial PFM or by a new PFM when the existing PFM fails. Additionally, as devices are hot added/removed from a system, this sub-component performs re-discovery of the fabric, if needed, and configures the new devices.
- the unicast sub-component 402 implements the unicast protocol defined by the software design. It is responsible for tasks related to, and the management of, point-to-point (PtP) communications between EPs in the fabric.
- the multicast sub-component 404 implements the multicast protocol defined by the software design. It is responsible for tasks related to, and the management of, multicast communications between EPs in the fabric.
- the High Availability sub-component 406 implements the HA protocol defined by the software design. It is responsible for establishing a secondary fabric manager in the fabric and synchronizing the fabric data and tasks related to devices/links failure and/or hot added/removed devices.
- the event management sub-component 408 manages events that are received from the fabric. Generally, the events may be informational or they may indicate error conditions.
- the TPV secure interface 410 provides an interface to third-party vendor software through an AS driver component 306. This interface provides access to the vendor's specific devices and their proprietary registers in the fabric. To provide security and to allow only authorized software to access devices, the TPV sub-component in the AS driver component interfaces to the TPV interface in the PFM and to the TPV software in order to route packets between the two. Only valid requests are granted access to the fabric by the PFM.
- the local resource management sub-component 412 provides an interface to the local resources (such as memory) that exist on the PFM host device.
- the HW interface 414 provides an interface to the AS driver component 306. It is through this interface that packets are sent/received to/from the fabric.
- the mass storage interface 416 sub-component provides an interface to a mass storage device (such as disk drive) that may exist on the device.
- the user interface 418 sub-component provides a user interface to display fabric-related information such as fabric topology and current PtP connections. Additionally, connections can be initiated between EPs in the fabric through this interface.
- the EP component 302 is made up of tasks that are performed by an EP device.
- Figure 5 shows one embodiment of the sub-components making up the EP component and the interfaces between them.
- the sub-components include a unicast sub-component 500, a multicast sub-component 502, a simple load/store (SLS) sub-component 504, a local resource management sub-component 506, a hardware interface sub-component 508, and a mass-storage interface sub-component 510.
- SLS simple load/store
- the unicast sub-component 500 implements the unicast protocol defined by the software design. It is responsible for the tasks that are related to establishing and management of PtP communications between this device and other EPs in the fabric.
- the multicast sub-component 502 implements the multicast protocol defined by the software design. It is responsible for the tasks that are related to establishing and management of multicast communications between the host device for the multicast sub-component and other EPs in the fabric. This is the EP matching component of the PFM's multicast sub-component 404.
- the simple load/store (SLS) sub-component 504 is responsible for the management of all the SLS connections between its host device and other EPs in the fabric. It creates SLS connections and instructs its SLS counterpart in the AS driver to configure and store the connection for SLS applications.
- the local resource management sub-component 506 provides an interface to the local resources (such as memory) that exist on the device hosting an EP component 302.
- the HW interface 508 provides an interface to an instance of AS driver component 306. It is through this interface that packets are sent/received to/from the fabric.
- the mass storage interface 510 sub-component provides an interface to a mass storage device (such as a disk drive) that exists on the EP component host device.
- the SFM component 304 is made up of tasks performed by the secondary fabric manager.
- Figure 6 shows the sub-components making up the SFM component and the interfaces between them. These include a High Availability (HA) sub-component 600, a hardware interface 602, and a mass storage interface 604.
- the High Availability sub-component 600 implements the HA protocol defined by the software design. It is responsible for establishing a connection with the PFM in the fabric, synchronizing the fabric data with it, and monitoring the PFM. Additionally, it is responsible for failing over to take on the PFM role if it determines that the PFM has failed. This is the matching component of the PFM's HA sub-component 406.
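The monitoring and fail-over role of the SFM can be illustrated with a small sketch. The source does not say how the SFM detects a PFM failure; the heartbeat-with-timeout scheme, the class name `SfmMonitor`, and the role strings below are all assumptions made for illustration only.

```python
class SfmMonitor:
    """Hypothetical SFM-side watchdog for the PFM (heartbeat scheme is assumed)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout    # seconds of silence tolerated before fail-over
        self.last_seen = 0.0      # timestamp of the most recent PFM heartbeat
        self.role = "SFM"

    def heartbeat(self, now: float) -> None:
        """Record that the PFM was heard from at time `now`."""
        self.last_seen = now

    def poll(self, now: float) -> str:
        """Promote this manager to PFM if the PFM has been silent too long."""
        if self.role == "SFM" and now - self.last_seen > self.timeout:
            self.role = "PFM"     # the SFM reconfigures itself as the new PFM
        return self.role
```

Passing the current time in explicitly keeps the sketch deterministic; a real component would also need the synchronized fabric data before it could act as the new PFM.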
- the HW interface 602 provides an interface to an instance of AS driver component 306. It is through this interface that packets are sent/received to/from the fabric.
- the mass storage interface 604 sub-component provides an interface to the mass storage (such as hard disk) that exists on the EP component host device.
- the AS Driver component 306 is made up of the tasks to initialize the hardware to send/receive packets to/from the fabric and it provides interfaces to the other components.
- Figure 7 shows the sub-components making up this component and the interface between them.
- the sub-components include a hardware interface register 700, an AS hardware driver 702, and an SLS sub-component 704.
- the hardware interface register 700 includes a PFM component interface 706, an EP component interface 708, an SFM component interface 710, a TPV interface 712, and an SLS application interface 714.
- the AS hardware driver 702 includes a configuration sub-component 716 and interrupt service routines 718.
- the hardware interface register 700 provides an interface to user-level application programs. Through these interfaces the applications discussed above are enabled to send/receive packets to/from the fabric. Each application registers with this sub-component for the packet types that it sends/receives.
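The registration behavior described for the hardware interface register can be sketched as a simple dispatch table keyed by packet type. This is a minimal sketch, not the driver's interface: the class `PacketRegistry`, its method names, and the use of integer packet types are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[bytes], None]


class PacketRegistry:
    """Hypothetical stand-in for the driver's per-application registration."""

    def __init__(self) -> None:
        self._handlers: Dict[int, List[Handler]] = defaultdict(list)

    def register(self, packet_type: int, handler: Handler) -> None:
        """An application registers for a packet type it wants to receive."""
        self._handlers[packet_type].append(handler)

    def dispatch(self, packet_type: int, payload: bytes) -> int:
        """Deliver a received packet; return how many handlers were called."""
        handlers = self._handlers.get(packet_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

A packet whose type has no registered application is simply not delivered, which is one plausible policy; the source does not state what the driver does in that case.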
- the TPV interface 712 sub-component provides interfaces to the third party vendor software and to its TPV counterpart in PFM component 300. Requests coming to the driver from third party software to access certain devices in the fabric will be verified with the PFM to determine if the request is to be granted or not.
- This sub-component provides interfaces to route packets between the TPV software and the PFM. The PFM then provides the security by deciding whether to allow a packet from TPV software onto the fabric and which TPV software, if any, is the recipient of a packet from the fabric.
- the AS hardware driver 702 sub-component is responsible for the initial configuration of the hardware devices. Additionally, it provides the interrupt service routines 718 for the devices.
- the SLS sub-component 704 is a counterpart of SLS sub-component 504 in EP component 302. It is instructed from the EP component to configure SLS connections while SLS sub-component 504 in the EP creates the connections. Additionally, it saves connection information so that the applications requesting SLS connection can directly interface with it in order to send/receive SLS packets.
- the various software components discussed herein may be implemented using one or more conventional architecture structures.
- a component or sub-component may comprise an application running on an operating system (OS), an embedded application running with or without an operating system, a component in an operating system kernel, an operating system driver, a firmware-based component, etc.
- FIG 8 shows an exemplary software architecture in which some of the various software components are embodied as applications running in the user space of an operating system, while other components are embodied as OS kernel space components.
- the software components are used to host a PFM system 800, an SFM system 802, and EP systems 804A and 804B.
- each software system is run by one or more processors provided by a respective platform 806A, 806B, 806C, and 806D.
- platform is used herein to refer to any type of computing device suitable for running a PFM, SFM or EP system.
- platforms include, but are not limited to, server blades, telecom line cards, and ATCA boards.
- each of platforms 806A-D is linked in communication with the other platforms via an AS fabric 808.
- the AS fabric facilitates serial interconnects between devices coupled to the physical AS fabric components.
- the AS fabric components may include dedicated AS switching devices, an active backplane with built-in AS switching functionality, or a combination of the two.
- the PFM system 800 comprises a set of software components used to facilitate primary fabric management operations. These components include one or more SLS applications 810, an EP component 302, a PFM component 300, and an AS driver component 306.
- SLS application, EP component, and PFM component comprise applications running in the user space of an operating system hosted by platform 806A.
- the AS driver component comprises an OS driver located in the kernel space of the OS.
- the software components of SFM system 802 are configured in a similar manner to those in PFM system 800.
- the user space components include one or more SLS applications 810, an EP component 302, and an SFM component 304.
- An AS driver component is located in the kernel space of the operating system hosted by platform 806B.
- each of EP systems 804A and 804B is depicted with a similar configuration.
- the user space components include one or more SLS applications 810 and an EP component 302.
- an AS driver component 306 is located in the kernel space of the operating system running on the platform hosting an EP system (e.g., platforms 806C and 806D).
- AS fabric management can be performed using one of three models, each with their own advantages and disadvantages. Under a centralized fabric management model, there is a central FM authority in the fabric that runs the AS fabric. The FM has full view of the fabric, is aware of all the activities in the fabric, and is responsible for all the fabric-related tasks.
- the fabric-related information is not maintained in a central location.
- EPs perform their own discovery, establish their own connections and perform other tasks without intervention by the FM.
- This model supports multiple FMs.
- the FM performs tasks such as device discovery, while the EPs do other tasks on their own, such as establishing their own connections.
- the hybrid fabric management model is used to manage unicast peer-to-peer connections. Under this approach, the fabric topology and the information about devices are collected and maintained by the FM.
- FD Fabric Discovery
- the FM records which devices are connected, collects information about each device in the fabric, constructs a map of the fabric, and configures appropriate capabilities and/or tables in the devices' configuration space.
- a fully distributed mechanism is employed, wherein the FM may concurrently collect information from more than one device.
- discovery happens in three stages - enumeration, reading devices' configuration space (capabilities and tables), and configuring devices (writing into capabilities and tables).
- the FM performs three tasks, including visiting each device through all paths leading to that device, collecting certain capabilities' offsets for each device discovered, and initializing each device's serial number if a serial number is not already initialized (by the manufacturer, firmware, etc.).
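The enumeration task of visiting each device through all paths leading to it can be sketched as a depth-first walk that records every loop-free path. This is only an illustration: the adjacency-list model, the function name `enumerate_paths`, and the device names are hypothetical, and the sketch treats every device as able to forward traffic, which real endpoints may not do.

```python
from typing import Dict, List

Fabric = Dict[str, List[str]]  # device name -> directly linked neighbours


def enumerate_paths(fabric: Fabric, start: str) -> Dict[str, List[List[str]]]:
    """Return every loop-free path from `start` to each reachable device."""
    paths: Dict[str, List[List[str]]] = {}

    def visit(node: str, trail: List[str]) -> None:
        trail = trail + [node]                   # copy, so siblings are unaffected
        paths.setdefault(node, []).append(trail)
        for neighbour in fabric.get(node, []):
            if neighbour not in trail:           # never revisit a device on this path
                visit(neighbour, trail)

    visit(start, [])
    return paths


# A small dual-star example: the FM reaches the endpoint through either switch.
dual_star: Fabric = {
    "fm": ["sw1", "sw2"],
    "sw1": ["fm", "ep"],
    "sw2": ["fm", "ep"],
    "ep": ["sw1", "sw2"],
}
```

Recording more than one path per device is what later allows an alternate route to be chosen when a link or switch fails.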
- a full discovery and configuration algorithm is run by the Primary Fabric Manager. Additionally, the Primary and Secondary Fabric Managers may perform discovery and configuration operations during fabric run-time, such as in response to the detection of a hot install/remove event. In the event of a failure, FM operations that were previously performed by a PFM are performed by an SFM, which reconfigures itself as the new PFM for the system.
- One of the most valuable functions facilitated by AS is peer-to-peer communication, also known as unicast communication or a unicast link. In one embodiment, a unicast protocol facilitated by an FM component and an EP component is employed to manage unicast operations.
- the FM component e.g., PFM unicast sub-component 402
- the EP component e.g., EP unicast sub-component 500
- To perform a peer-to-peer communication between EP devices, a unicast link must first be established. Operations for setting up a unicast link, according to one embodiment, are shown in Figure 9, while Figure 10 illustrates a set of messages corresponding to the flowchart of Figure 9 that are passed between a requester EP 1000, a fabric manager 1002, and a target EP 1004 to perform the unicast link setup task.
- the setup process begins in a block 900, wherein a requesting endpoint sends a query to the fabric manager requesting connection information about target endpoints matching specific attributes identified in the request. This message is depicted as a Query Request 1006 in Figure 10, which is sent from requesting EP 1000 to fabric manager 1002.
- Query Request 1006 includes a NumDevs parameter, an attributes parameter set, and a request identifier (ReqID).
- the fabric manager collects information about each device installed in a system managed by the FM. This is facilitated by well-known techniques provided by the PCI (and PCI Express) architecture. Each PCI Express device stores information about its various device attributes, including capabilities and/or services supported by the device.
- the attribute information identifies functionality that may be accessed by the PCI Express device, such as mass storage or communication capabilities (via corresponding protocol interfaces), for example.
- the attributes parameter set e.g., one or more attribute parameters in a list
- the attribute information is stored in a table structure 1100, as shown in Figure 11.
- the table structure 1100 corresponds to the lower 256 bits of an AS device's configuration space. It includes a device ID 1102, a vendor ID 1104, a class code 1106, a revision ID 1108, a subsystem ID 1110, a subsystem vendor ID 1112, a capability pointer 1114, and various reserved fields.
- the device ID 1102 comprises a 16-bit value assigned by the manufacturer of the device.
- the vendor ID 1104 is a 16-bit value assigned by PCI-SIG for each vendor that manufactures PCI Express-compliant devices.
- the class code 1106 is a 24-bit value that indicates the class of the device, as defined by PCI-SIG.
- the subsystem ID 1110 and subsystem vendor ID 1112 are analogous to the device ID 1102 and vendor ID 1104, except they are applicable for devices that include PCI-compliant subsystems.
- the capability pointer 1114 is an 8-bit field designated by the device vendor to indicate the location of the first PCI 2.3 capability record. For AS devices, this field contains a value between 40h and 0F8h.
- One of the capability records identifies the device as an AS device. In general, the capability records are used to provide information identifying services or capabilities provided by a device. The detailed capability information is stored in a separate configuration space (not shown).
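The identification fields listed above can be read from a raw configuration-space image. The sketch below assumes the standard PCI type 0 header byte offsets (vendor ID at 00h, device ID at 02h, revision ID at 08h, class code at 09h, subsystem vendor ID at 2Ch, subsystem ID at 2Eh, capability pointer at 34h); the source does not confirm that Figure 11 uses exactly this layout, and the function name `parse_header` is hypothetical.

```python
import struct
from typing import Dict


def parse_header(cfg: bytes) -> Dict[str, int]:
    """Extract identification fields from a configuration-space image.

    Offsets follow the standard PCI type 0 header (an assumption here).
    """
    if len(cfg) < 0x40:
        raise ValueError("need at least the first 64 bytes of configuration space")
    vendor_id, device_id = struct.unpack_from("<HH", cfg, 0x00)      # 16 bits each
    revision_id = cfg[0x08]                                          # 8 bits
    class_code = int.from_bytes(cfg[0x09:0x0C], "little")            # 24 bits
    subsys_vendor_id, subsys_id = struct.unpack_from("<HH", cfg, 0x2C)
    cap_ptr = cfg[0x34]                                              # 8 bits
    return {
        "vendor_id": vendor_id,
        "device_id": device_id,
        "revision_id": revision_id,
        "class_code": class_code,
        "subsystem_vendor_id": subsys_vendor_id,
        "subsystem_id": subsys_id,
        "capability_pointer": cap_ptr,
    }
```

A fabric manager could use the returned capability pointer as the starting point for walking the capability records, one of which identifies an AS device.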
- the NumDevs parameter indicates the number of devices the FM should return connection information for if one or more devices are determined to match the requested attributes. If the value is set to 1, connection information corresponding to the first match found will be returned. If the value is set to 0, connection information for each device found will be returned.
- Every time an endpoint sends out a request to the FM, it associates an ID with that request, as defined by the ReqID parameter.
- the FM returns that same ReqID when it replies to the request.
- the ID in the reply is matched to an ID in a requests table maintained by the EP.
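The ReqID matching described above amounts to a table of outstanding requests kept by the EP. A minimal sketch follows; the class `RequestTable`, the sequential ID allocation, and the dictionary request format are assumptions, since the source does not describe how IDs are chosen or how the table is stored.

```python
import itertools
from typing import Dict, Optional


class RequestTable:
    """Hypothetical per-endpoint table of requests awaiting an FM reply."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)   # sequential IDs are an assumption
        self._pending: Dict[int, dict] = {}

    def send(self, request: dict) -> int:
        """Record an outgoing request and return the ReqID associated with it."""
        req_id = next(self._next_id)
        self._pending[req_id] = request
        return req_id

    def match_reply(self, req_id: int) -> Optional[dict]:
        """Return and retire the request a reply belongs to, or None if unknown."""
        return self._pending.pop(req_id, None)
```

Retiring the entry on the first match means a duplicated or stale reply is recognized as unknown rather than processed twice.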
- Upon receiving a query request, the FM searches its configuration information to determine if any devices coupled to the fabric have attributes matching those contained in the request. In one embodiment, the FM maintains a table for each request. When a device having matching attributes is identified, a MatchInfo entry is added to the table. The MatchInfo entry contains connection information for a corresponding target EP, including a "turnpool" and a "turnpointer" (turnptr) value.
- AS provides a source-based routing mechanism called "turn pools" to enable flexible data routing in a variety of system topologies. Turn pools contain routing information that is relative to the system topology and provided by the source.
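The idea of source-based routing with a turn pool can be illustrated with a deliberately simplified model in which the source packs one turn value per switch hop into an integer and each switch reads its own field. This sketch does not reproduce the bit layout or pointer semantics of the AS specification; the function names, the field widths, and the direction in which the pointer moves are all assumptions.

```python
from typing import List, Tuple


def build_turnpool(turns: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Pack (turn_value, bit_width) pairs, one per switch hop, into a pool.

    Returns (turnpool, turnptr), where turnptr counts the bits still unread.
    """
    pool = 0
    used = 0
    for value, width in turns:
        if not 0 <= value < (1 << width):
            raise ValueError("turn value does not fit in its field")
        pool = (pool << width) | value   # first hop ends up in the highest bits
        used += width
    return pool, used


def next_turn(pool: int, turnptr: int, width: int) -> Tuple[int, int]:
    """Read this switch's turn from the pool and return it with the new pointer."""
    if width > turnptr:
        raise ValueError("turn pool exhausted")
    turnptr -= width
    return (pool >> turnptr) & ((1 << width) - 1), turnptr
```

Because the route is carried in the packet, a switch in this model needs no forwarding table; it only needs to know how many bits its own turn occupies.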
- the Fabric Manager replies in a block 902 via a Query Reply 1008, which indicates that either no match was found, or includes connection information for one or all target EPs matching the specified attributes (depending on the NumDevs parameter in Query Request 1006).
- the number of matching targets is identified by the NumDevs parameter in Query Reply 1008.
- the connection information for the one or more target EPs for which a match exists is contained in the DevsTable parameter.
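The query handling described above can be sketched as a filter over the FM's device records followed by a NumDevs cut-off. The record types, the attribute keys, and the dictionary used for the reply are hypothetical. Treating a NumDevs value greater than 1 as "return at most that many matches" is an assumption; the source spells out only the values 0 and 1.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MatchInfo:
    """Connection information for one target EP."""
    turnpool: int
    turnptr: int


@dataclass
class Device:
    attributes: Dict[str, int]   # attribute names here are hypothetical
    route: MatchInfo


def handle_query(devices: List[Device], wanted: Dict[str, int],
                 num_devs: int, req_id: int) -> dict:
    """Build a Query Reply for a Query Request."""
    matches = [d.route for d in devices
               if all(d.attributes.get(k) == v for k, v in wanted.items())]
    if num_devs > 0:             # 0 means "return every match"
        matches = matches[:num_devs]
    # The reply echoes the ReqID; an empty DevsTable stands for "no match found".
    return {"ReqID": req_id, "NumDevs": len(matches), "DevsTable": matches}
```

The requesting EP can then check `NumDevs` in the reply to decide whether to abort or to pick a target from `DevsTable`.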
- Upon receiving Query Reply 1008, the requesting EP extracts the connection information and selects a target EP in situations in which connection information for more than one target EP is returned in the query reply. If the reply indicates that no match was found, there are no targets that meet the requesting EP's requirements and the connection process aborts. Otherwise, in a block 904, the requesting EP sends a Connection Request 1010 directly to the target EP.
- the Connection Request includes the requester's attributes, along with connection attributes.
- connection Request 1010 Upon receipt of Connection Request 1010, the target EP extracts the attribute and connection data from the request. The target then determines if it can and/or is willing to accept the connection or not. For example, if the request specifies an unsupported packet size, a connection should be refused. Connection may also be refused for other reasons, such as for traffic policy considerations. If the connection is refused, the target EP returns a Connection Request Reply 1012 including information indicating an error has occurred. If the connection is accepted, the Connection Request Replay 1012 includes a connection identifier. These operations are shown in a block 906 of Figure 9.
- Connection Request Reply 1012 includes a pipe index or session ID, a sequence number, and the target EP's identifier. If the requester is going to be a writer (e.g., transmit data to be processed by the target EP), a pipe index is included in Connection Request Reply 1012. The pipe index serves as a connection identifier for the connection. If the requester is going to be a reader (e.g., it desires to receive data accessed via the target EP), a session ID is included in Connection Request Reply 1012. In one embodiment, the target EP's identifier is an extended unique identifier (EUI) (shown as T_EUI in Figure 10), as defined by the IEEE EUI-64 standard for global identifiers.
- EUI extended unique identifier
- An EUI is a 64-bit global identifier that is issued by the IEEE, and is used to uniquely identify a device. The target's EUI is used to notify the FM about the connection status (open/closed).
- Connection Request Reply 1012 further includes a sequence number (SeqNum), to provide a number the requesting EP should start with when sending its first packet.
- SeqNum sequence number
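To make the reply contents concrete, the sketch below models a successful Connection Request Reply. The class and function names are illustrative assumptions, as is the choice to carry the pipe index and session ID in separate optional fields; the source states only that a writer receives a pipe index, a reader receives a session ID, and both receive T_EUI and SeqNum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionRequestReply:
    t_eui: int                        # target EP's EUI-64 (T_EUI)
    seq_num: int                      # SeqNum the requester should start with
    pipe_index: Optional[int] = None  # present when the requester will write
    session_id: Optional[int] = None  # present when the requester will read


def build_reply(t_eui: int, seq_num: int, requester_is_writer: bool,
                connection_id: int) -> ConnectionRequestReply:
    """Build the reply for an accepted connection."""
    if not 0 <= t_eui < 2**64:
        raise ValueError("an EUI-64 must fit in 64 bits")
    if requester_is_writer:
        return ConnectionRequestReply(t_eui, seq_num, pipe_index=connection_id)
    return ConnectionRequestReply(t_eui, seq_num, session_id=connection_id)
```

Keeping the two identifiers in distinct fields lets the requester tell from the reply alone which role the target granted.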
- the Connection Acknowledgment includes the requesting EP's global identifier (R_EUI), which is used to notify the FM about the connection status (open/closed).
- If the requesting EP is going to be a writer, the Connection Acknowledgement includes the pipe index previously sent in Connection Request Reply 1012. If the requesting EP is going to be a reader, the session ID included in Connection Request Reply 1012 is returned in the Connection Acknowledgement.
- the Connection Acknowledgement may also include a sequence number (the same as SeqNum) that is incremented by 1, which is used to confirm the sequence number the requesting EP will start with when sending its first packet.
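The sequence-number confirmation can be illustrated with a short sketch. The names below are hypothetical, and the code assumes the acknowledgement echoes the connection identifier (pipe index or session ID) from the reply and carries SeqNum + 1. Wraparound is not handled because the source does not state the width of the sequence number.

```python
from dataclasses import dataclass


@dataclass
class ConnectionAck:
    r_eui: int          # requesting EP's EUI-64 (R_EUI)
    connection_id: int  # pipe index or session ID echoed from the reply
    seq_num: int        # SeqNum from the reply, incremented by 1


def build_ack(r_eui: int, connection_id: int, reply_seq_num: int) -> ConnectionAck:
    """Requester side: echo the identifier and confirm the sequence number."""
    return ConnectionAck(r_eui, connection_id, reply_seq_num + 1)


def ack_is_valid(ack: ConnectionAck, expected_connection_id: int,
                 sent_seq_num: int) -> bool:
    """Target side: check the echoed identifier and the incremented SeqNum."""
    return (ack.connection_id == expected_connection_id
            and ack.seq_num == sent_seq_num + 1)
```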
- In response to a Connection Acknowledgement 1014, the target EP returns a Connection Confirmation 1016 to the requesting EP in a block 910. If the requesting EP is going to be a writer, the Connection Confirmation includes a pipe index and a pipe offset (e.g., where the requester can start reading from or writing to). A pipe access key may also be provided for security purposes. If the requesting EP is going to be a reader, the session ID included in Connection Request Reply 1012 is returned in Connection Confirmation 1016.
- the FM is informed of the connection by sending an Add Connection message 1018 from the target EP to the FM, as depicted in a block 912.
- the Add Connection message may be sent from the requesting EP to the FM.
- the Add Connection message includes the EUI for each of the requesting and target EPs, as well as the requesting EP's pipe index or session ID (as appropriate) and the target EP's pipe index or session ID (as appropriate).
- the FM keeps a record of each peer-to-peer connection established in its fabric.
- When the FM receives an add connection notification, it creates a new entry in its connections table. This entry is removed when the FM receives a remove connection request or determines that one or both peers are no longer members of the fabric.
- the connections table is a dynamic data structure implemented as a linked list.
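A linked-list connections table of the kind described could be sketched as below. The class names, method names, and entry layout are assumptions based on the Add Connection fields listed above (both EUIs plus each side's pipe index or session ID); the patent does not give the FM's actual data layout.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ConnectionEntry:
    r_eui: int   # requesting EP's EUI
    t_eui: int   # target EP's EUI
    r_id: int    # requesting EP's pipe index or session ID
    t_id: int    # target EP's pipe index or session ID
    next: Optional["ConnectionEntry"] = None


class ConnectionsTable:
    """Singly linked list of peer-to-peer connections known to the FM."""

    def __init__(self) -> None:
        self.head: Optional[ConnectionEntry] = None

    def add(self, r_eui: int, t_eui: int, r_id: int, t_id: int) -> None:
        """Handle an Add Connection message by pushing a new entry."""
        self.head = ConnectionEntry(r_eui, t_eui, r_id, t_id, self.head)

    def _remove_if(self, match: Callable[[ConnectionEntry], bool]) -> int:
        removed = 0
        prev, node = None, self.head
        while node is not None:
            if match(node):
                # Unlink the node; prev stays where it is.
                if prev is None:
                    self.head = node.next
                else:
                    prev.next = node.next
                removed += 1
            else:
                prev = node
            node = node.next
        return removed

    def remove(self, r_eui: int, t_eui: int) -> int:
        """Handle a remove connection request for one pair of peers."""
        return self._remove_if(lambda e: e.r_eui == r_eui and e.t_eui == t_eui)

    def remove_peer(self, eui: int) -> int:
        """Drop every connection involving a peer that left the fabric."""
        return self._remove_if(lambda e: eui in (e.r_eui, e.t_eui))

    def __len__(self) -> int:
        count, node = 0, self.head
        while node is not None:
            count += 1
            node = node.next
        return count
```

Both removal paths share one traversal, so an explicit remove request and a peer leaving the fabric differ only in the predicate applied to each entry.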
- the routing topology of a given system may change. For example, new cards or boards may be added to a system using a hot install, or existing cards or boards may be removed.
- the FM may determine that a better path exists between the peer-to-peer connection participants. In response, the FM notifies both participants of the new path, providing the peer's EUI and the new turnpool and turnpointer to reach the peer, as depicted by a Path Update message 1020 and a block 914 in Figure 9.
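On the endpoint side, applying a Path Update amounts to replacing the stored route for the named peer. The sketch below assumes each endpoint keeps a per-peer map from EUI to a (turnpool, turnpointer) pair; the `Endpoint` and `PathUpdate` names and the integer encodings are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class PathUpdate:
    peer_eui: int      # EUI of the peer whose route changed
    turn_pool: int     # new turnpool to reach the peer
    turn_pointer: int  # new turnpointer to reach the peer


class Endpoint:
    def __init__(self) -> None:
        # Maps a connected peer's EUI to its current (turnpool, turnpointer).
        self.routes: Dict[int, Tuple[int, int]] = {}

    def open_connection(self, peer_eui: int, turn_pool: int, turn_pointer: int) -> None:
        self.routes[peer_eui] = (turn_pool, turn_pointer)

    def apply_path_update(self, update: PathUpdate) -> bool:
        """Replace the route for a connected peer; ignore updates for unknown peers."""
        if update.peer_eui not in self.routes:
            return False
        self.routes[update.peer_eui] = (update.turn_pool, update.turn_pointer)
        return True
```

Because the FM notifies both participants, each side would receive its own `PathUpdate` naming the other as the peer.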
- connections will/should be closed. For example, after a data transaction is completed, the requesting EP may desire to close the connection. There are also situations where connections will remain open between active uses. Connections may also be closed in response to detected conditions. In one embodiment, the same format is used when either an endpoint wishes to stop a peer-to-peer session or when the FM determines that one of the peers is no longer capable of participating in the connection.
- connection management techniques disclosed herein may be implemented in modular systems that employ serial-based interconnect fabrics, such as PCI Express components.
- PCI Express components may be employed in blade server systems and modular communication systems, such as ATCA systems.
- Typical blade server system and components are shown in Figures 13a and 13b.
- a rack-mounted chassis 1300 is employed to provide power and communication functions for a plurality of server blades (i.e., blades) 1302, each of which occupies a corresponding slot. (It is noted that all slots in a chassis do not need to be occupied.)
- one or more chassis 1300 may be installed in a blade server rack (not shown).
- Each blade is coupled to an interface plane 1304 (i.e., a backplane or mid-plane) upon installation via one or more mating connectors.
- the interface plane will include a plurality of respective mating connectors that provide power and communication signals to the blades.
- many interface planes provide "hot-swapping" functionality - that is, through appropriate power and data signal buffering, blades can be added or removed (“hot-swapped”) on the fly, without taking the entire chassis down.
- a typical mid-plane interface plane configuration is shown in Figures 13a and 13b.
- the backside of interface plane 1304 is coupled to one or more power supplies 1306.
- the power supplies are redundant and hot-swappable, being coupled to appropriate power planes and conditioning circuitry to enable continued operation in the event of a power supply failure.
- a plurality of cooling fans 1308 are employed to draw air through the chassis to cool the server blades.
- the illustrated blade server further includes one or more switch fabric cards 1310, each of which is coupled to interface plane 1304, and a management switch card 112 that is coupled to the backside or frontside of the interface plane.
- a switch fabric card is used to perform switching operations for the serial-based interconnect fabric.
- the management switch card provides a management interface for managing operations of the individual blades.
- the management card may also function as a control card that hosts an FM.
- An exemplary ATCA chassis 1400 and ATCA board 1402 are shown in Figures 14a and 14b.
- the ATCA chassis is somewhat similar to a blade server chassis, and includes a connection plane (not shown) via which one or more ATCA boards 1402 may be coupled by inserting the board(s) into respective chassis slots.
- the connection plane (a.k.a., backplane) supports data routing between PCI Express devices. In one embodiment, two slots are reserved for switching boards.
- the ATCA specification supports various types of fabric topologies, such as star, dual-star, and meshes.
- FIG. 14b shows an exemplary ATCA board 1402.
- the ATCA board includes a mainboard 1404 comprising a printed circuit board (PCB) to which various components are coupled.
- PCB printed circuit board
- processors 1406 and 1408 are illustrative of various types of processing units, including but not limited to CPUs, NPUs, microcontrollers, and co-processors.
- Various connectors are coupled to mainboard 1404 for power distribution and input/output (I/O) functions. These include a backplane data connector 1422 and power input connectors 1424 and 1426, which are configured to couple to the backplane, as well as universal serial bus (USB) connectors 1428 and 1430 and a network connector 1432, which are mounted to a front panel 1434.
- I/O input/output
- an ATCA board may include additional components. Such additional components are exemplified by a disk drive 1436 and a daughterboard 1438.
- the ATCA board may also provide mezzanine expansion slots.
- AS fabrics may be employed for both compute and communication ecosystems.
- An exemplary communications implementation is shown in Figure 15.
- Exemplary boards employed in the implementation include a pair of line cards 1500A and 1500B, a pair of switch cards 1502A and 1502B, and a control card 1504.
- Switch cards 1502A and 1502B are representative of an AS fabric 1503.
- Each of line cards 1500A and 1500B includes a framer, media access control (MAC) component, and physical layer (PHY) component, collectively depicted as a component 1506 for convenience.
- the line cards further include a CPU 1508, coupled to memory 1510 and a local line card AS switch element 1512, and a NPU 1514, coupled to memory 1516 and AS switch element 1512.
- component 1506, CPU 1508, and NPU 1514 are coupled to AS switch element 1512 via respective AS links 1518, 1520, and 1522.
- Switch cards 1502A and 1502B are used to support the AS switch fabric functionality. This is facilitated by AS switch elements 1524.
- Control card 1504 is used to manage the AS switch fabric by controlling the switching operation of switch cards 1502A and 1502B, and includes a CPU sub-system 1526 and memory 1528. In one embodiment, the functionality depicted as being performed by control card 1504 is performed by one of switch cards 1502A or 1502B.
- CPU sub-system 1526 and memory 1528 are illustrative of fabric manager host circuitry that is used to run the fabric manager software components.
- Each of line cards 1500A and 1500B is connected to AS fabric 1503 via a respective AS link 1530A and 1530B.
- control card 1504 is connected to AS fabric 1503 via an AS link 1532.
- Each of line cards 1500A and 1500B functions as an endpoint device 312.
- the software components for an EP device, comprising an instance of EP component 302 and AS driver component 306, are loaded into memory 1510 and executed on CPU 1508 (in conjunction with an operating system running on CPU 1508).
- the EP device software components may be stored on a given line card using a persistent storage device, such as but not limited to a disk drive, a read-only memory, or a non-volatile memory (e.g., flash device), which are collectively depicted as storage 1534.
- one or more of the software components may comprise a carrier wave that is loaded into memory 1510 via a network.
- Either control card 1504 (if used to manage the AS fabric) or one of switch cards 1502A or 1502B (if including the equivalent functionality depicted for control card 1504) is used to function as a PFM device 308.
- the PFM device software components, including an instance of EP component 302, PFM component 300, and AS driver 306, are loaded into memory 1528.
- the PFM device software components are stored in a persistent storage device, depicted as storage 1536.
- one or more of the PFM device software components are loaded into memory 1528 via a network.
- the code (e.g., instructions) and data that are executed to perform the endpoint, PFM, and SFM operations comprise software elements executed upon some form of processing core (such as the CPU) or otherwise implemented or realized upon or within a machine-readable medium.
- a machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer).
- a machine-readable medium can include a read-only memory (ROM), a random access memory (RAM), magnetic disk storage media, optical storage media, a flash memory device, etc.
- a machine-readable medium can include propagated signals, such as electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP05767737A EP1790134A1 (en) | 2004-06-30 | 2005-06-28 | Advanced switching peer-to-peer protocol |
KR1020067027729A KR100871922B1 (en) | 2004-06-30 | 2005-06-28 | Advanced switching peer-to-peer protocol |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/882,902 | 2004-06-30 | ||
US10/882,902 US20060004837A1 (en) | 2004-06-30 | 2004-06-30 | Advanced switching peer-to-peer protocol |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006004780A1 true WO2006004780A1 (en) | 2006-01-12 |
Family
ID=35262039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2005/022975 WO2006004780A1 (en) | 2004-06-30 | 2005-06-28 | Advanced switching peer-to-peer protocol |
Country Status (5)
Country | Link |
---|---|
US (1) | US20060004837A1 (en) |
EP (1) | EP1790134A1 (en) |
KR (1) | KR100871922B1 (en) |
CN (1) | CN100413274C (en) |
WO (1) | WO2006004780A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007117944A1 (en) * | 2006-03-31 | 2007-10-18 | Intel Corporation | Backplane interconnection system and method |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7447233B2 (en) * | 2004-09-29 | 2008-11-04 | Intel Corporation | Packet aggregation protocol for advanced switching |
US7343434B2 (en) * | 2005-03-31 | 2008-03-11 | Intel Corporation | Buffer management within SLS (simple load store) apertures for inter-endpoint communication in advanced switching fabric |
US7492710B2 (en) * | 2005-03-31 | 2009-02-17 | Intel Corporation | Packet flow control |
US7526570B2 (en) * | 2005-03-31 | 2009-04-28 | Intel Corporation | Advanced switching optimal unicast and multicast communication paths based on SLS transport protocol |
US7496797B2 (en) * | 2005-03-31 | 2009-02-24 | Intel Corporation | Advanced switching lost packet and event detection and handling |
US7698484B2 (en) * | 2005-09-21 | 2010-04-13 | Ricoh Co., Ltd. | Information processor configured to detect available space in a storage in another information processor |
US8189603B2 (en) * | 2005-10-04 | 2012-05-29 | Mammen Thomas | PCI express to PCI express based low latency interconnect scheme for clustering systems |
US8141148B2 (en) | 2005-11-28 | 2012-03-20 | Threatmetrix Pty Ltd | Method and system for tracking machines on a network using fuzzy GUID technology |
US8763113B2 (en) | 2005-11-28 | 2014-06-24 | Threatmetrix Pty Ltd | Method and system for processing a stream of information from a computer network using node based reputation characteristics |
US20070239869A1 (en) * | 2006-03-28 | 2007-10-11 | Microsoft Corporation | User interface for user presence aggregated across multiple endpoints |
US7945612B2 (en) * | 2006-03-28 | 2011-05-17 | Microsoft Corporation | Aggregating user presence across multiple endpoints |
US9241038B2 (en) * | 2006-05-23 | 2016-01-19 | Microsoft Technology Licensing, Llc | User presence aggregation at a server |
US8800008B2 (en) | 2006-06-01 | 2014-08-05 | Intellectual Ventures Ii Llc | Data access control systems and methods |
US9444839B1 (en) * | 2006-10-17 | 2016-09-13 | Threatmetrix Pty Ltd | Method and system for uniquely identifying a user computer in real time for security violations using a plurality of processing parameters and servers |
CN101110768B (en) * | 2007-06-20 | 2010-10-06 | 杭州华三通信技术有限公司 | Data communication method, system, master control card and cable fastener |
EP2486487B1 (en) | 2009-10-07 | 2014-12-03 | Hewlett Packard Development Company, L.P. | Notification protocol based endpoint caching of host memory |
US8321617B1 (en) * | 2011-05-18 | 2012-11-27 | Hitachi, Ltd. | Method and apparatus of server I/O migration management |
US8954481B2 (en) * | 2012-05-09 | 2015-02-10 | International Business Machines Corporation | Managing the product of temporary groups in a community |
JP5796139B2 (en) * | 2012-10-26 | 2015-10-21 | 華為技術有限公司Huawei Technologies Co.,Ltd. | PCIE switch-based server system, switching method, and device |
JP6155500B2 (en) * | 2014-02-12 | 2017-07-05 | APRESIA Systems株式会社 | Relay device |
CN105743960B (en) * | 2015-07-20 | 2019-09-06 | 浪潮(北京)电子信息产业有限公司 | The management method and device of session connection |
DE112015006969T5 (en) * | 2015-09-25 | 2018-11-08 | Intel Corporation | Communication between integrated circuit assemblies using a millimeter-wave wireless fabric |
CN107800639B (en) * | 2016-09-06 | 2020-04-14 | 华为技术有限公司 | Switching device, switching device group, data transmission method and computer system |
CN109299534B (en) * | 2018-09-20 | 2023-07-25 | 深圳市一博科技股份有限公司 | Modeling method and device for printed circuit board |
US10880371B2 (en) | 2019-03-05 | 2020-12-29 | International Business Machines Corporation | Connecting an initiator and a target based on the target including an identity key value pair and a target characteristic key value pair |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6687240B1 (en) * | 1999-08-19 | 2004-02-03 | International Business Machines Corporation | Transaction routing system |
US7023851B2 (en) * | 2000-10-12 | 2006-04-04 | Signafor, Inc. | Advanced switching mechanism for providing high-speed communications with high Quality of Service |
CN1300721C (en) * | 2002-03-21 | 2007-02-14 | 重庆大学 | Method for realizing peer-to-peer network system architecture |
US7051102B2 (en) * | 2002-04-29 | 2006-05-23 | Microsoft Corporation | Peer-to-peer name resolution protocol (PNRP) security infrastructure and method |
US20040054781A1 (en) * | 2002-07-30 | 2004-03-18 | Heng-Chien Chen | Method for establishing point to point or point to multiple points internet connection(s) |
JP3973548B2 (en) * | 2002-12-10 | 2007-09-12 | 株式会社ソニー・コンピュータエンタテインメント | Network system, network connection establishment method, network terminal, computer program, and recording medium storing program |
CN1506866A (en) * | 2002-12-12 | 2004-06-23 | 上海科星自动化技术有限公司 | Reciprocal network suitable for offices |
US7447208B2 (en) * | 2003-08-04 | 2008-11-04 | Intel Corporation | Configuration access mechanism for packet switching architecture |
US7259961B2 (en) * | 2004-06-24 | 2007-08-21 | Intel Corporation | Reconfigurable airflow director for modular blade chassis |
-
2004
- 2004-06-30 US US10/882,902 patent/US20060004837A1/en not_active Abandoned
-
2005
- 2005-06-28 WO PCT/US2005/022975 patent/WO2006004780A1/en not_active Application Discontinuation
- 2005-06-28 KR KR1020067027729A patent/KR100871922B1/en not_active IP Right Cessation
- 2005-06-28 EP EP05767737A patent/EP1790134A1/en not_active Withdrawn
- 2005-06-30 CN CNB2005100913218A patent/CN100413274C/en not_active Expired - Fee Related
Non-Patent Citations (3)
Title |
---|
CLARK J A ET AL: "BANDWIDTH-ON-DEMAND NETWORKS - SOLUTION TO PEER-TO-PEER FILE SHARING", BT TECHNOLOGY JOURNAL, BT LABORATORIES, GB, vol. 20, no. 1, January 2002 (2002-01-01), pages 53 - 63, XP001108675, ISSN: 1358-3948 * |
GRUBER I ET AL: "PROTOCOL FOR PEER-TO-PEER NETWORKING IN MOBILE ENVIRONMENTS", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS AND NETWORKS, 20 October 2003 (2003-10-20), pages 1 - 7, XP002291691 * |
ORAM ANDY (ED): "Peer-to-peer: Harnessing the Benefits of a Disruptive Technology passage", PEER-TO-PEER: HARNESSING THE BENEFITS OF A DISRUPTIVE TECHNOLOGY, 15 March 2001 (2001-03-15), pages 94 - 122, XP002259974 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007117944A1 (en) * | 2006-03-31 | 2007-10-18 | Intel Corporation | Backplane interconnection system and method |
US7631133B2 (en) | 2006-03-31 | 2009-12-08 | Intel Corporation | Backplane interconnection system and method |
Also Published As
Publication number | Publication date |
---|---|
EP1790134A1 (en) | 2007-05-30 |
KR100871922B1 (en) | 2008-12-05 |
US20060004837A1 (en) | 2006-01-05 |
CN100413274C (en) | 2008-08-20 |
CN1744546A (en) | 2006-03-08 |
KR20070034537A (en) | 2007-03-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2006004780A1 (en) | Advanced switching peer-to-peer protocol | |
CN102124449B (en) | Method and system for low-overhead data transfer | |
US10148567B2 (en) | System and method for supporting SMA level handling to ensure subnet integrity in a high performance computing environment | |
US6760783B1 (en) | Virtual interrupt mechanism | |
US11469964B2 (en) | Extension resource groups of provider network services | |
US8713295B2 (en) | Fabric-backplane enterprise servers with pluggable I/O sub-system | |
US7103626B1 (en) | Partitioning in distributed computer system | |
US8743872B2 (en) | Storage traffic communication via a switch fabric in accordance with a VLAN | |
US7525957B2 (en) | Input/output router for storage networks | |
JP4242420B2 (en) | Resource sharing independent of OS on many computing platforms | |
TWI357561B (en) | Method, system and computer program product for vi | |
TWI331281B (en) | Method and apparatus for shared i/o in a load/store fabric | |
US7990994B1 (en) | Storage gateway provisioning and configuring | |
US8160070B2 (en) | Fibre channel proxy | |
US8412860B2 (en) | Input/output (I/O) virtualization system | |
US20060271722A1 (en) | Managing transmissions between devices | |
US7688715B2 (en) | Apparatus for providing shelf manager having duplicate ethernet port in ATCA system | |
TW200530837A (en) | Method and apparatus for shared I/O in a load/store fabric | |
JP2004531175A (en) | End node partition using local identifier | |
US7136907B1 (en) | Method and system for informing an operating system in a system area network when a new device is connected | |
US20220350767A1 (en) | Flexible high-availability computing with parallel configurable fabrics | |
US7350014B2 (en) | Connecting peer endpoints | |
JP4855669B2 (en) | Packet switching for system power mode control | |
JP2003196254A (en) | Management of one or more domains in system | |
US11386026B1 (en) | Shell PCIe bridge and shared-link-interface services in a PCIe system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase | Ref document number: 2005767737 Country of ref document: EP |
WWE | Wipo information: entry into national phase | Ref document number: 1020067027729 Country of ref document: KR |
NENP | Non-entry into the national phase | Ref country code: DE |
WWW | Wipo information: withdrawn in national office | Country of ref document: DE |
WWP | Wipo information: published in national office | Ref document number: 1020067027729 Country of ref document: KR |
WWP | Wipo information: published in national office | Ref document number: 2005767737 Country of ref document: EP |