Architecture of Parallel Computers, CSC / ECE 506

Architecture of Parallel Computers
CSC / ECE 506
OpenFabrics Alliance
Lecture 18, 7/17/2006
Dr Steve Hunter

Outline
- InfiniBand and Ethernet review
- DDP and RDMA
- OpenFabrics Alliance
- IP over InfiniBand (IPoIB)
- Sockets Direct Protocol (SDP)
- Network File System (NFS)
- SCSI RDMA Protocol (SRP)
- iSCSI Extensions for RDMA (iSER)
- Reliable Datagram Sockets (RDS)

InfiniBand Goals - Review
- Interconnect for server I/O and efficient interprocess communication
- Standard across the industry, backed by all the major players (200+ companies)
- With an architecture able to match future systems:
  - Low overhead
  - Scalable bandwidth, up and down
  - Scalable fanout, from a few to thousands
  - Low cost, excellent price/performance
  - Robust reliability, availability, and serviceability
  - Leverages the Internet Protocol suite and paradigms

The Basic Unit: an IB Subnet - Review
- The basic whole IB system is a subnet
- Elements:
  - Endnodes
  - Links
  - Switches
- What it does: lets endnodes communicate with endnodes via message queues, which process messages over several transport types. Messages are segmented and reassembled (SARed) into packets, which are placed on links and routed by switches.

[Figure: IB subnet diagram with labels End Node, Switch, and Links]
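The segmentation and reassembly (SAR) step named on the IB subnet slide can be sketched roughly as follows. This is a minimal illustration, not the InfiniBand wire format: the `Packet` fields (`msg_id`, `seq`, `last`), the function names, and the 2048-byte MTU in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; real IB headers differ."""
    msg_id: int     # which message this packet belongs to
    seq: int        # position of this packet within the message
    last: bool      # True on the final packet of the message
    payload: bytes


def segment(msg_id: int, message: bytes, mtu: int) -> List[Packet]:
    """Split one message into packets carrying at most `mtu` payload bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    # An empty message still produces a single empty packet.
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]
    return [
        Packet(msg_id, seq, seq == len(chunks) - 1, chunk)
        for seq, chunk in enumerate(chunks)
    ]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message, tolerating out-of-order arrival."""
    ordered = sorted(packets, key=lambda p: p.seq)
    complete = (
        bool(ordered)
        and ordered[-1].last
        and [p.seq for p in ordered] == list(range(len(ordered)))
    )
    if not complete:
        raise ValueError("message is missing packets")
    return b"".join(p.payload for p in ordered)
```

For example, `segment(1, data, 2048)` on a 5000-byte message yields three packets, and `reassemble` returns the original bytes even if the packets are passed in reverse order.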

End Node Attachment to IB - Review
End nodes attach to IB via Channel Adapters:
- Host CAs (HCAs)
  - O/S API/KPIs not specified
  - Queues and memory accessible via verbs
  - QP, CQ, and RDMA engines
  - Must support three IB transports
  - Can include:
    - Dual ports: load balancing, availability (path migration); attach to same or different subnets
    - Partitioning
    - Atomics
- Target CAs (TCAs)
  - Queue access method is vendor unique
  - QP and CQ engines
  - Need only support Unreliable Datagram
  - ULP can be standard or proprietary
  - In other words: a smaller subset of required functions

InfiniBand Summary
- InfiniBand architecture is a very high performance, low latency interconnect technology based on an industry-standard approach to Remote Direct Memory Access (RDMA)
- An InfiniBand fabric is built from hardware and software that are configured, monitored, and operated to deliver a variety of services to users and applications
- Characteristics that differentiate it from comparable interconnects such as traditional Ethernet include:
  - End-to-end reliable delivery
  - Scalable bandwidths from 10 to 60 Gbps available today, moving to 120 Gbps in the near future
  - Scalability without performance degradation
  - Low latency between devices
  - Greatly reduced server CPU utilization for protocol processing
  - Efficient I/O channel architecture for network and storage virtualization

Advanced Ethernet - Review
[Figure: TCP/IP model with examples]
Layer         Examples
Application   HTTP, SMTP, FTP
Transport     TCP, UDP
Network       IP
Link          Ethernet
Physical      Copper, Optical

[Figure: iSER / RNIC model shown with SCSI application; labels include RDMA NIC (RNIC) and SCSI]
- Layers, top to bottom: SCSI app, Internet SCSI (iSCSI), iSCSI Extensions for RDMA (iSER), Remote Direct Memory Access Protocol (RDMAP), Direct Data Placement (DDP), Markers with PDU Alignment (MPA), Transmission Control Protocol (TCP), Internet Protocol (IP), Media Access Control (MAC), Physical
- Service groupings labeled in the figure: SCSI Service, RDMA Service, TCP Service, IP Service, MAC Service
- It is expected that the OpenFabrics effort (i.e., the OpenIB / OpenRDMA merger) will enable even more advanced functions in NIC technology

Advanced Ethernet Summary
- The iWARP technology, implemented as an RDMA Network Interface Card (RNIC), achieves zero-copy, RDMA, and protocol offload over existing TCP/IP networks
- It was demonstrated that a 10 GbE based RNIC can reduce CPU processing overhead from 80-90% to less than 10% compared with its host stack equivalent
- Additionally, its achievable end-to-end latency is now 5 microseconds or less
- iWARP, together with the emerging low-latency (low hundreds of nanoseconds) 10 GbE switches, can also provide a powerful infrastructure for clustered computing, server-to-server processing, visualization, and file systems
- The advantages of the iWARP technology include its ability to leverage the widely deployed TCP/IP infrastructure, its broad knowledge base, and its mature management and monitoring capabilities
- In addition, an iWARP infrastructure is a routable infrastructure, which eliminates the need for gateways to connect to the LAN or WAN internet

DDP and RDMA
- IETF RFC: http:/
- The central idea of general-purpose DDP is that a data sender supplements the data it sends with placement information that allows the receiver's network interface to place the data directly at its final destination without any copying
- DDP can be used to steer received data to its final destination without requiring layer-specific behavior for each different layer. Data sent with such DDP information is said to be tagged.
- The central components of the DDP architecture are the "buffer", which is an object with beginning and ending addresses, and a method, set(), which sets the value of an octet at an address
- In many cases a buffer corresponds directly to a portion of host user memory. However, DDP does not depend on this; a buffer could be a disk file, or anything else that can be viewed as an addressable collection of octets.

DDP and RDMA
- Remote Direct Memory Access (RDMA) extends the capabilities of DDP with two primary functions
- It adds the ability to read from buffers registered to a socket (RDMA Read). This allows a client protocol to perform arbitrary, bidirectional data movement without involving the remote client. When RDMA is implemented in hardware, arbitrary data movement can be performed without involving the remote host CPU at all.
- RDMA specifies a transport-independent untagged message service (Send) with characteristics that are both very efficient to implement in hardware and convenient for client protocols
- The RDMA architecture is patterned after the traditional model for device programming, where the client requests an operation using Send-like actions (programmed I/O), and the server performs the necessary data transfers for the operation (DMA reads and writes) and notifies the client of completion. The programmed I/O + DMA model efficiently supports a high degree of concurrency and flexibility for both the client and server, even when operations have a wide range of intrinsic latencies.

OpenFabrics Alliance
- The OpenFabrics Alliance is an international organization comprised of industry, academic, and research groups that have developed a unified core of open source software stacks (OpenSTAC) leveraging RDMA architectures for both the Linux and Windows operating systems over both InfiniBand and Ethernet
- RDMA is a communications technique that allows data to be transmitted from the memory of one computer to the memory of another computer without passing through either device's CPU, without needing extensive buffering, and without calling into an operating system kernel
- The core OpenSTAC software supports all the well-known standard upper layer protocols such as MPI, IP, SDP, NFS, SRP, iSER, and RDS on top of Ethernet and InfiniBand (IB) infrastructures
- The OpenFabrics software and supporting services better enable low-latency InfiniBand and 10 GbE to deliver clustered computing, server-to-server processing, visualization, and file system access

OpenFabrics Software Stack
[Figure: OpenFabrics software stack, listed bottom to top. Key: Common, InfiniBand, iWARP]
- Hardware: InfiniBand HCA, iWARP R-NIC
- Provider: Hardware Specific Driver (shown for each adapter)
- Mid-Layer (kernel space): InfiniBand Verbs / API, R-NIC Driver API, Connection Manager (InfiniBand and iWARP), Connection Manager Abstraction (CMA), MAD, SA Client, SMA
- Upper Layer Protocol (kernel space): IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, Cluster File Sys
- User APIs (user space): User Level Verbs / API, UDAPL, SDP Library, User Level MAD API, Open SM, Diag Tools
- Application Level
- Apps and access methods for using the OF stack: Clustered DB Access (Oracle 10g RAC), Sockets Based Access (IBM DB2), Various MPIs, Access to File Systems, Block Storage Access, IP Based App Access

IP over IB (IPoIB)
- IETF standard for mapping Internet protocols to InfiniBand (IETF IPoIB Working Group)
- Covers:
  - Fabric initialization
  - Multicast/broadcast
  - Address resolution (IPv4/IPv6)
  - IP datagram encapsulation (IPv4/IPv6)
  - MIBs

IP over IB (IPoIB)
Communication parameters
- Obtained from the Subnet Manager (SM):
  - P_Key (Partition Key)
  - SL (Service Level)
  - Path rate
  - Link MTU (for IPv6, can be reduced with a router advertisement)
  - GRH parameters: TClass, Flow Label, HopLimit
- Obtained from address resolution:
  - Data link layer address (GID)
    - A persistent data link layer address is necessary
    - Enables IB routers to be deployed eventually
  - QPN (queue pair number)

IP over IB (IPoIB)
- Address resolution
  - IPv4: the ARP request is sent on the broadcast MGID; the ARP reply is unicast back and contains the GID and QPN
  - IPv6: neighbor discovery using the all-IP-hosts multicast address
  - Existing RFCs
- Summary
  - Feels like Ethernet with a 2 KB MTU
  - Doesn't utilize most of InfiniBand's custom hardware, e.g., SAR, Reliable Transport, Zero Copy, RDMA Reads/Writes, Kernel Bypass
  - SDP is the enhanced version

Sockets Direct Protocol (SDP)
- Based on Microsoft's Winsock Direct Protocol
- SDP feature summary:
  - Maps sockets SOCK_STREAM to RDMA semantics
  - Optimizations for transaction-oriented protocols
  - Optimizations for mixing of small and large messages
- Uses advanced InfiniBand features:
  - Reliable Connected (RC) service
  - Uses RDMA Writes, Reads, and Sends
  - Supports Automatic Path Migration

SDP Terminology
- Data Source: side of the connection that is sourcing the ULP data to be transferred
- Data Sink: side of the connection that is receiving (sinking) the ULP data
- Data Transfer Mechanism: used to move ULP data from Data Source to Data Sink (e.g., Bcopy, Receiver Initiated Zcopy, Read Zcopy)
- Flow Control Mode: state that the half connection is currently in (Combined, Pipelined, Buffered)
- Bcopy Threshold: if the message length is under the threshold, use the Bcopy mechanism. The threshold is locally defined.

SDP Modes
Flow control modes restrict data transfer mechanisms:
- Buffered Mode: used when the receiver wishes to force all transfers to use the Bcopy mechanism
- Combined Mode: used when the receiver is not pre-posting buffers and uses the peek/select interface (Bcopy or Read Zcopy, only one outstanding)
- Pipelined Mode: highly optimized transfer mode with multiple write or read buffers outstanding; can use all data transfer mechanisms (Bcopy, Read Zcopy, Receiver Initiated Write Zcopy)

SDP Terminology
- Enables buffer-copy when:
  - Transfer is short
  - Application needs buffering
- Enables zero-copy when:
  - Transfer is long

Network File System (NFS)
- Network File System (NFS) is a protocol originally developed by Sun Microsystems in 1984 and defined in RFCs 1094, 1813, and 3530 (obsoletes 3010) as a distributed file system that allows a computer to access files over a network as easily as if they were on its local disks
- NFS is one of many protocols built on the Open Network Computing Remote Procedure Call system (ONC RPC)
- Version 2 of the protocol originally operated entirely over UDP and was meant to keep the protocol stateless, with locking (for example) implemented outside of the core protocol
- Version 3 added:
  - support for 64-bit file sizes and offsets, to handle files larger than 4 GB
  - support for asynchronous writes on the server, to improve write performance
  - additional file attributes in many replies, to avoid the need to refetch them
  - a READDIRPLUS operation, to get file handles and attributes along with file names when scanning a directory
  - assorted other improvements

Network File System (NFS)
- Version 4 (RFC 3530): influenced by AFS and CIFS, includes performance improvements, mandates strong security, and introduces a stateful protocol. Version 4 was the first version developed with the Internet Engineering Task Force (IETF) after Sun Microsystems handed over the development of the NFS protocols.
- Various side-band protocols have been added to NFS, including:
  - The byte-range advisory Network Lock Manager (NLM) protocol, which was added to support System V UNIX file locking APIs
  - The remote quota reporting (RQUOTAD) protocol, which allows NFS users to view their data storage quotas on NFS servers
- WebNFS is an extension to Version 2 and Version 3 that allows NFS to be more easily integrated into Web browsers and to operate through firewalls

SCSI RDMA Protocol (SRP)
- SRP defines a SCSI protocol mapping onto the InfiniBand Architecture and/or functionally similar cluster protocols
- The RDMA Consortium voted to create iSER instead of porting SRP to IP
- SRP doesn't have a wide following
- SRP doesn't have a discovery or management protocol
- Version 2 of SRP hasn't been updated for 1.5 years

iSCSI Extensions for RDMA (iSER)
- iSER combines SRP and iSCSI with new RDMA capabilities
- iSER is maintained as part of iSCSI in the IETF
- Recently extended to IB by IBM, Voltaire, HP, EMC, and others
- Benefits of adding iSER to IB:
  - Uses the same (almost) storage protocol across all RDMA networks
    - Easier to train staff
    - Bridging products are more straightforward
    - Motivates the storage community toward an iSCSI/iSER mentality and may help with acceptance on IP
  - Desire for a common discovery and management protocol across iSCSI, iSER/iWARP, and IP, i.e., the same management and discovery process and software to handle IP networks and IB networks

iSCSI Extensions for RDMA (iSER)
- iSCSI's main performance deficiencies stem from TCP/IP:
  - TCP is a complex protocol requiring significant processing
  - Stream based, making it hard to separate data and headers
  - Requires copies that increase latency and CPU overhead
  - Using checksums requires additional CRCs in the ULP
- iSER eliminates the bottlenecks through:
  - Zero copy using RDMA
  - CRC calculated by hardware
  - Working with message boundaries instead of streams
  - Transport protocol implemented in hardware (minimal CPU cycles per I/O)

iSCSI Extensions for RDMA (iSER)
- iSER leverages iSCSI management, discovery, and RAS:
  - Zero-configuration, discovery, and global storage name server (SLP, iSNS)
  - Change notifications and active monitoring of devices and initiators
  - High availability and 3 levels of automated recovery
  - Multi-pathing and storage aggregation
  - Industry-standard management interfaces (MIB)
  - 3rd party storage managers
  - Security: partitioning, authentication, central login control, etc.
- Working with iSER over IB doesn't require any changes
- Focused effort from both communities
- More advanced than SRP

iSCSI Extensions for RDMA (iSER)
- iSCSI specification: http://www.ietf.org/rfc/rfc3720.txt
- iSER and DA introduction: http://www.rdmaconsortium.org/home/iSER_DA_intro.pdf
- iSER specification: http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-05.txt
- iSER over IB overview: http:/

Reliable Datagram Sockets (RDS)
