Asynchronous vs. Synchronous Design Techniques for NoCs

Robert Mullins
“The Status of the Network-on-Chip Revolution: Design Methods, Architectures and Silicon Implementation” (Tutorial), International Symposium on System-on-Chip, Tampere, Finland, November 14th, 2005.

Aims of Tutorial
- Highlight the wide range of system timing alternatives for NoCs
- Discuss the impact of the choice of timing regime on the architecture of NoC routers
- Contrast different approaches

Synchronous to Delay-Insensitive Approaches to System Timing
[Figure: a spectrum of timing assumptions, from global (synchronous) to none (delay-insensitive). Intermediate approaches: multiple clocks; pausible clocks and locally triggered clock pulses (local clocks that interact with data, becoming aperiodic); bundled data; quasi-delay-insensitive. Other labels: sub-system, local, local relative, isochronic forks, wire delay, less detection.]

System Timing
- Approaches to system timing are distinguished by what delay assumptions they make
- A number of different approaches to system timing may also be combined, e.g. Globally-Asynchronous Locally-Synchronous (GALS): synchronous IP interconnected by an asynchronous network

Synchronous On-Chip Networks

[Figure: Generic on-chip router]

Synchronous Router Pipeline
- The router pipeline may have many stages, which:
  - increases communication latency
  - can make packet buffers less effective
  - incurs pipelining overheads

Speculative Router Architecture
- VC and switch allocation may be performed concurrently:
  - speculate that waiting packets will be successful in acquiring a VC
  - prioritize non-speculative requests over speculative ones
Li-Shiuan Peh and William J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers”, In Proceedings HPCA01, 2001.

Single Cycle Speculative Router
R. D. Mullins, A. West and S. W. Moore, “Low-Latency Virtual-Channel
Routers for On-Chip Networks”, In Proceedings ISCA04.
- A single-cycle router is made possible by the use of speculation
- The clock period is almost unchanged compared to a pipelined design: approx. 30 FO4 (simple standard-cell design)
- The presence of a clock simplifies the design:
  - Arbitration: fast combinational matrix arbiters, which can easily be extended to handle priority traffic etc.
  - Speculation: aided by the clear notion of a clock “cycle”; simple abort logic (abort detection and actual abort)
- Lochside chip (2004):
  - 4x4 mesh network, 25 mm²
  - single-cycle routers (router + link = 1 clock), giving low common-case latency
  - 4 virtual channels per input
  - 80-bit links: 64-bit data + 16-bit control
  - 250 MHz (worst-case PVT), 16 Gb/s per channel, 0.18 µm
  - [Figure labels: tile, traffic generator, debug & test, router (R)]
R. D. Mullins, A. West and S. W. Moore, “The design and implementation of a low-latency on-chip network”, In Proceedings ASP-DAC06.

Beyond a Single Global Clock

Limitations of Fully-Synchronous Networks
1. Difficult to distribute the clock
   - The network is spread over the die and may have an irregular layout
   - Minimising skew costs complexity and power
   - Alternatives/extensions to PLL and H-tree: clock deskewing techniques; Distributed Clock Generator (DCG); distributed PLLs; standing-wave oscillators and rotary clock schemes; resonant global clocks, optical clock distribution, etc.
2. Single network clock frequency
   - Communicating synchronous IP blocks may operate at different and potentially adaptive clock frequencies
   - What is the most appropriate network clock frequency? We don't want to have to generate and distribute a very high frequency clock in order to emulate an asynchronous network

Frequency Distribution
- Clock skew may force the system to be partitioned into multiple clock domains
- Can exploit the fact that only the phase of each router's clock differs: simple, error-free clock-domain crossing is possible (single clock source)

Router clocks derived from a single source
- Each router's clock may be generated from the global network clock, either by clock division or by clock multiplication
- Clock-domain crossing techniques can exploit known clock frequency relationships
Chakraborty and M. Greenstreet, “Efficient Self-Timed Interfaces for Crossing Clock Domains”, In Proceedings ASYNC03
L. F. G. Sarmenta, G. A. Pratt and S. A. Ward, “Rational Clocking”, ICCD95

Locally Generated Clocks (periodic & free-running)
- Can exploit knowledge about the clocks when crossing clock domains, even if all we know is that they are periodic. Examples:
  - predictive synchronizers (Dally; Frank/Ginosar)
  - asynchronous FIFOs (Chakraborty/Greenstreet)

Synchronous Routers with Asynchronous Links
- Synchronization:
  - time safe: e.g. traditional two-flip-flop synchronizers
  - value safe: clock pausing/data-driven clocks

Locally Clocked Routers/Asynchronous Interconnect (GALS-style network)
- Can support asynchronous interconnects
- No longer exploiting the periodic nature of router clocks
- Correct operation is independent of the delay of the link
- GALS interfaces with pausible clocks: if necessary the clock is stretched, so data is always transferred reliably (value safe)
- Need to construct a local delay line

GALS Clock Pausing
[Figure: simple GALS interface (receiver)]
- Note: Req/Ack uses a 2-phase handshaking protocol

GALS Multiple Inputs
- The clock is free running (although it can be paused)
- It is the clock that really determines whether asynchronous data is transferred into the synchronous clock domain on a particular cycle
- Impact on performance in an on-chip network requiring multiple input data/control ports?

[Figure: GALS stoppable clock]

Local aperiodic clock generation
- Discard the free-running clock but retain a single delay assumption for the router
- Options for clock pulse generation:
  - use a stoppable GALS interface and attempt to stop every cycle (overheads?)
  - wait for data/null-data from all neighbours before generating a pulse (global synchrony!)
  - data-driven clock
  - traditional asynchronous bundled-data approach (with a single delay assumption for the whole router)
- Can still exploit a synchronous router implementation

Data-Driven Local Clock
- Idea: if there is data at any input, sample all inputs
- Determine which inputs are to be admitted on the next clock cycle (requires a MUTEX)
- Ensure that data which is not admitted is locked out for the next clock cycle
- After all MUTEXes have made a decision (and never faster than the delay line!), generate a clock pulse
- Similarities to the stoppable GALS interface and to asynchronous priority arbiters

Data-Driven Clock Waveform
- Imagine data from two packets arriving at a single router node at different rates
- An aperiodic clock may be generated to minimise latency and power
- The minimum clock period is set by the delay line
- Value-safe synchronization (no chance that data is ever lost)

Data-Driven Local Clock
Updated: June 2006
- May be generalized to n input ports. Only the control interfaces are shown here (r1, a1 and r2, a2)
- grant_n is simply used to control the latching of data at each input port (register enable)
- Simple implementation shown (work in progress)
- Some small timing constraints
- Performance tweaks possible
- Possible extensions:
  - force synchronization on a subset of inputs: some inputs must be present for a clock to be generated
  - generate additional clock pulses to handle pipelining (counter & clock-driven lock signal)
  - select a different clock period (delay line) depending on which inputs have been granted: a data-dependent clock period
See also: M. Krstic and E. Grass, “New GALS Technique for Datapath Architectures”, PATMOS 2003 (and ASYNC05 paper).

[Figure: Clocking alternatives for synchronous routers]

Synchronous Routers - Summary
- Can design high-performance single-cycle routers
- Design is simplified by the presence of global synchrony
- Distribution of the global clock can be eased by:
  - new clock generation/distribution techniques
  - source-synchronous communication
- Network operating frequency:
  - relax global synchrony further
  - data-driven clocking determines the most appropriate router clock frequency automatically

Asynchronous On-Chip Networks

Why are asynchronous NoCs interesting?
- A simple, elegant solution when networked IP blocks run at different clock frequencies
- Data driven: no superfluous switching activity
- No synchronization/clock alignment issues at interfaces
- Ability to exploit data/path-dependent delays: low-latency common or high-priority paths through the router
- No clock distribution issues
- Security and EMI advantages: a clock focuses EM emissions, and the presence of a clock can also aid fault-induction and side-channel analysis attacks
- Freedom to optimize network links:
  - not constrained by the need to distribute/generate multiple clock frequencies
  - can exploit high-frequency narrow links
  - dynamic latency/throughput trade-offs (adaptive pipeline depth)
  - exploit dynamic optimizations on links (e.g. DVS)
- Reduced design time: easy-to-use interfaces, modularity
- Robust and simple implementation
- Some arguments for reduced power

Asynchronous Circuit Basics
- Control in asynchronous circuits often relies on simple handshaking protocols (req/ack event cycles)
- Delay-insensitive event-driven system: every signal transition is acknowledged
- The C-element is a fundamental building block of many asynchronous circuits; it can be thought of as an AND gate for events

Simple Pipelines
[Figure: event FIFO and micropipeline]
I. E. Sutherland, “Micropipelines”,
Communications of the ACM, Vol. 32, Issue 6 (June 1989).

[Figure: Arbitration]

Tree Arbiter Element
M. B. Josephs and J. T. Yantchev, “CMOS Design of the Tree Arbiter Element”, IEEE Trans. on VLSI Systems 4(4), pp. 472-476, Dec. 1996
J. Bainbridge, “Asynchronous System-on-Chip Interconnect”, Ph.D. Thesis, Dept. of Computer Science, University of Manchester.

[Figure: Multiway arbiters]

Static Priority Arbiters
- “Priority Arbiters”, Bystrov/Kinniment/Yakovlev (ASYNC00):
  - the first stage samples/locks the current request vector
  - static or dynamic priority
- Original design updated to tackle performance and QoS issues: Felicijan/Bainbridge/Furber (ICM03)

Delay-Insensitive Communication
- 4-phase dual-rail protocol:
  1. REQ+
  2. ACK+
  3. REQ-
  4. ACK-
[Figure labels: ACK_out+, ACK_in-, D0=0]

Delay-Insensitive Switched Interconnect
- The basic DI latch can be extended to support steering, multiplexing and arbitration
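The C-element and the 4-phase dual-rail protocol described above can be illustrated together with a small behavioural model. The Python sketch below is hypothetical (all names are illustrative, not taken from the tutorial) and assumes the common rail convention (d0, d1) = (1, 0) for logic 0, (0, 1) for logic 1 and (0, 0) for the spacer. Completion detection ORs each rail pair and merges the results with a multi-input C-element, so the acknowledge rises only after every bit is valid and falls only after every bit has returned to the spacer. For simplicity the bits change in index order; in a real circuit they may switch in any order.

```python
from dataclasses import dataclass
from typing import List, Tuple

DualRail = Tuple[int, int]   # (d0, d1): the two wires carrying one bit
SPACER: DualRail = (0, 0)    # return-to-zero "null" state between tokens


@dataclass
class CElement:
    """Multi-input Muller C-element: changes only when all inputs agree, else holds."""
    out: int = 0

    def update(self, *inputs: int) -> int:
        if all(x == inputs[0] for x in inputs):
            self.out = inputs[0]
        return self.out


def encode(bits: List[int]) -> List[DualRail]:
    # Assumed convention: logic 0 raises the d0 rail, logic 1 raises the d1 rail.
    return [(0, 1) if b else (1, 0) for b in bits]


def decode(word: List[DualRail]) -> List[int]:
    # Only meaningful once every pair holds a valid codeword.
    return [d1 for _, d1 in word]


class CompletionDetector:
    """OR each rail pair, then merge the per-bit results with one C-element.

    The output rises only when every bit is valid and falls only when every
    bit is back at the spacer, so it can serve as the acknowledge signal.
    """

    def __init__(self) -> None:
        self.c = CElement()

    def update(self, word: List[DualRail]) -> int:
        return self.c.update(*[d0 | d1 for d0, d1 in word])


def four_phase_transfer(bits: List[int]) -> Tuple[List[int], List[str]]:
    """Send one token with the 4-phase dual-rail protocol, one wire change at a time."""
    detector = CompletionDetector()
    word: List[DualRail] = [SPACER] * len(bits)
    valid = encode(bits)
    log: List[str] = []
    # Phase 1 (REQ+): rails rise; phase 2 (ACK+) happens only when all are valid.
    for i in range(len(bits)):
        word[i] = valid[i]
        log.append(f"bit{i} valid, ack={detector.update(word)}")
    received = decode(word)  # the receiver latches the data as ack rises
    # Phase 3 (REQ-): return to the spacer; phase 4 (ACK-) only when all are null.
    for i in range(len(bits)):
        word[i] = SPACER
        log.append(f"bit{i} null, ack={detector.update(word)}")
    return received, log
```

Calling `four_phase_transfer([1, 0, 1])` returns `[1, 0, 1]` together with a log in which `ack` stays 0 until the last bit becomes valid and stays 1 until the last bit has returned to the spacer, which is the behaviour that makes the protocol insensitive to individual wire delays.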

CHAIN
J. Bainbridge and S. Furber, “CHAIN: A Delay-Insensitive Chip Area Interconnect”, IEEE Micro, Vol. 22, No. 5, 2002.
- The basic link is 6 wires: 2 bits of data (1-of-4) + end of packet + ack
  - any N-of-M code could be used
- Around 1 Gb/s (0.18 µm, 160 Mb/s per wire)
- Links may be ganged together
- Route information is tapped off and used to steer the remainder of the packet
- If arbitration is required, the arbiter grant is retained for the duration of the packet (no fragmentation of packets)

Asynchronous on-chip networks
- How do we build more complex on-chip routers?
  - support for virtual channels
  - QoS
- Challenges:
  - multi-way & prioritised arbitration
  - control overheads: arbitration and DI circuits can be slow! How can control overheads be hidden?

Overview of Some Published Asynchronous On-Chip Networks
- “Quality-of-Service (QoS) for Asynchronous On-Chip Networks”, T. Felicijan (Ph.D. 2004, Manchester), http://www.cs.manchester.ac.uk/apt/publications/
- “An Asynchronous Router for Multiple Service Levels Networks on Chip”, R. Dobkin et al., ASYNC05 (QNoC Group)
- MANGO Clockless Network-on-Chip:
  - “A Scheduling Discipline for Latency and Bandwidth Guarantees in Asynchronous Network-on-Chip”, T. Bjerregaard and J. Sparsø, ASYNC05
  - “A Router Architecture for Connection-Oriented Service Guarantees in the MANGO Clockless Network-on-Chip”, T. Bjerregaard and J. Sparsø, DATE05

Virtual Channels
- Best-effort routers: virtual-channel allocation is performed at each router
  - any free VC (at the required output) may be assigned to a new packet
  - significant performance gains over simpler static schemes
  - can also prioritize packets
- QoS routers based on static VC allocation:
  - packets retain the same VC throughout the network
  - each VC is assigned a static priority level
- Connection-oriented routers:
  - VCs are reserved at each router along a path to create a connection
  - hard QoS guarantees are possible

QoS Support
- All these asynchronous networks provide QoS support
- MANGO: Guaranteed Service (GS) connections
  - a connection is a reserved sequence of VCs through the network
  - hard latency and bandwidth guarantees are provided
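The three VC policies above (dynamic best-effort allocation, static assignment, and connection-oriented reservation) can be contrasted with a small model. This Python sketch is hypothetical and is not code from any of the cited routers: `OutputPort`, `allocate_dynamic`, `allocate_static`, `release` and `reserve_connection` are illustrative names, and flit-level buffering, flow control and arbitration are deliberately ignored.

```python
from typing import Dict, List, Optional


class OutputPort:
    """Hypothetical model of the virtual channels (VCs) at one router output."""

    def __init__(self, num_vcs: int) -> None:
        # owner[vc] is the packet (or connection) currently holding that VC.
        self.owner: Dict[int, Optional[str]] = {vc: None for vc in range(num_vcs)}

    def allocate_dynamic(self, packet: str) -> Optional[int]:
        """Best-effort style: give a new packet any free VC at this output."""
        for vc, holder in self.owner.items():
            if holder is None:
                self.owner[vc] = packet
                return vc
        return None  # no free VC: the head flit must wait

    def allocate_static(self, packet: str, vc: int) -> Optional[int]:
        """Static style: the packet may only use the VC it was assigned at the source."""
        if self.owner[vc] is None:
            self.owner[vc] = packet
            return vc
        return None  # held by another packet with the same VC; no interleaving

    def release(self, vc: int) -> None:
        """The tail flit has left: the VC becomes free again."""
        self.owner[vc] = None


def reserve_connection(ports: List[OutputPort], conn: str) -> Optional[List[int]]:
    """Connection-oriented style: reserve one VC at every hop of a path, all or nothing."""
    reserved: List[int] = []
    for port in ports:
        vc = port.allocate_dynamic(conn)
        if vc is None:
            # Roll back the hops already reserved so a failed attempt leaves no trace.
            for hop, held_vc in zip(ports, reserved):
                hop.release(held_vc)
            return None
        reserved.append(vc)
    return reserved
```

With two VCs, a third dynamically allocated packet must wait until a tail flit releases a VC. Under static assignment a packet waits for its own VC even while other VCs are idle, which is the contention a statically assigned VC can suffer; a reserved connection, by contrast, holds its VCs until it is explicitly torn down.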

Static VC assignments
- Felicijan and Dobkin implement QoS through static VC assignments, i.e. a packet is assigned a VC and uses this VC at all routers
- May need to contend with other packets assigned the same VC
- Packets with the same VC cannot be interleaved
- The VC is reserved for the duration of the packet (reserved rather than allocated from a pool of free VCs)

Felicijan/Manchester
- Implementation style: QDI, 1-of-4 encoded data with RTZ signalling
- Simplest switching network of the asynchronous designs (multiplexed crossbar)
- 8-bit data flits
- Performance results (0.18 µm):
  - maximum router frequency 300 MHz
  - minimum router latency 5 ns?
- Two constraints on the provision of QoS: the first due to the multiplexed crossbar, the second related to minimum buffer requirements

Dobkin/Technion
- 4 service levels (statically assigned VCs)
- Implementation style: bundled data
- Significant area reduction over a QDI approach
- 8-bit data flits
- Synchronous versus asynchronous router study:
  - throughput is reported to be similar
  - minimum latency (head flit), input to output (0.35 µm, typical PVT): synchronous 3.7 ns, asynchronous 13.0 ns (×3.5)

MANGO Clockless Network-on-Chip
- A non-blocking switching network means that link access arbitration is all that must be considered for hard QoS guarantees
- VCs are assigned statically (no contention)
- A simple BE router is used to program the GS router (not shown)
- The basic Static Priority Arbiter (SPA) is preceded by admission control logic:
  - part of the Asynchronous Latency Guarantee (ALG) scheduling algorithm (see ASYNC05 paper)
  - prevents lower-priority flits from being stalled more than once by each higher-priority flit
- 515 MHz port speed (worst case, 0.13 µm)
- 32-bit data flits
- Implementation style: internally uses a bundled-data (RTZ) circuit style; links use a DI two-phase encoding
- Router latency 5.2 ns (switch 2.1 ns, VC buffers/control 1.2 ns, VC merge 1.6 ns)
- MANGO provides hard latency/throughput guarantees, unlike other schemes based on VC prioritization

Low-Latency Best-Effort Asynchronous Networks

Improving Network Latency
- Asynchronous router latency can be high
- Fine-grain pipelining can provide good throughput figures, but control overheads can extend latency: completion detection, the RTZ phase, handshaking
- Fast combinational matrix arbiters have also been replaced by cascaded MUTEXes or complex priority arbiters
- Overheads are even greater in a BE router that must allocate VCs dynamically
- Approaches to reduce latency?
  - speculation
  - decoupled control and data networks

Low-Latency Asynchronous Routers
- Exploit speculation?
  - use a priority arbiter organisation
  - assume only a single grant will be present after the lock is asserted
  - use the MUTEX grant outputs to steer data immediately
- Issues:
  - complex abort procedure?
  - invalid data and DI encoding?
  - careful not to make the common case slower
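The speculation question raised above, whether a MUTEX grant can safely steer data before the priority arbiter has resolved, can be explored with a small timing model. The Python sketch below is hypothetical and behavioural only: it assumes the request vector is locked a fixed `lock_delay` after the first request arrives, that a lower input-port index means higher static priority, and that data is steered speculatively towards the first MUTEX winner. `Request`, `priority_grant` and `speculative_steer` are illustrative names. An abort is needed only when a higher-priority request was also captured by the lock.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Request:
    port: int        # input port; a lower index means higher static priority (assumption)
    arrival: float   # arrival time, arbitrary units


def priority_grant(locked: List[Request]) -> int:
    """Static priority arbiter: grant the highest-priority locked request."""
    return min(r.port for r in locked)


def speculative_steer(requests: List[Request], lock_delay: float) -> Tuple[int, int, bool]:
    """Return (speculative grant, final grant, abort needed) for one arbitration round.

    The first request to arrive wins the MUTEX and its data is steered at once.
    The lock is asserted lock_delay later; every request present by then is
    captured and resolved by static priority. If the final winner differs from
    the speculative one, the speculatively steered data must be aborted.
    """
    if not requests:
        raise ValueError("at least one request is needed to start a round")
    first = min(requests, key=lambda r: r.arrival)
    lock_time = first.arrival + lock_delay
    locked = [r for r in requests if r.arrival <= lock_time]
    final = priority_grant(locked)
    return first.port, final, first.port != final
```

When requests are sparse, only one request is captured by the lock and the speculative grant is always correct; the abort path matters only when several requests fall inside the same lock window and a higher-priority one was not the first to arrive.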
