Slide 1: Advanced Charm++ Tutorial
Charm Workshop Tutorial
Sameer Kumar
Orion Sky Lawlor
charm.cs.uiuc.edu
2004/10/19

Slide 2: How to Become a Charm++ Hacker
Advanced Charm++:
- Advanced messaging
- Writing system libraries: groups, delegation, communication framework
- Advanced load balancing
- Checkpointing
- Threads
- SDAG

Slide 3: Advanced Messaging

Slide 4: Prioritized Execution
- If several messages are available, Charm++ processes the message with the highest priority; otherwise it takes the oldest message (FIFO)
- Has no effect if only one message is available (common for network-bound applications!), and has no effect on outgoing messages
- Very useful for speculative work, ordering timesteps, etc.
Slide 5: Priority Classes
The Charm++ scheduler has three queues: high, default, and low.
As signed integer priorities:
  -MAXINT .. -1     highest priority
  0                 default priority
  1 .. +MAXINT      lowest priority
As unsigned bitvector priorities:
  0x0000 .. 0x7FFF  highest priority
  0x8000            default priority
  0x8001 .. 0xFFFF  lowest priority

Slide 6: Prioritized Marshalled Messages
Pass a CkEntryOptions as the last parameter.
For signed integer priorities:
  CkEntryOptions opts;
  opts.setPriority(-1);
  fooProxy.bar(x, y, opts);
For bitvector priorities:
  CkEntryOptions opts;
  unsigned int prio[2] = {0x7FFFFFFF, 0xFFFFFFFF};
  opts.setPriority(64, prio);
  fooProxy.bar(x, y, opts);
Slide 7: Prioritized Messages
The number of priority bits is passed during message allocation:
  FooMsg *msg = new (size, nbits) FooMsg;
Priorities are stored at the end of the message.
Signed integer priorities:
  *(int *)CkPriorityPtr(msg) = -1;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);
Unsigned bitvector priorities:
  ((unsigned int *)CkPriorityPtr(msg))[0] = 0x7FFFFFFF;
  CkSetQueueing(msg, CK_QUEUEING_BFIFO);

Slide 8: Advanced Message Features
- Read-only messages: the entry method agrees not to modify or delete the message; avoids a message copy on broadcasts, saving time
- Expedited messages: the message does not go through the Charm++ scheduler (faster)
- Immediate messages: entries are executed in an interrupt or on the communication thread; very fast, but tough to get right

Slide 9: Read-Only, Expedited, Immediate
All are declared in the .ci file:
  entry [nokeep] void foo_readonly(Msg *);
  entry [expedited] void foo_exp(Msg *);
  entry [immediate] void foo_imm(Msg *);
  // Immediate messages currently work only for NodeGroups
Slide 10: Groups

Slide 11: Object Groups
- A collection of objects (chares), also called "branch office chares"
- Exactly one representative on each processor; ideally suited for system libraries
- A single proxy for the group as a whole
- Similar to arrays: broadcasts, reductions, indexing
- But not completely like arrays: non-migratable, one per processor

Slide 12: Declarations
.ci file:
  group mygroup {
    entry mygroup();            // constructor
    entry void foo(foomsg *);   // entry method
  };
C++ file:
  class mygroup : public Group {
    mygroup() {}
    void foo(foomsg *m) { CkPrintf("Do Nothing"); }
  };
Slide 13: Creating and Calling Groups
Creation:
  p = CProxy_mygroup::ckNew();
Remote invocation:
  p.foo(msg);      // broadcast
  p[1].foo(msg);   // asynchronous invocation on processor 1
Direct local access:
  mygroup *g = p.ckLocalBranch();
  g->foo(...);     // local invocation
Danger: if you migrate, the group stays behind!

Slide 14: Delegation

Slide 15: Delegation
- Enables Charm++ proxy messages to be forwarded to a delegation manager group
- The delegation manager can trap calls to proxy sends and apply optimizations
- The delegation manager must inherit from CkDelegateMgr
- The user program must call proxy.ckDelegate(mgrID);

Slide 16: Delegation Interface
.ci file:
  group MyDelegateMgr {
    entry MyDelegateMgr();   // constructor
  };
.h file:
  class MyDelegateMgr : public CkDelegateMgr {
    MyDelegateMgr();
    void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, ...);
    ...
  };
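Putting the two calls above together, a usage sketch (hedged: MyDelegateMgr and fooProxy are the illustrative names from these slides, and group-creation details may vary across Charm++ versions):

  // Create the delegation manager group, then route a proxy through it.
  CkGroupID mgrID = CProxy_MyDelegateMgr::ckNew();
  fooProxy.ckDelegate(mgrID);  // subsequent sends on fooProxy are
  fooProxy.bar(x, y);          //   trapped by MyDelegateMgr (e.g. ArraySend)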
Slide 17: Communication Optimization

Slide 18: Automatic Communication Optimizations
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
- Communication libraries can optimize by substituting the most suitable algorithm for each operation, learning at runtime
- E.g. all-to-all communication: performance depends on many runtime characteristics, so the library switches between different algorithms
- Communication is from/to objects, not processors
- Streaming-messages optimization

Slide 19: Managing Collective Communication
- A communication operation in which all (or most) of the processors participate: for example broadcast, barrier, all-reduce, all-to-all, etc.
- Applications: NAMD multicast, NAMD PME, CPAIMD
- Issues: a performance impediment; naive implementations often do not scale; synchronous implementations do not utilize the co-processor effectively

Slide 20: All to All Communication
- All processors send data to all other processors
- All-to-all personalized communication (AAPC): MPI_Alltoall
- All-to-all multicast/broadcast (AAMC): MPI_Allgather

Slide 21: Strategies For AAPC
- Short messages: high per-message software overhead, so use message combining
- Large messages: network contention
Slide 22: Short Message Optimizations
- Direct all-to-all communication is dominated by the per-message software overhead
- Message combining for small messages reduces the total number of messages
- A multistage algorithm sends messages along a virtual topology: groups of messages are combined and sent to an intermediate processor, which then forwards them to their final destinations
- An AAPC strategy may send the same message multiple times

Slide 23: Virtual Topology: Mesh
- Organize the processors in a 2D (virtual) mesh
- Phase 1: processors send messages to row neighbors
- A message from (x1,y1) to (x2,y2) goes via (x1,y2)
- Phase 2: processors send messages to column neighbors
- 2*(√P - 1) messages instead of P-1
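The routing rule above is easy to state in code. Here is a small illustrative sketch (not comlib source; meshIntermediate is a made-up name) that computes the phase-1 intermediate processor:

  #include <cmath>

  // For a message from 'src' to 'dst' on a virtual side x side mesh
  // (side = sqrt(P), assuming P is a perfect square), the phase-1 hop
  // goes to the processor in src's row and dst's column: (x1,y2).
  int meshIntermediate(int src, int dst, int P) {
    int side   = (int)std::sqrt((double)P);
    int srcRow = src / side;   // x1
    int dstCol = dst % side;   // y2
    return srcRow * side + dstCol;
  }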
Slide 24: AAPC Performance
[Performance graph; no text content to recover.]

Slide 25: Large Message Issues
- Network contention
- Contention-free schedules
- Topology-specific optimizations

Slide 26: Ring Strategy for Collective Multicast
- Performs all-to-all multicast by sending messages along a ring formed by the processors
- Congestion-free on most topologies

Slide 27: Streaming Messages
- Programs often have streams of short messages
- The streaming library combines a bunch of messages and sends them off together
- Strips the large Charm++ header; packs short array messages
- Effective message performance of about 3 us

Slide 28: Using the Communication Library
Communication optimizations are embodied as strategies:
- EachToManyMulticastStrategy
- RingMulticast
- PipeBroadcast
- Streaming
- MeshStreaming

Slide 29: Bracketed vs. Non-bracketed
- Bracketed strategies require the user to give specific end points for each iteration of message sends; the end points are declared by calling ComlibBegin() and ComlibEnd(). Example: EachToManyMulticast
- Non-bracketed strategies need no such end points. Examples: Streaming, PipeBroadcast

Slide 30: Accessing the Communication Library
From mainchare::main:
Creating a strategy:
  Strategy *strat = new EachToManyMulticastStrategy(USE_MESH);
  strat = new StreamingStrategy();
  strat->enableShortMessagePacking();
Associating a proxy with a strategy:
  ComlibAssociateProxy(strat, myproxy);
myproxy should then be passed to all array elements.

Slide 31: Sending Messages
  ComlibBegin(myproxy);   // bracketed strategies
  for (...)
    myproxy.foo(msg);
  ComlibEnd();            // bracketed strategies

Slide 32: Handling Migration
A migrating array element PUPs its comlib-associated proxy:
  void FooArray::pup(PUP::er &p) {
    ...
    p | myproxy;
    ...
  }

Slide 33: Compiling
You must include the compile-time option -module commlib.
Slide 34: Advanced Load-balancers: Writing a Load-balancing Strategy

Slide 35: Advanced Load Balancing: Writing a New Strategy
Inherit from CentralLB and implement the work() function:
  class fooLB : public CentralLB {
  public:
    void work(CentralLB::LDStats *stats, int count);
  };

Slide 36: LB Database
  struct LDStats {
    ProcStats  *procs;
    LDObjData  *objData;
    LDCommData *commData;
    int        *to_proc;
    ...
  };
  // Dummy work function which assigns all objects to processor 0
  // -- don't implement it like this!
  void fooLB::work(CentralLB::LDStats *stats, int count)
  {
    for (int obj = 0; obj < nobjs; obj++)
      stats->to_proc[obj] = 0;
  }
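Building on the dummy example, a slightly more useful sketch spreads objects round-robin over the available processors (hedged: this assumes LDStats exposes an n_objs count; check the LB headers of your Charm++ version):

  // Round-robin placement instead of "everything on PE 0".
  void fooLB::work(CentralLB::LDStats *stats, int count)
  {
    for (int obj = 0; obj < stats->n_objs; obj++)
      stats->to_proc[obj] = obj % count;   // spread over 'count' PEs
  }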
Slide 37: Compiling and Integration
- Edit and run Makefile_lb.sh; this creates Make.lb, which is included by the LDB Makefile
- Run "make depends" to correct the dependencies
- Rebuild Charm++

Slide 38: Checkpoint Restart

Slide 39: Checkpoint/Restart
- Any long-running application must be able to save its state
- When you checkpoint an application, it uses the pup routines to store the state of all objects
- The state information is saved in a directory of your choosing
- Restore also uses pup, so no additional application code is needed (pup is all you need)
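For reference, a minimal sketch of such a pup routine (MyChare and its members are illustrative, and the superclass call depends on your chare type):

  void MyChare::pup(PUP::er &p) {
    ArrayElement1D::pup(p);  // pup the superclass state first
    p | timestep;            // the same code sizes, packs, or unpacks,
    p | temperature;         //   depending on the PUP::er passed in
  }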
Slide 40: Checkpointing a Job
- In AMPI, use MPI_Checkpoint(...); a collective call that returns when the checkpoint is complete
- In Charm++, use CkCheckpoint(..., ...); called on one processor; resume is called when the checkpoint is complete

Slide 41: Restart Job from Checkpoint
- The charmrun option +restart is used to restart
- The number of processors need not be the same
- You can also restart groups by marking them migratable and writing a PUP routine; they still will not load balance, though

Slide 42: Threads
Slide 43: Why Use Threads?
They provide one key feature: blocking.
- Suspend execution (e.g., at a message receive), do something else, resume later (e.g., after the message arrives)
- Example: MPI_Recv, MPI_Wait semantics
A function-call interface is more convenient than message passing:
- Regular call/return structure (no CkCallbacks)
- Allows blocking in the middle of a deeply nested communication subroutine

Slide 44: Why Not Use Threads?
- Slower: around 1 us of unavoidable context-switching overhead; creation/deletion perhaps 10 us
- More complexity, more bugs
- Breaks a lot of machines! (but we have workarounds)
- Migration is more difficult: the state of a thread is scattered through the stack, which is maintained by the compiler; by contrast, the state of an object is maintained by the user
- These thread disadvantages are the motivation for SDAG (later)

Slide 45: What Are (Charm) Threads?
- One flow of control (instruction stream)
- Machine registers and stack
- Non-preemptive
- Migratable between nodes
Slide 46: How Do I Use Threads?
Many options:
- AMPI: always uses threads, via the TCharm library
- Charm++: [threaded] entry methods run in a thread; [sync] methods
- Converse: C routines CthCreate/CthSuspend/CthAwaken; everything else is built on these
Implemented using SYSV makecontext/setcontext or POSIX setjmp/alloca/longjmp.
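At the Converse level, the core calls look like this (a sketch assuming the standard Converse signatures; worker and spawnWorker are made-up names):

  #include <converse.h>

  void worker(void *arg) {
    /* ... do some work ... */
    CthSuspend();            // block here until someone calls CthAwaken
    /* ... resumed ... */
  }

  void spawnWorker(void) {
    // stack size 0 requests the default; the cast follows Converse usage
    CthThread t = CthCreate((CthVoidFn)worker, /*arg=*/0, /*stackSize=*/0);
    CthAwaken(t);            // make the new thread runnable
  }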
Slide 47: How Do I Use Threads? (example)
Blocking API routine: find the array element
  int requestFoo(int src) {
    myObject *obj = ...;
    return obj->fooRequest(src);
  }
Send the request and suspend
  int myObject::fooRequest(int src) {
    proxy[dest].fooNetworkRequest(thisIndex);
    stashed_thread = CthSelf();
    CthSuspend();   // <- blocks until the awaken call
    return stashed_return;
  }
Awaken the thread when the data arrives
  void myObject::fooNetworkResponse(int ret) {
    stashed_return = ret;
    CthAwaken(stashed_thread);
  }

Slide 48: How Do I Use Threads? (example)
In short: send request, suspend, receive, awaken, return (the same fooRequest/fooNetworkResponse pair as above).

Slide 49: The Horror of Thread Migration

Slide 50: Stack Data
- The stack is used by the compiler to track function calls and provide temporary storage: local variables, subroutine parameters, C "alloca" storage
- Most of the variables in a typical application are stack data
- Users have no control over how the stack is laid out
Slide 51: Migrate Stack Data
- Without compiler support, we cannot change the stack's address, because we can't change the stack's interior pointers (return frame pointer, function arguments, etc.)
- Solution: "isomalloc" addresses: reserve address space on every processor for every thread stack
- Use mmap to scatter the stacks in virtual memory efficiently
- The idea comes from PM2
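The reservation itself is plain POSIX; a hedged sketch of the idea (illustrative, not actual Charm++ source):

  #include <sys/mman.h>

  // Reserve the same address range for a thread's stack on every
  // processor, so the stack's interior pointers stay valid after
  // migration. MAP_FIXED pins the mapping at fixedAddr; only the
  // node that owns the thread ever touches these pages.
  void *reserveStack(void *fixedAddr, size_t len) {
    return mmap(fixedAddr, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  }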
Slide 52: Migrate Stack Data
[Diagram: processor A's and B's address spaces (code, globals, heap, thread stacks); thread 3's stack is to be migrated from A to B.]

Slide 53: Migrate Stack Data: Isomalloc
[Diagram: the same address spaces; thread 3's stack lands at the same reserved virtual address on processor B.]
Slide 54: Migrate Stack Data
Isomalloc is a completely automatic solution:
- No changes needed in the application or the compilers
- Just like a software shared-memory system, but with proactive paging
But it has a few limitations:
- Depends on having large quantities of virtual address space (best on 64-bit); 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine
- Depends on unportable mmap: which addresses are safe? (We must guess!) What about Windows? Or Blue Gene?

Slides 55-61: Aliasing Stack Data
[Diagram sequence over processor A's and B's memory (code, globals, heap, thread stacks): to run thread 2 or thread 3, its stack is mapped in as the single "execution copy" at a fixed address; migrating thread 3 copies its stack to processor B, where it runs at that same address.]
Slide 62: Aliasing Stack Data
- Does not depend on having large quantities of virtual address space: works well on 32-bit machines
- Requires only one mmap'd region at a time; works even on Blue Gene!
Downsides:
- A thread context switch requires munmap/mmap (3 us)
- Only one thread can be running at a time (so no SMPs!)

Slide 63: Heap Data
Heap data is any dynamically allocated data:
- C "malloc" and "free"
- C++ "new" and "delete"
- F90 "ALLOCATE" and "DEALLOCATE"
Arrays and linked data structures are almost always heap data.

Slide 64: Migrate Heap Data
Automatic solution: isomalloc all heap data, just like the stacks!
- "-memory isomalloc" link option overrides malloc/free; no new application code needed
- Same limitations as isomalloc, plus page-allocation granularity (huge!)
Manual solution: the application moves its own heap data.
- You need to be able to size the message buffer, pack the data into the message, and unpack it on the other side
- The "pup" abstraction does all three; see the sketch below
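A sketch of those three steps driven by a single pup routine (obj, newObj, and their pup method are illustrative):

  #include "pup.h"   // Charm++ PUP framework

  PUP::sizer sz;               // step 1: measure the buffer size
  obj.pup(sz);
  char *buf = new char[sz.size()];

  PUP::toMem packer(buf);      // step 2: pack into the message buffer
  obj.pup(packer);

  PUP::fromMem unpacker(buf);  // step 3: unpack on the other side
  newObj.pup(unpacker);
  delete[] buf;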
Slide 65: SDAG

Slide 66: Structured Dagger
What is it? A coordination language built on top of Charm++.
Motivation:
- Charm++'s asynchrony is efficient and reliable, but tough to program: flags, buffering, out-of-order receives, etc.
- Threads are easy to program, but less efficient and less reliable: implementation complexity, porting headaches
- We want the benefits of both!

Slide 67: Structured Dagger Constructs
- when <method> <code>: do not continue until the method is called; internally generates the flags, checks, etc.; does not use threads
- atomic <code>: call ordinary sequential C++ code
- if/else/for/while: C-like control flow
- overlap <code1> <code2> ...: execute the code segments in parallel
- forall: "parallel do", like a parameterized overlap
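A flavor of what these constructs look like in a .ci file (a hedged sketch: the names are illustrative, and SDAG syntax has shifted across Charm++ versions):

  array [1D] Jacobi {
    entry Jacobi();
    entry void recvLeft(Msg *l);
    entry void recvRight(Msg *r);
    entry void step() {
      overlap {   // accept the two neighbor messages in either order
        when recvLeft(Msg *l)  atomic { copyLeft(l); }
        when recvRight(Msg *r) atomic { copyRight(r); }
      };
      atomic { compute(); }   // runs only after both messages arrive
    };
  };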