Threading Full Auto Physics
David Wu, Director of Technology, Pseudo Interactive

Agenda
- About Full Auto
- About Pseudo
- Review of Threading Techniques
- Full Auto Threading Strategy
- Full Auto Threading Reality
- What went right, what went wrong
- Random Notes

Full Auto
- Released on Feb 14, 2006 for the Xbox 360.

Full Auto
- Design Vision: "The most destructive racing game ever"
- My Vision: "Physics running on multiple cores"
- Which I later updated to: "Physics running on multiple cores that has few race conditions and doesn't crash too often"
- Marketing preferred the Design Vision.

About Pseudo Interactive
- Pseudo Interactive is a team of great game developers.

Types of Parallelism
- Pipelined Stages: discrete stages that feed into one another, e.g. Physics -> Render

Pipeline
- Similar to CPU/GPU parallelism.
[Diagram: at T=3 the CPU runs game frame 3 while the GPU renders frame 2; at T=4 the CPU runs frame 4 while the GPU renders frame 3; at T=5 the CPU runs frame 5 while the GPU renders frame 4]

Pipeline
[Diagram: at T=3, thread 0 runs collision detection for frame 3, thread 1 runs logic/AI for frame 2, and thread 2 runs integration for frame 1; at T=4 and T=5 each stage advances one frame]

Types of Parallelism
- Pipelined Stages: discrete stages that feed into one another, e.g. Physics -> Render
- Forking: divide data and solve on multiple threads at once, e.g. collision detection on independent objects

Forking (Data Parallelism)
- Perform the same task on multiple independent objects in parallel.
- A thread "forks" into multiple threads across multiple processors.
- All threads repeatedly grab pending objects indiscriminately and execute the task on them.
- When finished, the threads combine back into the original thread.

Forking
[Diagram: the original thread forks; objects A, B, and C are processed on threads 2, 0, and 1; the threads then combine]
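The forking pattern above, where worker threads repeatedly grab pending objects with interlocked access and then combine back into the original thread, can be sketched roughly like this (a minimal sketch using C++11 atomics and std::thread; the function names and the plain int payload are illustrative, not from the talk):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Fork: run the same task over every pending object on several threads.
// Workers claim objects with an atomic fetch-and-increment (the portable
// equivalent of InterlockedIncrement), so no object is processed twice.
void fork_and_process(std::vector<int>& objects,
                      void (*task)(int&),
                      unsigned num_threads) {
    std::atomic<std::size_t> next{0};  // index of the next unclaimed object

    auto worker = [&] {
        for (;;) {
            std::size_t i = next.fetch_add(1);   // claim one object atomically
            if (i >= objects.size()) break;      // to-do list exhausted
            task(objects[i]);
        }
    };

    // Fork into num_threads threads (the calling thread participates too).
    std::vector<std::thread> helpers;
    for (unsigned t = 1; t < num_threads; ++t) helpers.emplace_back(worker);
    worker();

    // Combine: join back into the original thread.
    for (auto& th : helpers) th.join();
}
```

Load balancing falls out of the indiscriminate grabbing: a thread that finishes an object early simply claims the next one, so cheap and expensive objects even out across threads.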

Types of Parallelism
- Pipelined Stages: discrete stages that feed into one another, e.g. Physics -> Render
- Forking: divide data and solve on multiple threads at once, e.g. collision detection on independent objects
- Jobs: spin off tasks to be completed asynchronously; lower-priority tasks not in the critical path, e.g. effects, procedural mesh and texture updates

Jobs
[Diagram: the main game loop (physics, AI, rendering, etc.) gets high-priority thread allocation and pushes job tasks onto a job queue serviced by free threads. The main loop is synchronous, deterministic, on the critical path, and latency critical; jobs are asynchronous, non-deterministic, low priority, and latency tolerant]

Full Auto X360 Thread Allocation
[Diagram: the six hardware threads (pairs 0|1, 2|3, 4|5 across three cores) are allocated among the main thread (physics, game logic, animation, AI, network, etc.), rendering, audio, voice encode/decode, and worker threads]

Full Auto Pipeline
- Collect net data and player input
- Collision Detection (30% CPU time)
- Game Logic, Animation, AI (40% CPU time)
- Integration and Constraint Solving (30% CPU time)
- Kick render, audio, network data

Collision Detection
- All objects are stored in a global KDOP tree.
- Moving objects can collide with moving objects and with idle objects; idle objects cannot collide with idle objects.
- The main thread forks into 5 threads (soliciting the worker threads).

Collision Detection
[Diagram: on each thread, Broad Phase runs over the moving objects against the read-only KDOP tree of all objects, then Narrow Phase, then Callbacks into logic code, then Register Contacts with the Constraint Manager]

Pipeline?
- Broad Phase (BVH) -> Narrow Phase (GJK) -> Callbacks -> Register Contacts
- This would not improve latency and would require double buffering of data.

Forking: Collision Detection Tasks
- Each moving object is a task.
- Threads pop objects off of the to-do list one by one using interlocked access until they are all processed.
- To avoid duplicate tests, moving objects are only tested against other moving objects whose memory address is lower.

Collision Detection
[Diagram: the same flow per claimed object: Object N Broad Phase, Narrow Phase, Callbacks, Register Contacts, with the Constraint Manager shared and the KDOP tree read only]

Race Conditions
- There are multiple threads registering contacts simultaneously. To avoid race conditions we need atomic access.
- There are many terms
and concepts that will suit our needs: Mutex, Critical Section, Semaphore, Spin Lock.

Mutex Rant
- Most researchers assert, and often prove, that locks/mutexes suck.
- I am convinced that this is true.
- But everything else sucks more.

Mutex 101
- We use atomic primitives whenever possible (InterlockedIncrement(), cellAtomicIncrement(), etc.) and a Mutex variant for everything else, e.g.:
  constraintManagerMutex.Acquire();
  // register contacts
  constraintManagerMutex.Release();

Spin Lock Mutex
- Acquire() looks like: while (_InterlockedCompareExchange(&lock, 1, 0) != 0) {}
- With slight variations you can have a recursive mutex or a read/write mutex.
- System mutexes will set up events and sleep. Useful if you have more software threads than hardware threads.

Collision Detection
[Diagram: the per-object flow again, now with one mutex protecting Register Contacts into the Constraint Manager and another mutex around the Callbacks into logic code]

Constraint Manager
- The constraint manager keeps track of all interacting objects.
- The connected subgraphs form simulation batches of coupled objects: objects that must be solved together. More on this later.

Collision Detection Results
- There were enough objects to distribute tasks evenly with good load balancing.
- In the median case, with 5 threads across 3 cores, we had a 3-4x performance increase compared to 1 thread on 1 core.
- The worst case was about 2x; this was due to rare situations in which there were a lot of callbacks that held locks for too long. We wanted to improve this.

Collision Detection Callbacks
- We have many logic callbacks that trigger in response to contacts.
- Due to the quantity of logic code that was not thread safe, we added a lock around all callbacks that were suspicious. This way they can think that they are single threaded: they know that no other thread-agnostic logic is running while they are running.
- Callbacks are not allowed to access global collision data; they use information that is passed to them regarding the specific contact.

Collision Detection
[Diagram: as before, but the mutex around Callbacks now guards only the less thread-friendly logic code; thread-friendly logic code runs outside the lock]

Collision Detection Callbacks
- I considered adding extra software threads to help reduce the impact of waiting on locks. That is: each thread runs as usual, but when it needs a lock that is taken, it pushes its current context and starts from the beginning with the next available object. When the lock is available, it goes back to its first object and finishes that before continuing with the second object.

Collision Detection Callbacks
- It is generally important that you test all hypotheses. Bonus points if you talk about the results at GDC to save other game programmers time. Even if you don't want points, you might get a free pass to GDC if your lecture is accepted.
- That being said, it felt messy and I didn't want a lot of software threads hanging around, so
I didn't try it.

Collision Detection Callbacks
- In the future we plan to make all callbacks deferred, with no assumptions as to ordering.
- This will be needed for PS3, where collision detection will not occur on a general-purpose CPU.
- I anticipate a lot of debugging and pain.

Collision Detection PS3
[Diagram: the SPU runs Object N Broad Phase and Narrow Phase over the moving objects against the read-only KDOP tree, feeding a contact queue and a callback queue; the PPU drains the queues into the constraint manager and the logic code]

Collision Detection Implementation
- Surprisingly, the implementation was not too difficult and there were few tortuous moments. For a brief
period, I thought that threading was easy money. I was wrong.

Full Auto Main Thread
[Diagram: the pipeline again (collect net data and player input; Collision Detection, 30% CPU; Game Logic, Animation, AI, 40% CPU; Integration and Constraint Solving, 30% CPU; kick render, audio, network data), with the other threads joining thread 0 for the parallel stages]

Game Logic First Plan
- Game logic is single threaded. Engine operations fork across CPUs to speed up expensive operations.
[Diagram: game logic, animation, and AI stay on thread 0; engine operations (geo queries, hull tracing, line of sight, sensors, kinematics) fork across the other threads]

Game Logic First Plan
- Rationale: threading is hidden in engine code; game logic code does not need to know about it.
- Game logic code is more volatile (more changes) and there is more of it, so thread safety is more costly to implement.
- We were hoping that most of the time was spent in engine functions that could be parallelized well.

Game Logic First Plan Results

- Not much of a change. In Full Auto, the majority of time was spent outside of engine code.
- This is likely genre specific; i.e. in some games a lot more time is spent in line-of-sight checks.

Game Logic Second Plan
[Diagram: less thread-friendly game logic stays on thread 0; thread-friendly game logic (one functional per thread) and the engine operations (geo queries, hull tracing, line of sight, sensors, kinematics) run across all threads]
- The main thread forks. Each thread-friendly game logic "functional" executes in any thread.
- Once these are done, the threads merge and thread 0 executes the less thread-friendly game logic in a single-threaded manner.

Game Logic Second Plan
- Initially all logic was "less thread friendly".
- We made functionals thread friendly one by one, based on how much CPU time they used and how much programmer time it would take to make them thread friendly.

Game Logic Second Plan
- The functionals that we converted include: the car control system; animation; all actuators, constraints, and potentials (ropes, springs, etc.); particle emitters; sound emitters; sensors; some AI.

Game Logic Second Plan
- Some that we wanted to convert but that didn't make it: some AI, race logic, weapons.
- As expected, these became performance hotspots. Amdahl's Law sucks.

Full Auto Main Thread
[Diagram: the pipeline again, now highlighting Integration and Constraint Solving]

Integration
- Integration batches: each batch contains one or more objects that are physically interacting.
[Diagram: on all threads, each worker selects a batch with no outstanding dependencies, integrates and solves constraints, updates the BVH, dependent kinematics, and VSD, then writes state to the record buffer]

Integration Forked
[Diagram: the same flow, with batch selection done via atomicInc and mutexes protecting the BVH, the VSD, and the record buffer]

Integration Results
- 3-4x speedup in all tests using 5 threads across 3 cores.

Damage
- Damage processing took a lot of time. We spun off all non-critical-path processing into jobs, i.e. modifying meshes to reflect FEM damage, adding decals, and building meshes for chunks from their source geo.

Damage
- Jobs are executed only when threads are not busy doing other things.
- In high-stress situations, it takes several frames for damage effects to display.
[Diagram: the jobs picture again: the main game loop keeps high-priority thread allocation and pushes asynchronous, non-deterministic, latency-tolerant jobs onto the job queue for free threads]

Rendering
- Rendering executed in its own thread, which ran in parallel with the main game loop.
- It only used one thread. We were unable to find a way to fork anything significant given the nature of the API and our timeline. The API is better now.

What Went Right?
- Threaded physics early.
- Used all processors and threads.
- Performance actually improved.
- We debugged just about all of it before we shipped.

What Went Wrong
- Not enough polish time. There were some easy-money perf gains that we missed.
- I am a fan of consistent performance. Lots of the time we run at 60, but sometimes things get slow, and I have no idea why.
- Not enough time on CPU rendering perf. We spend more time in DX code than in physics.
- Too much debugging. Many of the bugs were rogue interactions between the renderer and the game code.

Lessons Learned
- Threading is challenging but surprisingly fun.
- Given enough time, there is a great deal of optimization that can be done.
- All consoles are like PS2 in that, compared with PC and Xbox1, optimization actually makes a difference.

Future
- How will processors evolve going forward?
- Symmetric processors (e.g. X360, current PC)? Or asymmetric processors (e.g. a PC with a physics chip or GPU, or CELL)?
- Big vectors or small vectors? (SIMD)

Future
- On X360, vs. CELL, we are able to thread a lot more stuff, more easily.
- CELL has a lot of potential, but more time is required to implement systems and update.
- Fortunately, physics can adapt to most of these paradigms.

Physics Level of Detail
- This is the new holy grail. Due to the nature of the problems, we often have times of very high stress followed by periods of low stress.
- Some tasks can be deferred and updated by threads asynchronously, with greater latency.
- Some tasks can be dropped, with detail increased only when possible.

Amdahl's Law
- Amdahl's Law sucks more on multi-core systems than it does on single-core systems.
- Single-threaded slow code is many times slower than fully threaded fast code.
- Due to the nature of our problems and schedules, we cannot thread everything. As processor counts increase, the single-threaded parts become more and more significant.
- Collision detection and integration are fast in Full Auto. However, the other parts in between, and single-threaded logic, have become the bottlenecks.

Amdahl's Law
- Solutions (wishful): faster single-threaded processors, new algorithms, more time for programmers, help from the compiler.
- Solution (practical): change the design such that we only do things that run fast.

Note on Low-Level Parallelism
- Instruction parallelism: SIMD, multiple elements in a vector.
- Pipeline parallelism: keep all execution units active.
- Memory parallelism: keep the memory bus active through prefetching and pipelined writes.
- Ideally, compilers would handle these. But they don't. Often these
can be high bang/buck items on in-order cores with curiously high latency.

Pipeline: Notes
- I am not a fan of pipelined stages. Whenever possible we used forking and jobs.
- Presently, Full Auto only uses pipelined stages for rendering vs. the rest of the game. We spent too much time debugging race conditions between game logic and rendering data.
- In the future, when the graphics API is more thread friendly, we will fork our rendering.

Ad Hoc Threading 101
- Some of our threading was heavily planned. The rest of it was less so.
- We used a data-centric approach with the following ad hoc heuristics.

Ad Hoc Threading 101
- Data-centric approach: if there is only one writer, and the data is always safe to read (a resizing array isn't), then you can let it go.
- Data with multiple writers but no ordering dependencies has to use atomic access. We generally use a Mutex here or, when possible, interlocked primitives.
- Sometimes there are too many readers for a mutex to work well, in which case we queue up the operation to be executed with single-threaded logic.
- Sometimes a block of code is too big, or we are too lazy to analyze it, so we queue that up too.
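The queue-it-up heuristics above can be sketched as a small deferred-operation queue: any thread records an operation under a mutex, and the main thread runs the backlog later, once the workers have merged, so the queued operations themselves need no thread safety. (A minimal sketch using C++11 std::function and std::mutex; the class and method names are illustrative, not from the engine.)

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Operations queued from any thread, executed later with single-threaded
// logic (e.g. after the forked threads have combined back into the main one).
class DeferredOps {
public:
    // Multiple writers: atomic access via a mutex.
    void enqueue(std::function<void()> op) {
        std::lock_guard<std::mutex> lock(mutex_);
        ops_.push_back(std::move(op));
    }

    // Called from the main thread only, once no workers are running.
    void drain() {
        std::vector<std::function<void()>> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending.swap(ops_);  // take the batch, then release the lock fast
        }
        for (auto& op : pending) op();  // single threaded: no locks needed
    }

private:
    std::mutex mutex_;
    std::vector<std::function<void()>> ops_;
};
```

This trades latency for simplicity, which matches the slide's point: the deferred block runs a little later, but it never has to be made thread safe.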
