Practical Game Architecture for Multi-core Systems
By Katrina Archer
Today’s developers create games for a wide array of platforms—from mobile devices and the Web
to sophisticated multi-core processor systems like PCs and consoles such as Microsoft* Xbox 360*
and Sony* PlayStation* 3. When creating games for a single platform with fixed hardware, you
have the luxury of making the best possible use of the hardware configuration. Because of market
realities, many game developers do not possess the resources to build special-purpose game
engines for each platform. However, with a bit of ingenuity it's possible to design a flexible
game architecture that performs well on multiple different platforms.
Is Memory Architecture Important?
Personal computer systems as well as Xbox 360* and PlayStation* 3 consoles offer the power of
multi-core processor systems. These days, a PC can possess an arbitrary number of processors,
each with multiple cores. Three cores sit inside an Xbox 360* console, whereas the PlayStation* 3
has a single main core but eight synergistic processing elements (SPEs), of which six are usable by
developers.
Both PCs and the Xbox 360* have a Unified Memory Architecture (UMA). Conversely, the
PlayStation* 3 has a Non-unified Memory Architecture (NUMA). Each SPE in the PlayStation* 3
comes with a synergistic processing unit (SPU), 256 KB of local memory storage—all manually
managed—and no conventional memory cache. Data must be transferred using direct memory
access (DMA) and pulled from main memory into local memory on the SPE before any processing
occurs on the SPU.
Optimal Cross-platform Coding Considerations
UMA-style development does not work well on a NUMA system: algorithms that assume uniform memory access run inefficiently.
On the PlayStation* 3, incorrectly structured data leads to wasted cycles trying to move data back
and forth between main memory and the SPEs. If more than one SPE needs access to the same
data set in main memory, the risk of deadlocks increases, resulting in a need for complicated
scheduling algorithms. The size of the data set becomes important: Keeping data in small chunks
avoids memory access deadlocks.
An algorithm optimized for processor speed that runs quickly on UMA but ignores data movement
costs results in wasted cycles on a NUMA system. Random access of memory consumes too many
cycles performing data transfers to local memory. Happily, discretizing data for optimal NUMA
DMA transfers produces better locality and boosts performance on a UMA platform because of
greater cache coherence.
Because of the limitations of the PlayStation* 3 design, developers writing games typically find
that it’s better to write code for the PlayStation* 3 and adapt it for the Xbox 360* and the PC. Doing
so involves using similar algorithms and data structures on all platforms, then on the Xbox 360*
and PC conditionally compiling out the specialized DMA code that performs the data transfer for
the PlayStation* 3.
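As an illustration of this approach, the following sketch (not from the article) processes game objects in fixed-size chunks and conditionally compiles the transfer code. The PLATFORM_PS3_SPU macro and the dma_get/dma_put wrappers are hypothetical names standing in for the platform-specific DMA API; the chunked loop is the part that stays common across platforms.

    #include <cstddef>

    // A small, self-contained record so each chunk fits easily in SPU local store.
    struct Transform { float pos[4]; float vel[4]; };

    static const std::size_t kChunkSize = 64;   // 64 records * 32 bytes = 2 KB per chunk

    #if defined(PLATFORM_PS3_SPU)
    // Hypothetical wrappers around the platform DMA API (names invented for this sketch).
    void dma_get(void* local, const void* main_mem, std::size_t bytes);
    void dma_put(void* main_mem, const void* local, std::size_t bytes);
    #endif

    // Same inner loop on every platform.
    void integrate_chunk(Transform* chunk, std::size_t count, float dt)
    {
        for (std::size_t i = 0; i < count; ++i)
            for (int j = 0; j < 3; ++j)
                chunk[i].pos[j] += chunk[i].vel[j] * dt;
    }

    void integrate_all(Transform* objects, std::size_t total, float dt)
    {
        for (std::size_t base = 0; base < total; base += kChunkSize)
        {
            const std::size_t count =
                (total - base < kChunkSize) ? (total - base) : kChunkSize;
    #if defined(PLATFORM_PS3_SPU)
            // NUMA path: pull the chunk into local store, process it, push it back.
            Transform local[kChunkSize];
            dma_get(local, objects + base, count * sizeof(Transform));
            integrate_chunk(local, count, dt);
            dma_put(objects + base, local, count * sizeof(Transform));
    #else
            // UMA path: work on main memory directly; the chunked access pattern
            // still improves cache locality on the Xbox 360* and PC.
            integrate_chunk(objects + base, count, dt);
    #endif
        }
    }

Because each chunk is small, the same layout that satisfies the SPU's local store also keeps the working set cache-friendly on the UMA platforms.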
Coarse- and Fine-grained Parallelism
Different games require different basic architectures. Many single-player games and even non-networked multiplayer games do a certain amount of work in a given period of time—either a
frame of rendering time or another convenient time step. Some threads, such as the renderer, may
run asynchronously. The game reads the controller input, updates the player-controlled
characters, artificial intelligence (AI) characters, and the game’s overall state machine, and then
sends the results to the renderer.
A coder can use the different processors and their cores, assigning various game systems to run in
parallel on different threads. The key is to keep each core or SPE as busy as possible while
minimizing bottlenecks and idle time. A networked game, as a result of latency and determinism
issues, uses different techniques beyond the scope of this article.
Coarse-grained Parallelism
Coarse-grained parallelism divvies up whole game systems at the thread level. Larger systems are
assigned to their own thread. On the Intel® Core™ i7 and Xbox 360*, each core offers two
hardware threads. The CPU maintains two contexts or register sets—one for each hardware
thread—and swaps quickly between each thread as needed.
One way of breaking down game systems puts all simulation work—controller, game state and
logic, and AI—in one thread, while rendering and audio are each in their own threads. For
example, on the Xbox 360*, simulation, rendering, and audio each get their own hardware thread
on a separate core (Figure 1).
Figure 1. Xbox 360* thread layout: simulation, rendering, and audio each take one hardware thread on Cores 1, 2, and 3, with the remaining hardware threads serving the fine-grained job queue.
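A minimal sketch of this coarse-grained split appears below; it is illustrative rather than production code. It uses std::thread for brevity, whereas a console title would use the platform's native threading API and pin each thread to a specific core, and the update functions are placeholder stubs.

    #include <atomic>
    #include <chrono>
    #include <thread>

    std::atomic<bool> g_running(true);

    // Placeholder per-system updates; a real engine does the actual work here.
    void update_simulation() { /* read input, update game state and AI */ }
    void update_rendering()  { /* build and submit the frame to the GPU */ }
    void update_audio()      { /* mix and submit audio buffers */ }

    // Run one system's update in a loop until shutdown is requested.
    void system_loop(void (*update)())
    {
        while (g_running.load())
        {
            update();
            std::this_thread::sleep_for(std::chrono::milliseconds(16)); // stand-in for frame pacing
        }
    }

    int main()
    {
        // One thread per major system, mirroring the layout in Figure 1.
        std::thread simulation(system_loop, &update_simulation);
        std::thread rendering(system_loop, &update_rendering);
        std::thread audio(system_loop, &update_audio);

        std::this_thread::sleep_for(std::chrono::seconds(1)); // pretend the game runs for a while
        g_running.store(false);

        simulation.join();
        rendering.join();
        audio.join();
        return 0;
    }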
The PlayStation* 3 requires a more hybrid approach (Figure 2). You can use the two hardware
threads on the single main core, and then allocate threads to each of the SPEs. One option puts the
simulation thread on one of the main core threads, with rendering on the other. However, doing so
causes the renderer to compete with the simulation thread, because unlike on Xbox 360*, they are
running on the same core. Because games are often render-bound, it helps to allocate an entire
SPE to aiding the main rendering thread by pipelining and sending work out to the graphics
processing unit (GPU), particularly for jobs that require heavy communication with the GPU.
Figure 2. PlayStation* 3 thread layout: simulation and rendering on the main core's two hardware threads, audio on SPE 1, and the fine-grained job queue distributed across SPE 2 through SPE 6.
Audio is an important component to parallelize, because neither the Xbox 360* nor the
PlayStation* 3 has much dedicated audio hardware. In addition, unlike simulation and rendering, where the
user experience degrades linearly with loss of performance, audio degrades all at once, usually in
the form of clicks and pops. Therefore, on the Xbox 360*, the audio thread is often assigned to its own
core; on the PlayStation* 3, audio has its own SPE.
Fine-grained Parallelism
Fine-grained parallelism breaks down work at the job or task level. A batching system manages the
distribution and queuing of jobs or tasks to different processors.
On the PC or the Xbox 360*, jobs are batched to the unused hardware threads on each available
core (Figure 3). A developer who uses three threads for the larger systems described above leaves
three secondary hardware threads on the Xbox 360* for fine-grained job queues. On the PC, where
you allocate jobs depends on how many cores you have at your disposal. On the PlayStation* 3, the
jobs are farmed out to one of the remaining unused SPEs.
Figure 3. PC thread layout: simulation, rendering, and audio each occupy a hardware thread on Cores 1 through 3, with the remaining hardware threads and any additional cores (Core 4 through Core N) available for the fine-grained job queue.
Good tasks to parallelize in this manner are physics, visibility culling, particle effects, and
animation.
With different high-level systems (e.g., simulation, rendering) creating finer-grained job batches
(e.g., physics, culling) to put into the queue, a savvy coder can facilitate load balancing by
designing the size of each batch to take roughly the same amount of computation time. In practice,
this requires extensive profiling and tuning to eliminate bottlenecks.
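A bare-bones version of such a queue is sketched below. The Job type, the blocking pop, and the worker loop are illustrative only; a production batching system would layer lock-free containers, priorities, and dependency tracking on top of this basic shape.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <utility>

    // One fine-grained job: a self-contained batch of work of roughly uniform cost.
    typedef std::function<void()> Job;

    class JobQueue
    {
    public:
        void push(Job job)
        {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                jobs_.push(std::move(job));
            }
            cv_.notify_one();
        }

        Job pop()
        {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !jobs_.empty(); });
            Job job = std::move(jobs_.front());
            jobs_.pop();
            return job;
        }

    private:
        std::queue<Job> jobs_;
        std::mutex mutex_;
        std::condition_variable cv_;
    };

    // Each worker runs on a spare hardware thread (or SPE) and drains the queue.
    void worker_loop(JobQueue& queue)
    {
        for (;;)
            queue.pop()();   // block until a batch is available, then execute it
    }

The coarse-grained threads push physics, culling, particle, and animation batches into the queue, and one worker runs on each spare hardware thread.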
A good batching heuristic normalizes the cost of each job based on the operational expense for
each primitive type. For example, in physics collision tests, sphere calculations are cheap,
cylinders are more expensive, and convex hulls require the most processing. The batch size sent to
the fine-grained thread changes depending on the type of primitive: The heuristic restricts how
many convex hulls appear in a single batch. Farming out smaller-sized but more expensive packets
allows each processor more opportunities to gather work. Packets with wildly varying execution
times cause starvation periods. With normalization, no one high-level system waits very long for
another system’s job to finish, resulting in higher overall throughput.
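One way to implement this normalization is sketched below; the cost weights are assumed values chosen only to illustrate that spheres are cheap and convex hulls expensive, and a real engine would derive them from profiling.

    #include <cstddef>
    #include <vector>

    enum PrimitiveType { SPHERE, CYLINDER, CONVEX_HULL };

    struct CollisionQuery { int objectA; int objectB; PrimitiveType type; };

    // Assumed relative costs per primitive type (illustrative values only).
    static int query_cost(PrimitiveType type)
    {
        switch (type)
        {
        case SPHERE:      return 1;
        case CYLINDER:    return 3;
        case CONVEX_HULL: return 10;
        }
        return 1;
    }

    // Split queries into batches of roughly equal total cost, so expensive primitives
    // end up in smaller batches and cheap ones in larger batches.
    std::vector< std::vector<CollisionQuery> >
    build_batches(const std::vector<CollisionQuery>& queries, int cost_budget)
    {
        std::vector< std::vector<CollisionQuery> > batches(1);
        int accumulated = 0;
        for (std::size_t i = 0; i < queries.size(); ++i)
        {
            if (accumulated >= cost_budget)        // close the batch once the budget is hit
            {
                batches.push_back(std::vector<CollisionQuery>());
                accumulated = 0;
            }
            batches.back().push_back(queries[i]);
            accumulated += query_cost(queries[i].type);
        }
        return batches;
    }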
Physics’ computationally intensive calculations involve integrating over constraints, doing
collision detection, and then resolving interpenetrations of objects. Often, you must iterate over a
number of game objects to determine whether object A intersects object B. The first parallelization
step assembles the data for an arbitrary number of game objects into packets of collision queries.
The job queue then sends the filled packets off for collision detection processing on the fine-grained thread.
Visibility and occlusion culling offload well to a separate thread. In the case of the PlayStation* 3, a
brute-force linear culling algorithm caters more to the strengths of the architecture than more
sophisticated hierarchical techniques such as binary space partitioning (BSP) trees, octrees,
spatial partitioning, portals, or clustering. Tree algorithms are cache-bound and have poor data
locality. A simple linear box-to-box visibility test takes as few as five cycles on a single instruction,
multiple data (SIMD) processor like the SPU. Packaging hundreds of boxes together allows very
quick culling of thousands of objects. This type of test is also stateless and doesn’t require knowing
the previous positions of any other object. From frame to frame, the cost remains consistent,
unlike the hierarchical algorithms, which require tree rebalancing or pruning of the data set.
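A plain scalar version of this kind of linear test is sketched below. On the SPU or with SSE intrinsics the same loop would process four boxes per iteration; the scalar form is shown here because it makes the stateless, fixed-cost structure obvious. The single query box standing in for the view volume is an assumption made for the example.

    #include <cstddef>
    #include <cstdint>

    // Axis-aligned bounding box in world space.
    struct AABB { float min[3]; float max[3]; };

    // Linear, stateless visibility pass: test every object's box against one query box
    // and record a flag. No trees, no per-frame history, and a predictable cost per box,
    // which makes it easy to package hundreds of boxes into a single DMA-friendly job.
    void cull_boxes(const AABB* boxes, std::size_t count, const AABB& query,
                    std::uint8_t* visible)
    {
        for (std::size_t i = 0; i < count; ++i)
        {
            bool overlap = true;
            for (int axis = 0; axis < 3; ++axis)
            {
                if (boxes[i].max[axis] < query.min[axis] ||
                    boxes[i].min[axis] > query.max[axis])
                {
                    overlap = false;
                    break;
                }
            }
            visible[i] = overlap ? 1 : 0;
        }
    }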
Animation jobs, batched on a per-character basis, also parallelize well. Types of operations to
throw onto the queue include inverse kinematics (IK), head-tracking, skeletal matrix evaluation,
and pose blending. Asynchronous evaluation of particle systems is also a perfect task for another
thread.
Discussion
It would be easy to conclude that the PlayStation* 3, with one main core and six usable SPEs, offers
a significant advantage in performance over the Xbox 360* system’s three main cores. However,
with careful tuning, the coarse-grained primary thread work runs faster on the Xbox 360* because
of the general-purpose nature of each core. On the PlayStation* 3, the coarse-grained threads’
performance on the main CPU suffers because of core contention.
Given the rapid advance in PC hardware and the number of years the Xbox 360* and
PlayStation* 3 consoles have been on the market, today’s personal computers often easily
outperform the consoles. The overall performance on a PC depends on the total number of cores in
the system and how well the game is threaded.
Why Use a Mixed Coarse/Fine Approach Instead of a Completely Distributed Approach?
It may seem tempting to develop a game system using a completely distributed approach by
breaking down each task or job into similarly sized chunks of work: Certainly, it looks simple on
the surface and algorithmically elegant. A fine-grained system is highly scalable—an advantage
as computer architectures add more and more cores. This is a consideration for developers
creating games to run on personal computers.
As the granularity becomes smaller, more work can be parallelized, providing a wider array of
choices for scheduling and load balancing. Increasing the number of tasks, however, comes with a
cost: more overhead for context switching and synchronization and increased memory usage.
The variety of algorithms, dependencies, CPU, and memory usage patterns in game engines makes
it difficult to design a one-size-fits-all general tasking system. Many operations have a strict order
of dependence. Without complex priority queues, increased parallelism can introduce non-determinism into the game loop, making the flow of code more difficult to follow—an important
consideration during the quality assurance phase of a product when large numbers of bugs must
be fixed in a few short months.
To produce compelling onscreen characters that behave in interesting ways, the game loop does a
large amount of conditional event evaluation that can look at any random game object. The
intertwined subsystems in AI and simulation access disparate data in a cache-unfriendly manner.
Properly parallelizing these operations requires making each and every simulation variable
thread-safe. The theoretical performance gained by parallelizing the AI tasks tends to be offset by
the cache misses and latency induced by the thread mutexes. Little available PlayStation* 3 or
Xbox 360* performance is left “on the table” by keeping the lion’s share of simulation and AI in a
single thread. This is partly why even though rendering performance increased dramatically with
the hardware shift from single processor to multi-core, gamers haven’t seen the same exponential
increase in the number of characters shown on screen for the PlayStation* 3 or Xbox 360*.
The coarse/fine system described earlier scales fairly well, because you can keep adding fine-grained parallelism, if needed. For example, on a PC with eight cores instead of two or four cores,
you could offload more particle simulation tasks to an individual core rather than letting the
rendering thread deal with those tasks on its own.
Conclusion
In gaming, it’s important to architect a flexible system, because tasks wind up taking different
amounts of time depending on what the player does at any given moment. When large explosions
fill the screen with particle effects, the game becomes rendering-bound. But when more characters
spawn to attack the player or during large crowd scenes, the game becomes simulation-bound.
In the mixed approach described earlier, a well-designed queuing system determines which
coarse-grained thread needs to offload more fine-grained tasks at any given time. When combined
with accurate profiling results from tools like Intel® Graphics Performance Analyzers, this offers a
developer greater choice in making the best use of each core to give the player the most
immersive experience possible.
For More Information
“On Processors, Cores and Hardware Threads” at http://software.intel.com/en-us/blogs/2008/10/29/on-processors-cores-and-hardware-threads
“Application-customized CPU design: The Microsoft Xbox 360 CPU story” at
http://www.ibm.com/developerworks/power/library/pa-fpfxbox
“The Cell Broadband Engine: Exploiting multiple levels of parallelism in a chip
multiprocessor,” International Journal of Parallel Programming, June 2007 at
http://domino.research.ibm.com/library/cyberdig.nsf/papers/1B2480A9DBF5B9538525723D0051A8C1/$File/rc24128.pdf