XBOX 360 CPU Details Described at MPR Fall Processor Forum
I was the Chief Engineer for the development of the XBOX 360 CPU chip, and on Tuesday, October 25, I presented details about the chip and how we developed it at Microprocessor Report’s Fall Processor Forum. Here’s some of what I presented.
XBOX 360 CPU Project
Back in November of 2003 Microsoft and IBM made the following announcement.
REDMOND, Wash., and EAST FISHKILL, N.Y. — Nov. 3, 2003 — Microsoft Corp. today announced that it has entered into a semiconductor technology agreement with IBM Corp. Under the agreement, Microsoft has licensed leading-edge semiconductor processor technology from IBM for use in future Xbox® products and services to be announced at a later date.
That announcement was really about the development and manufacturing of the CPU chip for the XBOX 360 console.
Microsoft’s engagement with IBM was through IBM’s Engineering and Technology Services Division. As you may be aware IBM’s Services offerings are a significant portion of IBM’s total revenue. E&TS is one of the newest parts of that business.
Through E&TS, Microsoft was able to take advantage of the significant investment in Research and Development that IBM makes for system design, component design, and semiconductor process technology.
Microsoft brought their gaming domain knowledge and software development experience and worked with E&TS to apply our capabilities to the task of developing a custom processor solution. We built on IBM’s extensive portfolio of established designs and research projects to develop a personalized system architecture to match the XBOX 360 system vision, and engineered it for use in a consumer product.
Chip Overview
The CPU chip contains a 3-way symmetric multi-processor running at 3.2 GHz.
The 3 processors share a 1 MB L2 cache and a Front Side Bus which connects the CPU chip to the ATI graphics chip. The Front Side Bus has a peak bandwidth of 21.6 GByte/sec.
The chip also includes a significant portion of support logic that provides test, Power On Reset control, and debug and trace functions. There are eFuses used for array redundancy to improve manufacturing yield. We also used eFuses for configurable voltage control and for parametric adjustment in the analog units.
The chip IOs provide the following:
• Front Side Bus
• Debug access to trace array data, performance monitor counters, and critical control and timing information,
• JTAG,
• PowerOn Status condition codes,
• the voltage identifier for the variable voltage regulator which supplies the CPU chip,
• EEPROM attach to hold configuration control data if it turned out to be necessary.
Power PC Core
At 3.2 GHz this is the highest frequency Power PC architecture core IBM is shipping.
The CPU core is a dual-issue, in-order execution micro-architecture with simultaneous multi-threading and support facilities for 2 threads. Because dynamic power consumption is key, we implemented extensive clock gating to shut down pipelines until instructions are active.
The L1 icache is a 32 KByte cache with parity error checking. It is a 2-way set associative cache with 128B lines. First-level translation for instruction addresses is done using a 64-entry, 2-way set associative effective-to-real address translation cache (ERAT).
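As a quick check on the icache geometry, the sketch below derives the number of sets and the address bit split from the stated size, associativity, and line length. The function name and the conventional layout (line offset in the low bits, set index above it) are assumptions for illustration and are not taken from the chip documentation.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes all three parameters are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    return sets, offset_bits, index_bits

# L1 icache as described: 32 KByte, 2-way, 128-byte lines
icache = cache_geometry(32 * 1024, ways=2, line_bytes=128)
print(icache)
```

Under these assumptions the icache works out to 128 sets, with 7 offset bits and 7 index bits per address.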
The 2 issued instructions can go to one of 5 execution pipes: Branch (which is really part of the Instruction unit), Load/Store, Fixed Point, Floating Point, and VMX. Difficult instructions are implemented via microcode. At dispatch they are cracked and converted into multiple micro-ops.
The branch unit includes a 4 KByte, 2-way set-associative Branch History Table per thread.
The Fixed Point pipe actually has two units. One handles the simple operations (add/sub, cmp, logical ops, and rotate). The other handles the complex ops like multiply/divide.
The Load/Store pipe handles access to the L1 Data cache and the storage hierarchy.
Like the L1 icache, the L1 dcache is a 32 KByte cache with parity error checking. However, it is 4-way set associative. It is store-through and provides non-blocking access, so a cache miss does not hold up a subsequent hit.
1st level Data address translation is handled by a 64 entry 2-way associative ERAT. 2nd level translation for both data and instructions is handled by a 1K 4-way associative TLB which can be software as well as hardware managed.
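The two-level lookup can be sketched as a small simulation. The table sizes come from the text (64-entry 2-way ERAT, 1K-entry 4-way TLB). The 4 KByte page size, the LRU replacement, the modulo set indexing, and every class and function name are assumptions made for illustration, and the sketch collapses the PowerPC effective/virtual/real distinction into a single step.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumption: 4 KByte pages; the talk does not give a page size

class SetAssocTable:
    """Tiny set-associative translation table with LRU replacement (assumed policy)."""
    def __init__(self, entries, ways):
        self.ways = ways
        self.num_sets = entries // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def lookup(self, vpn):
        s = self.sets[vpn % self.num_sets]
        if vpn in s:
            s.move_to_end(vpn)          # mark as most recently used
            return s[vpn]
        return None

    def insert(self, vpn, rpn):
        s = self.sets[vpn % self.num_sets]
        if vpn not in s and len(s) >= self.ways:
            s.popitem(last=False)       # evict the least recently used entry
        s[vpn] = rpn
        s.move_to_end(vpn)

def translate(ea, erat, tlb, page_table):
    """Return (real_address, level), where level names where the hit occurred."""
    vpn, offset = ea >> PAGE_SHIFT, ea & ((1 << PAGE_SHIFT) - 1)
    rpn, level = erat.lookup(vpn), "ERAT"
    if rpn is None:
        rpn, level = tlb.lookup(vpn), "TLB"
        if rpn is None:
            rpn, level = page_table[vpn], "walk"  # stands in for a HW or SW table walk
            tlb.insert(vpn, rpn)
        erat.insert(vpn, rpn)           # refill the first level on any ERAT miss
    return (rpn << PAGE_SHIFT) | offset, level

erat = SetAssocTable(entries=64, ways=2)     # 32 sets
tlb = SetAssocTable(entries=1024, ways=4)    # 256 sets
page_table = {0x10: 0x80}                    # hypothetical mapping
first = translate(0x10ABC, erat, tlb, page_table)
second = translate(0x10ABC, erat, tlb, page_table)
print(first, second)
```

The first access misses both levels and falls through to the table walk; the repeat access hits in the ERAT.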
VMX 128
We developed a Microsoft unique implementation of VMX called VMX128 which focused on improving graphics, game physics, and artificial intelligence.
Power management within the FPU / VMX128 units is especially valuable as it is rare that all three cores would be running threads with active numeric computation.
We implemented a Delayed Execution Issue Queue which reduces the effective load latency to 2 cycles vs. 8-10 cycles without it. There are separate load target buffers for the FPU and VMX128 units that essentially enable out-of-order FP/VMX execution relative to loads and stores.
We made a number of architectural changes to the VMX unit when we created VMX128. We extended the number of Vector Registers from 32 to 128. All 128 Registers are directly-addressable and the original 32 Registers are mapped to the first 32 entries of 128-entry vector register file. We also added a number of instructions:
• floating-point dot-product instructions supporting 3-vectors and 4-vectors
• Permute-class instructions for rotate and insert operations
• Pack / unpack instructions for converting Direct3D data types to/from single-precision FP format
• storage access instructions to improve access to misaligned data
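To make the first item in the list concrete, here is a plain-Python sketch of what 3-vector and 4-vector dot products compute on 4-lane registers. The function names are hypothetical and are not VMX128 mnemonics, and the sketch says nothing about how the hardware places the result in the destination register.

```python
def dot3(a, b):
    """Dot product over the first three lanes of two 4-lane vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dot4(a, b):
    """Dot product over all four lanes of two 4-lane vectors."""
    return sum(x * y for x, y in zip(a, b))

# Addressing 128 registers directly needs 7 bits per operand, versus 5 bits for 32.
REG_BITS_32 = (32 - 1).bit_length()
REG_BITS_128 = (128 - 1).bit_length()

a = (1.0, 2.0, 3.0, 4.0)
b = (5.0, 6.0, 7.0, 8.0)
print(dot3(a, b), dot4(a, b), REG_BITS_32, REG_BITS_128)
```

In game code the 3-lane form typically shows up in lighting and normal calculations, and the 4-lane form in matrix-vector transforms, which is why a single instruction for each is attractive.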
Finally, we maintained binary compatibility with a subset of the original PowerPC ISA.
Shared L2
The shared 1 MB L2, which supports the three CPUs, is split into two portions. One part connects the CPUs with the different dataflow queues and runs at the processor frequency of 3.2 GHz. The rest of the L2, including the data arrays and the directory, runs at ½ the processor frequency.
Commands from the 3 cores are queued and then arbitrated into a L2 directory control unit for processing. In order to improve caching performance there are two copies of the directory, allowing simultaneous core access and IO snoops. The directories have parity based error detection.
Cacheable and Cache Inhibited store operations are processed through different pipelines. The cacheable store pipe includes 8 store gathering buffers per core. These 8 line buffers are non-sequential to improve performance. The non-cacheable store pipe includes 4 store gathering buffers per core. These 4 line buffers are sequential and simplify ordering for non-cacheable ops.
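The idea behind store gathering can be sketched as follows: individual stores that fall in the same cache line are merged in a line buffer, so the L2 sees one line-sized write instead of many small ones. The buffer count of 8 is from the text. The 128-byte line size, the oldest-first flush policy, and the class and attribute names are assumptions for illustration only; the real buffers are described as non-sequential and their allocation policy is not given here.

```python
LINE_BYTES = 128  # assumption: same line size as the L1 icache

class StoreGatherer:
    """Merge byte stores into per-line buffers; flush the oldest when all are in use."""
    def __init__(self, num_buffers=8):
        self.num_buffers = num_buffers
        self.buffers = {}    # line address -> {offset: byte}, in allocation order
        self.flushed = []    # (line address, number of bytes gathered) sent to the L2

    def store(self, addr, value):
        line = addr // LINE_BYTES * LINE_BYTES
        offset = addr % LINE_BYTES
        if line not in self.buffers:
            if len(self.buffers) == self.num_buffers:
                oldest = next(iter(self.buffers))          # assumed policy: oldest first
                self.flushed.append((oldest, len(self.buffers.pop(oldest))))
            self.buffers[line] = {}
        self.buffers[line][offset] = value & 0xFF          # later stores overwrite earlier ones

g = StoreGatherer(num_buffers=8)
for offset in range(LINE_BYTES):          # 128 single-byte stores to one line
    g.store(0x1000 + offset, offset)
for n in range(1, 9):                     # touch 8 further lines to force one flush
    g.store(0x1000 + n * LINE_BYTES, 0)
print(g.flushed)
```

In this run the 128 byte-sized stores to the first line leave the gatherer as a single full-line write.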
The L2 Data array includes Single Bit Correct / Double Bit Detect ECC.
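Single-bit-correct / double-bit-detect behavior can be illustrated with a toy extended Hamming code over 4 data bits. This is only a sketch of the general technique: the word width, bit layout, and check matrix actually used in the L2 arrays are not given in this talk, and the function names are hypothetical.

```python
def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                    # index 0: overall parity; 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2       # makes the parity of the whole word even
    return code

def secded_decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos           # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1           # single-bit error; syndrome 0 is the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"  # nonzero syndrome with even parity: double-bit error
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status

word = secded_encode(0b1011)
one_flip = list(word); one_flip[5] ^= 1
two_flips = list(one_flip); two_flips[6] ^= 1
print(secded_decode(word), secded_decode(one_flip), secded_decode(two_flips))
```

A single flipped bit is located by the syndrome and repaired, while two flipped bits leave the overall parity even and are reported rather than miscorrected.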
Included in the L2 cache architecture are several features to support high bandwidth data streams. To improve read streaming bandwidth we focused on two things.
1. We added an Extended Data Cache Block Touch instruction which allows a data prefetch to bypass the L2 and go directly into the L1. This significantly reduces the L2 thrashing which can be an issue for prefetching with smaller L2s.
2. We also implemented an aggressive hit under miss capability in the core so that each core can have up to 8 loads outstanding.
To improve write streaming bandwidth we focused on three features:
1. Within the core, the L1s are write-through, so writes do not allocate a line into the L1.
2. Within the L2, which is 8-way set associative, we provide a configurable L2 set locking capability that ensures that streaming through the locked set does not thrash the rest of the cache.
3. Finally, to support procedural geometry, modified data within the L2 can be read by the GPU without forcing a store to memory, which could cause a change of ownership or eviction of the line. One of the key design objectives was high sustained bandwidth for this GPU read operation.
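The effect of the locking feature in item 2 can be sketched with a single 8-way congruence class in which streaming fills are confined to a reserved portion. This is a loose model under stated assumptions: it treats the locked "set" as one of the eight ways, uses LRU within each pool, and keeps streaming and normal lines strictly separate. The class and method names are hypothetical, and the real configuration mechanism is not described in this talk.

```python
class LockedWaySet:
    """One 8-way congruence class with some ways reserved for streaming data."""
    def __init__(self, ways=8, stream_ways=1):
        self.normal = []                       # tags in LRU order, oldest first
        self.stream = []
        self.normal_cap = ways - stream_ways
        self.stream_cap = stream_ways

    def access(self, tag, streaming=False):
        """Return True on a hit; on a miss, allocate only within the matching pool."""
        if streaming:
            pool, cap = self.stream, self.stream_cap
        else:
            pool, cap = self.normal, self.normal_cap
        if tag in pool:
            pool.remove(tag)
            pool.append(tag)                   # move to most recently used
            return True
        if len(pool) >= cap:
            pool.pop(0)                        # evict LRU within this pool only
        pool.append(tag)
        return False

cache_set = LockedWaySet(ways=8, stream_ways=1)
for t in range(7):
    cache_set.access(("game", t))              # working set fills the 7 unlocked ways
for t in range(100):
    cache_set.access(("stream", t), streaming=True)
survivors = sum(cache_set.access(("game", t)) for t in range(7))
print(survivors)
```

Even after 100 streaming lines pass through, all 7 lines of the ordinary working set still hit, which is the behavior the locking feature is meant to guarantee.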
Front Side Bus
The Front Side Bus Architecture developed by IBM was fully customized for the XBOX 360 gaming platform in order to meet throughput and functional requirements. The Link Architecture utilized a specialized packet structure with automatic hardware managed flow control, error recovery, link training, and link management.
IBM took an end-to-end approach to the Front Side Bus architecture and development. This includes design, verification, and test owned by IBM with half of the link existing within ATI’s GPU. In fact, a common VHDL description, designed by IBM, is instantiated in the two chips even though both chips are built with very different methodology, technology, frequency, and data widths.
The transaction layer provides a common functional interface to the two chips. It manages the Link Layer protocol for reliable packet delivery. It also performs command reordering and manages the two virtual channels. The two virtual channels are used for request and response and were architected primarily for deadlock avoidance but they also allow configurable performance by setting channel priority.
The Link Layer provides link training, error detection and retransmission, as well as flow control. We architected a beefed-up soft error recovery mechanism to support the use of lower cost manufacturing components. In addition, because the memory containing the boot program resides across the link at the GPU or below, the link initialization must be bulletproof without SW intervention.
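Error detection with retransmission can be sketched generically as below: each packet carries a sequence number and a check code, the receiver discards anything that fails the check, and the sender replays it. This is not the actual Front Side Bus protocol. The packet layout, the use of CRC-32 from Python's `zlib`, the stop-and-wait replay, and all function names are assumptions chosen to keep the example short.

```python
import zlib

def make_packet(seq, payload):
    """Build a packet: 1-byte sequence number, payload, 4-byte CRC-32 trailer."""
    body = bytes([seq]) + payload
    return body + zlib.crc32(body).to_bytes(4, "big")

def check_packet(packet):
    """Return (seq, payload) if the CRC matches, otherwise None."""
    body, crc = packet[:-4], packet[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return None
    return body[0], body[1:]

def send_reliably(payloads, corrupt_attempts):
    """Deliver non-empty payloads in order.

    corrupt_attempts is a set of transmission indexes to damage on the wire.
    Returns (delivered payloads, total transmissions).
    """
    delivered, attempt = [], 0
    for seq, payload in enumerate(payloads):
        while True:
            packet = bytearray(make_packet(seq % 256, payload))
            if attempt in corrupt_attempts:
                packet[1] ^= 0x01              # inject a single-bit error
            attempt += 1
            result = check_packet(bytes(packet))
            if result is not None:             # receiver accepts; sender drops its copy
                delivered.append(result[1])
                break
            # otherwise the receiver discards the packet and the sender retransmits
    return delivered, attempt

delivered, transmissions = send_reliably([b"ab", b"cd"], corrupt_attempts={0, 2})
print(delivered, transmissions)
```

Because recovery is driven entirely by the check code and replay, nothing above the link has to notice the error, which matches the requirement that the link work without SW intervention.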
The Front Side Bus physical layer is structured as two unidirectional links, each capable of transmitting 10.8 GByte/sec. Each link is made of two single-byte lanes. Each lane has one clock. The links are source synchronous, so the receive clock is sent with the data.
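The per-link and peak figures follow from the lane structure. The short calculation below assumes each single-byte lane moves one byte per transfer at the 5.4 GHz Front Side Bus rate quoted in the verification section; the constant names are illustrative.

```python
TRANSFER_RATE_GT_S = 5.4   # assumed transfers per second per lane, in billions
LANES_PER_LINK = 2         # two single-byte lanes per unidirectional link
BYTES_PER_LANE = 1
LINKS = 2                  # one link in each direction

per_link_gbyte_s = TRANSFER_RATE_GT_S * LANES_PER_LINK * BYTES_PER_LANE
peak_gbyte_s = per_link_gbyte_s * LINKS
print(f"{per_link_gbyte_s:.1f} GByte/sec per link, {peak_gbyte_s:.1f} GByte/sec peak")
```

This reproduces the 10.8 GByte/sec per direction and the 21.6 GByte/sec peak bandwidth given in the chip overview.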
The most demanding portions of the PHY design are the analog transmitters and receivers. The analog components are implemented using Current Mode Logic which supports the very low jitter and high noise tolerance required.
Termination on the link to improve link signaling quality is controlled dynamically at link training. Low-tolerance resistors are dynamically switched in and out to adjust the termination to 50 ohms.
The physical link specification included the receiver and transmitter performance, the chip package, and the board parametrics, layout, and wiring constraints. The specification was created by IBM and used by Microsoft to design the system board.
CPU Chip Package
The design of the CPU package presented a significant challenge due to the combination of the power environment, the high frequency operation of the CPU, the Front Side Bus frequency, and high volume low cost system card and chip package goals.
The custom package design is a 2-2-2 Flip Chip PBGA which is 31mm by 31mm and supports the 2s,2p system card.
In order to operate the Front Side Bus reliably, the package had to support aggressive targets for differential signal attenuation between the package ball and the C4, loss due to reflection within the package, and cross talk between adjacent signal pairs.
The package was designed to provide power distribution to the CPU chip with no greater than 80mV droop at the circuit.
In the end the PHY design, FSB architecture, the link specification, and the package design all worked together to close on a solution for the system that could be manufactured.
Test and Debug Features
No serious CPU chip of this complexity would be complete without comprehensive test and debug features.
The XBOX 360 CPU includes support logic for Array and Logic Built-In Self-Test. AC BIST operates at full functional frequency, which allows for maximum defect coverage including marginally slow circuits. The Analog PHY is functionally tested by an internal wrap test called PING BIST. This test also operates at full PHY frequency.
The chip includes internal trace arrays which allow thousands of key internal signals to be traced. Extensive pattern matching for trigger conditions provides an extremely useful logic analyzer. The external debug bus provides a way to collect extended traces beyond what can be held within the on-board trace arrays.
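The logic-analyzer behavior of a trace array can be sketched as a fixed-depth ring buffer that records continuously and stops a set number of samples after a trigger pattern matches. The depth, the post-trigger count, the trigger predicate, and the function name are all hypothetical; the real trace arrays and their trigger logic are far richer than this.

```python
from collections import deque

def capture_trace(samples, trigger, depth=8, post_trigger=2):
    """Record samples into a ring buffer and stop post_trigger samples after the trigger.

    Returns the buffer contents, oldest first, so the trace shows what led up
    to the trigger as well as what followed it.
    """
    buf = deque(maxlen=depth)       # old entries fall off the front automatically
    remaining = None                # None until the trigger has fired
    for sample in samples:
        buf.append(sample)
        if remaining is None:
            if trigger(sample):
                remaining = post_trigger
        else:
            remaining -= 1
        if remaining == 0:
            break
    return list(buf)

# Hypothetical signal stream: trigger when the traced value equals 10
trace = capture_trace(range(20), trigger=lambda s: s == 10, depth=4, post_trigger=2)
print(trace)
```

Keeping the samples from just before the trigger is what makes this style of capture useful for diagnosing a failure after the fact.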
Finally, to support performance tuning of the gaming applications and environment, we implemented a set of performance counters which can be set to collect event counts. The set of performance events, which number in the hundreds, was defined during chip development to support Microsoft’s performance team.
All three of these key features were utilized during our accelerated hardware and system bring-up and validation. In fact, they were one of the key enablers for success.
Another key enabler was the extensive verification effort prior to the release of design data to manufacturing.
Verification and Bringup
Success on this program required that the CPU Chip be right the first time. Pass 1 hardware needed to be fully functional. That means the front side bus had to run at 5.4GHz, the CPU had to run at 3.2 GHz, and caches had to be enabled and operational.
One week after the CPU powered on in a bring-up system a demo game was running with full chip functionality. So how did we do it?
Our strategy was pretty simple.
• Do as much as we can in parallel to make the most progress on a short schedule
• Structure things hierarchically so bugs are found in environments where they can be diagnosed and fixed the quickest.
We took advantage of different methodologies at the unit & subsystems level and then created a unified methodology at chip & system levels.
At each level we had quality measurement standards based on coverage, test suite, simulation cycles, and bug rate. They provided both a bottom-up events coverage view and a top-down architecture view of the coverage.
We held extensive reviews, leveraging the experience and knowledge of IBM corporate verification experts. We even went so far as to take the advice of the reviewers and made changes to our plans, our staffing, and our tools!
We brought to bear the best of IBM’s knowledge on verification by tapping into our Research and Development teams. This included using established uni- and multi-processor test suites and intelligent randomized test generation. We also used formal verification tools and methods to prove key parts of the chip architecture were correct.
We validated many system level operations via co-simulation between Microsoft and IBM. This way we were confident the system power on reset, system level coherency, boot ROM code, and key parts of the kernel would operate correctly during bring-up.
Bring-up itself was done in three locations where each focused on different parts of the overall tasks. In many ways it was a truly joint effort with engineers from multiple companies working together and bringing the best tools to solve the problems.
Success
We developed an XBOX 360 unique CPU chip, engineered and optimized for the specific product constraints. We went from 1st silicon to volume production in 8 months and if you get in line early enough you might be able to buy one November 22.


Cool. I was wondering if your CPU had instructions to compute polygons, or was that where the microcode came in?
For a game console, power consumption is key, even more so than for regular PCs.
Can you tell us at what voltage the chip runs?
The CPU does not have instructions that are that specific. It has the “standard” PowerPC instruction set plus some extensions in the VMX128, for example dot-product instructions. The microcoded ops I was referring to are really the “complex” PowerPC ops.
The chip has a variable voltage supply so that each chip can operate at its own voltage. The chip’s voltage is determined at manufacturing to ensure operation to spec and keep the power as low as possible. I won’t be any more specific about the voltage range or power targets as those are Microsoft system issues.
I couldn’t find the spec for the VMX-128 ISA. I am assuming that to preserve binary compatibility with the original VMX, you had to overload the lower 5 bits of conventional VMX instruction strings in order to add the additional 8 bits to address the increased register file (2 extra bits/register, so 2*4 = 8). But the 5 lower bits is clearly not enough. The other 3 bits had to come from the upper portion of the instruction format. This, of course, implies that you had to throw out certain already existing VMX instructions to “borrow” those bits - compatible with your statements so far. What isn’t clear to me is which instructions *exactly* did you have to throw out?
Your response would be much appreciated. Thanks!
excuse me, my bad: [0-5] it’s not 5 bits but 6. So we’re talking about 2 “borrowed” bits here. Sorry for confusion.
Which instructions did you have to throw out of the conventional VMX ISA to accommodate for the increased register file and the new instructions in VMX-128?
How will the processor compare with the one in the new Wii console due to be released? Is the processing speed/power likely to change drastically, or will it be software that makes the largest contribution to performance?
I am intrigued by the bus between the processors. I have been designing processor boards for many years and I have never seen this. It’s 24 pairs of differential lines. If that’s the communication between the two, I am impressed. Does anyone know what this bus is called?
Tia,
Les