Digital Foundry vs. the Xbox One architects

"There's a lot of misinformation out there and a lot of people who don't get it. We're actually extremely proud of our design."

Feature by Richard Leadbetter, Technology Editor, Digital Foundry

Updated on 24 Sep 2013

With the next-generation consoles still two months from release, many have already made up their minds about which machine offers more gaming power, before a single game has shipped. Compare basic graphics and memory bandwidth specs side-by-side and the contest looks entirely one-sided - PlayStation 4 comprehensively bests Xbox One to such a degree that sensible discussion of the respective merits of both consoles seems impossible. They're using the same core AMD technologies, only Sony has faster memory and a much larger graphics chip. But is it really that simple?

In the wake of stories from unnamed sources suggesting that PS4 has a significant advantage over its Xbox counterpart, Microsoft wanted to set the record straight. Last Tuesday, Digital Foundry dialled into a conference call to talk with two key technical personnel behind the Xbox One project - passionate engineers who wanted the opportunity to put their story across in a deep-dive technical discussion where all the controversies could be addressed. Within moments of the conversation starting, it became clear that balance would be the theme.

"For designing a good, well-balanced console you really need to be considering all the aspects of software and hardware. It's really about combining the two to achieve a good balance in terms of performance," says Microsoft technical fellow Andrew Goossen.

"We're actually very pleased to have the opportunity to talk with you about the design. There's a lot of misinformation out there and a lot of people who don't get it - we're actually extremely proud of our design. We think we have very good balance, very good performance, we have a product which can handle things other than just raw ALU [GPU compute power]. There are also quite a number of other design aspects and requirements that we put in around things like latency, steady frame-rates and that the titles aren't interrupted by the system and other things like that. You'll see this very much as a pervasive ongoing theme in our system design."

Xbox One: additional processors and the audio block

Microsoft's recent Hot Chips 25 presentation on the Xbox One processor suggested that the chip had 15 processors on-board. We were curious as to how that broke down.

"On the SoC, there are many parallel engines - some of those are more like CPU cores or DSP cores. How we count to fifteen: [we have] eight inside the audio block, four move engines, one video encode, one video decode and one video compositor/resizer," says Nick Baker.

"The audio block is completely unique. That was designed by us in-house. It's based on four tensilica DSP cores and several programmable processing engines. We break it up as one core running control, two cores running a lot of vector code for speech and one for general purpose DSP. We couple that with sample rate conversion, filtering, mixing, equalisation, dynamic range compensation then also the XMA audio block. The goal was to run 512 simultaneous voices for game audio as well as being able to do speech pre-processing for Kinect."

But to what extent will this hardware actually see utilisation, especially in cross-platform games?

"So a lot of what we've designed for the system and the system reservation is to offload a lot of the work from the title and onto the system. You have to keep in mind that this is doing a bunch of work that is actually on behalf of the title," says Andrew Goossen.

"We're taking on the voice recognition mode in our system reservations whereas other platforms will have that as code that developers will have to link in and pay out of from their budget. Same thing with Kinect and most of our NUI [Natural User Interface] features are provided free for the games - also the Game DVR."

"Andrew said it pretty well: we really wanted to build a high performance, power-efficient box," adds hardware architecture team manager Nick Baker. "We really wanted to make it relevant to the modern living room. Talking about AV, we're the only ones to put in an AV in and out to make it media hardware that's the centre of your entertainment."

We've seen the Xbox One dash and the media functions are pretty cool, but first and foremost, it's all about the games. It's safe to say that there are two major areas of controversy surrounding the Xbox One design - specifically the areas in which it is considered weaker than the PlayStation 4: the memory set-up and the amount of GPU power on tap. Both systems have 8GB of RAM, but Sony chose 8GB of wide, fast GDDR5 with 176GB/s of peak throughput while Microsoft opted for DDR3, with a maximum rated bandwidth of just 68GB/s - clearly significantly lower. However, this is supplemented by on-chip ESRAM, which tops out at 204GB/s. In theory then, while marshalling and dividing resources between the two memory pools will be a factor, Xbox One clearly has its own approach for ensuring adequate bandwidth across the system.

Until we get our hands on the final hardware, Wired's internal photography of a pre-production Xbox One remains our only look inside the box. Rumour-mongers should note the lack of discrete GPU - there's to be no last minute addition of extra processing hardware. All of Xbox One's major systems are built into the single chip on the right, which is surrounded by the 2133MHz DDR3 modules.

Memory management is one of the most divisive points separating the two systems. The obvious question is this: if GDDR5 is the preferred set-up, why didn't Microsoft choose it? The firm remains cash-rich in the extreme and could clearly afford to pay the premium for GDDR5. We wondered whether it was fair to assume that this higher bandwidth RAM was ruled out very early on in the design process, and if so, why?

"Yeah, I think that's right. In terms of getting the best possible combination of performance, memory size, power, the GDDR5 takes you into a little bit of an uncomfortable place," says Nick Baker. "Having ESRAM costs very little power and has the opportunity to give you very high bandwidth. You can reduce the bandwidth on external memory - that saves a lot of power consumption and the commodity memory is cheaper as well so you can afford more. That's really a driving force behind that... if you want a high memory capacity, relatively low power and a lot of bandwidth there are not too many ways of solving that."

The combined system bandwidth controversy

Baker is keen to tackle the misconception that the team has created a design that cannot access its ESRAM and DDR3 memory pools simultaneously. Critics say that they're adding the available bandwidths together to inflate their figures and that this simply isn't possible in a real-life scenario.

"You can think of the ESRAM and the DDR3 as making up eight total memory controllers, so there are four external memory controllers (which are 64-bit) which go to the DDR3 and then there are four internal memory controllers that are 256-bit that go to the ESRAM. These are all connected via a crossbar and so in fact it will be true that you can go directly, simultaneously to DRAM and ESRAM," he explains.

The controversy surrounding ESRAM has taken the design team very much by surprise. The notion that Xbox One is difficult to work with is perhaps quite hard to swallow for the same team that produced Xbox 360 - by far and away the easier console to develop for, especially so in the early years of the current console generation.

"This controversy is rather surprising to me, especially when you view as ESRAM as the evolution of eDRAM from the Xbox 360. No-one questions on the Xbox 360 whether we can get the eDRAM bandwidth concurrent with the bandwidth coming out of system memory. In fact, the system design required it," explains Andrew Goossen.

"We had to pull over all of our vertex buffers and all of our textures out of system memory concurrent with going on with render targets, colour, depth, stencil buffers that were in eDRAM. Of course with Xbox One we're going with a design where ESRAM has the same natural extension that we had with eDRAM on Xbox 360, to have both going concurrently. It's a nice evolution of the Xbox 360 in that we could clean up a lot of the limitations that we had with the eDRAM.

"The Xbox 360 was the easiest console platform to develop for, it wasn't that hard for our developers to adapt to eDRAM, but there were a number of places where we said, 'gosh, it would sure be nice if an entire render target didn't have to live in eDRAM' and so we fixed that on Xbox One where we have the ability to overflow from ESRAM into DDR3, so the ESRAM is fully integrated into our page tables and so you can kind of mix and match the ESRAM and the DDR memory as you go... From my perspective it's very much an evolution and improvement - a big improvement - over the design we had with the Xbox 360. I'm kind of surprised by all this, quite frankly."

"The Xbox 360 was the easiest console platform to develop for, it wasn't that hard for our developers to adapt to eDRAM... [ESRAM] is very much an evolution and improvement... over the design we had with the Xbox 360."

Caption

Attribution

Indeed, the level of coherence between the ESRAM and the DDR3 memory pools sounds much more flexible than many previously thought. Many believed that the 32MB of ESRAM is a hard limit for render targets - so can developers really "mix and match" as Goossen suggests?

"Oh, absolutely. And you can even make it so that portions of our your render target that have very little overdraw... for example if you're doing a racing game and your sky has very little overdraw, you could stick those sub-sets of your resources into DDR to improve ESRAM utilisation," he says, while also explaining that custom formats have been implemented to get more out of that precious 32MB.

"On the GPU we added some compressed render target formats like our 6e4 [6 bit mantissa and 4 bits exponent per component] and 7e3 HDR float formats [where the 6e4 formats] that were very, very popular on Xbox 360, which instead of doing a 16-bit float per component 64bpp render target, you can do the equivalent with us using 32 bits - so we did a lot of focus on really maximising efficiency and utilisation of that ESRAM."

How ESRAM bandwidth doubled in production hardware

Further scepticism surrounds the sudden leap in ESRAM's bandwidth from an initial 102GB/s to where it is now - 204GB/s. We ran the story first based on a developer leak of a blog post the Microsoft tech team wrote back in April, but sections of "the internet" were not convinced. Critics say that the numbers don't add up. So how did the massive increase in bandwidth come about?

"When we started, we wrote a spec," explains Nick Baker. "Before we really went into any implementation details, we had to give developers something to plan around before we had the silicon, before we even had it running in simulation before tape-out, and said that the minimum bandwidth we want from the ESRAM is 102GB/s. That became 109GB/s [with the GPU speed increase]. In the end, once you get into implementing this, the logic turned out that you could go much higher."

The big revelation was that ESRAM could actually read and write at the same time, a statement that seemingly came out of the blue. Some believed that based on the available information from the leaked whitepapers, this simply wasn't possible.

"There are four 8MB lanes, but it's not a contiguous 8MB chunk of memory within each of those lanes. Each lane, that 8MB is broken down into eight modules. This should address whether you can really have read and write bandwidth in memory simultaneously," says Baker.

How ESRAM bandwidth is calculated

Memory bandwidth for next-gen consoles is clearly a hot topic in tech discussions. GDDR5 is a known technology and its capabilities in terms of throughput are well-known. ESRAM is a different matter entirely, and especially after Microsoft massively revised Xbox One's bandwidth figures upwards, there have been demands for the tech team to show their calculations.

Here, Nick Baker does just that:

"[ESRAM has four memory controllers and each lane] is 256-bit making up a total of 1024 bits and that in each direction. 1024 bits for write will give you a max of 109GB/s and then there's separate read paths again running at peak would give you 109GB/s.

"What is the equivalent bandwidth of the ESRAM if you were doing the same kind of accounting that you do for external memory? With DDR3 you pretty much take the number of bits on the interface, multiply by the speed and that's how you get 68GB/s. That equivalent on ESRAM would be 218GB/s. However just like main memory, it's rare to be able to achieve that over long periods of time so typically an external memory interface you run at 70-80 per cent efficiency.

"The same discussion with ESRAM as well - the 204GB/s number that was presented at Hot Chips is taking known limitations of the logic around the ESRAM into account. You can't sustain writes for absolutely every single cycle. The writes is known to insert a bubble [a dead cycle] occasionally... one out of every eight cycles is a bubble so that's how you get the combined 204GB/s as the raw peak that we can really achieve over the ESRAM. And then if you say what can you achieve out of an application - we've measured about 140-150GB/s for ESRAM.

"That's real code running. That's not some diagnostic or some simulation case or something like that. That is real code that is running at that bandwidth. You can add that to the external memory and say that that probably achieves in similar conditions 50-55GB/s and add those two together you're getting in the order of 200GB/s across the main memory and internally."

So 140-150GB/s is a realistic target and DDR3 bandwidth can really be added on top?

"Yes. That's been measured."

"Yes you can - there are actually a lot more individual blocks that comprise the whole ESRAM so you can talk to those in parallel. Of course if you're hitting the same area over and over and over again, you don't get to spread out your bandwidth and so that's one of the reasons why in real testing you get 140-150GB/s rather than the peak 204GB/s... it's not just four chunks of 8MB memory. It's a lot more complicated than that and depending on how the pattern you get to use those simultaneously. That's what lets you do read and writes simultaneously. You do get to add the read and write bandwidth as well adding the read and write bandwidth on to the main memory. That's just one of the misconceptions we wanted to clean up."

Goossen lays down the bottom line:

"If you're only doing a read you're capped at 109GB/s, if you're only doing a write you're capped at 109GB/s," he says. "To get over that you need to have a mix of the reads and the writes but when you are going to look at the things that are typically in the ESRAM, such as your render targets and your depth buffers, intrinsically they have a lot of read-modified writes going on in the blends and the depth buffer updates. Those are the natural things to stick in the ESRAM and the natural things to take advantage of the concurrent read/writes."

Microsoft's argument seems pretty straightforward then. In theory, Xbox One's circa 200GB/s of "real-life" bandwidth trumps PS4's 176GB/s peak throughput. The question is just to what extent channelling resources through the relatively tiny 32MB of the much faster ESRAM is going to cause issues for developers. Microsoft's point is that game-makers have experience of this already owing to the eDRAM set-up on Xbox 360 - and ESRAM is the natural evolution of the same system.

Microsoft says that game performance doesn't scale with the number of compute units you have. We put that theory to the test by comparing a 2GB Radeon 7850 with a 2GB Radeon 7870 XT, both downclocked to 600MHz (to more accurately reflect compute power of the two systems in the initial specs) and with identical memory bandwidth. Across ten tests we found that 50 per cent more compute power actually yielded an average of 24 per cent improvement in game frame-rates.

Memory bandwidth is one thing, but graphics capability is clearly another. PlayStation 4 enjoys a clear advantage in terms of on-board GPU compute units - a raw stat that is beyond doubt, and in turn offers a huge boost to PS4's enviable spec sheet. Andrew Goossen first confirms that both Xbox One and PS4 graphics tech is derived from the same AMD "Island" family before addressing the Microsoft console's apparent GPU deficiency in depth.

"Just like our friends we're based on the Sea Islands family. We've made quite a number of changes in different parts of the areas... The biggest thing in terms of the number of compute units, that's been something that's been very easy to focus on. It's like, hey, let's count up the number of CUs, count up the gigaflops and declare the winner based on that. My take on it is that when you buy a graphics card, do you go by the specs or do you actually run some benchmarks?" he says.

"Firstly though, we don't have any games out. You can't see the games. When you see the games you'll be saying, 'what is the performance difference between them'. The games are the benchmarks. We've had the opportunity with the Xbox One to go and check a lot of our balance. The balance is really key to making good performance on a games console. You don't want one of your bottlenecks being the main bottleneck that slows you down."

Tweaking Xbox One balance and performance

Microsoft's approach was to go into production knowing that there'd be some headroom for increasing performance from the final silicon. Goossen describes it as "under-tweaking" the system. Actual in-production games were then used to determine how to make use of the available headroom.

"Balance is so key to real effective performance. It's been really nice on Xbox One with Nick and his team - the system design folks have built a system where we've had the opportunity to check our balances on the system and make tweaks accordingly," Goossen reveals. "Did we do a good job when we did all of our analysis and simulations a couple of years ago, and guessing where games would be in terms of utilisation. Did we make the right balance decisions back then? And so raising the GPU clock is the result of going in and tweaking our balance."

"We knew we had headroom. We didn't know what we wanted to do with it until we had real titles to test on. How much do you increase the GPU by? How much do you increase the CPU by?" asks Nick Baker.

"We had the headroom. It's a glorious thing to have on a console launch. Normally you're talking about having to downclock," says Goossen. "We had a once in a lifetime opportunity to go and pick the spots where we wanted to improve the performance and it was great to have the launch titles to use as the way to drive an informed decision on performance improvements we could get out of the headroom."

Goossen also reveals that the Xbox One silicon actually contains additional compute units - as we previously speculated. The presence of that redundant hardware (two CUs are disabled on retail consoles) allowed Microsoft to judge the importance of compute power versus clock-speed:

"Every one of the Xbox One dev kits actually has 14 CUs on the silicon. Two of those CUs are reserved for redundancy in manufacturing, but we could go and do the experiment - if we were actually at 14 CUs what kind of performance benefit would we get versus 12? And if we raised the GPU clock what sort of performance advantage would we get? And we actually saw on the launch titles - we looked at a lot of titles in a lot of depth - we found that going to 14 CUs wasn't as effective as the 6.6 per cent clock upgrade that we did."

Assuming linear scaling of compute power with the addition of two extra CUs, the maths may not sound right here, but as our recent analysis - not to mention PC benchmarks - reveals, AMD compute units don't scale in a linear fashion. There's a law of diminishing returns.

"Every one of the Xbox One dev kits actually has 14 CUs on the silicon... And we actually saw on the launch titles... we found that going to 14 CUs wasn't as effective as the 6.6 per cent clock upgrade that we did."

Caption

Attribution

"Everybody knows from the internet that going to 14 CUs should have given us almost 17 per cent more performance," he says, "but in terms of actual measured games - what actually, ultimately counts - is that it was a better engineering decision to raise the clock. There are various bottlenecks you have in the pipeline that can cause you not to get the performance you want if your design is out of balance."

"Increasing the frequency impacts the whole of the GPU whereas adding CUs beefs up shaders and ALU," interjects Nick Baker.

"Right. By fixing the clock, not only do we increase our ALU performance, we also increase our vertex rate, we increase our pixel rate and ironically increase our ESRAM bandwidth," continues Goossen.

"But we also increase the performance in areas surrounding bottlenecks like the drawcalls flowing through the pipeline, the performance of reading GPRs out of the GPR pool, etc. GPUs are giantly complex. There's gazillions of areas in the pipeline that can be your bottleneck in addition to just ALU and fetch performance."

GPU Compute and the importance of the CPU

Goossen also believes that leaked Sony documents on VGLeaks bear out Microsoft's argument:

"Sony was actually agreeing with us. They said that their system was balanced for 14 CUs. They used that term: balance. Balance is so important in terms of your actual efficient design. Their additional four CUs are very beneficial for their additional GPGPU work. We've actually taken a very different tack on that. The experiments we did showed that we had headroom on CUs as well. In terms of balance, we did index more in terms of CUs than needed so we have CU overhead. There is room for our titles to grow over time in terms of CU utilisation."

Microsoft's approach to asynchronous GPU compute is somewhat different to Sony's - something we'll track back on at a later date. But essentially, rather than concentrate extensively on raw compute power, their philosophy is that both CPU and GPU need lower latency access to the same memory. Goossen points to the Exemplar skeletal tracking system on Kinect on Xbox 360 as an example for why they took that direction.

"Exemplar ironically doesn't need much ALU. It's much more about the latency you have in terms of memory fetch, so this is kind of a natural evolution for us," he says. "It's like, OK, it's the memory system which is more important for some particular GPGPU workloads."

The team is also keen to emphasise that the 150MHz boost to CPU clock speed is actually a whole lot more important than many believe it is.

"Interestingly, the biggest source of your frame-rate drops actually comes from the CPU, not the GPU," Goossen reveals. "Adding the margin on the CPU... we actually had titles that were losing frames largely because they were CPU-bound in terms of their core threads. In providing what looks like a very little boost, it's actually a very significant win for us in making sure that we get the steady frame-rates on our console."

This in part explains why several of the custom hardware blocks - the Data Move Engines - are geared towards freeing up CPU time. Profiling revealed that this was a genuine issue, which has been balanced with a combination of the clock speed boost and fixed function silicon - the additional processors built in to the Xbox One processor.

"We've got a lot of CPU offload going on. We've got the SHAPE, the more efficient command processor relative to the standard design, we've got the clock boost - it's in large part actually to ensure that we've got the headroom for the frame-rates," Goossen continues - but it seems that the systems's Data Move Engines can help the GPU too.

"Imagine you've rendered to a depth buffer there in ESRAM. And now you're switching to another depth buffer. You may want to go and pull what is now a texture into DDR so that you can texture out of it later, and you're not doing tons of reads from that texture so it actually makes more sense for it to be in DDR. You can use the Move Engines to move these things asynchronously in concert with the GPU so the GPU isn't spending any time on the move. You've got the DMA engine doing it. Now the GPU can go on and immediately work on the next render target rather than simply move bits around."

Other areas of custom silicon are also designed to help out the graphics performance.

"We've done things on the GPU side as well with our hardware overlays to ensure more consistent frame-rates," Goossen adds. "We have two independent layers we can give to the titles where one can be 3D content, one can be the HUD. We have a higher quality scaler than we had on Xbox 360. What this does is that we actually allow you to change the scaler parameters on a frame-by-frame basis."

Dynamic resolution scaling isn't new - we've seen it implemented on a lot of current-gen titles. Indeed, the first example in the current generation was on a Sony title: WipEout HD. Impact on image quality can be rough at 720p, but at higher resolutions and in concert with superior scaling, it could be a viable performance equalising measure.

"I talked about CPU glitches causing frame glitches... GPU workloads tend to be more coherent frame to frame. There doesn't tend to be big spikes like you get on the CPU and so you can adapt to that," Goossen explains.

"What we're seeing in titles is adopting the notion of dynamic resolution scaling to avoid glitching frame-rate. As they start getting into an area where they're starting to hit on the margin there where they could potentially go over their frame budget, they could start dynamically scaling back on resolution and they can keep their HUD in terms of true resolution and the 3D content is squeezing. Again, from my aspect as a gamer I'd rather have a consistent frame-rate and some squeezing on the number of pixels than have those frame-rate glitches."

"From a power/efficiency standpoint as well, fixed functions are more power-friendly on fixed function units," adds Nick Baker. "We put data compression on there as well, so we have LZ compression/decompression and also motion JPEG decode which helps with Kinect. So there's a lot more to the Data Move Engines than moving from one block of memory to another."

We've been talking in-depth for over an hour and our time draws to a close. The entire discussion has been completely tech-centric, to the point where we'd almost forgotten that the November launch of Xbox One is likely to be hugely significant for Nick Baker and Andrew Goossen personally. How does it feel to see the console begin to roll off the production line after years in development?

"Yeah, getting something out is always, always a great feeling [but] my team work on multiple programs in parallel - we're constantly busy working on the architecture team," says Baker.

Goossen has the final word:

"For me, the biggest reward is to go and play the games and see that they look great and that yeah, this is why we did all that hard work. As a graphics guy it's so rewarding to see those pixels up on the screen."