Since its introduction in the mid-90s, Intel's P6 core microarchitecture has gone from strength to strength. The initial chip to feature this new design was the Pentium Pro, a chip that most will remember as being the first to integrate the L2 (Level 2) cache with the rest of the chip package, making it extremely expensive. Another benefit of the architecture was its performance running 32bit software. The design was optimised for 32bit code, making it far more efficient and significantly faster when executing this type of code, but this came at the expense of 16bit code, which it ran comparatively poorly. The one drawback to all this performance was the simple fact that very little software took advantage of 32bit processing, and while Windows NT did make extensive use of the Pentium Pro's capabilities the mainstream OS, Windows 95, did not. Combined with the cost issue this meant that the Pentium Pro never became a mainstream processor. And so, due to poor 16bit software performance (an issue that was finally becoming less and less important) and high costs, the Pentium II was created, still featuring the core elements of the Pentium Pro's P6 architecture, and even with the later arrival of the Pentium III the core was still based on the original P6. For many years now it has served us well, but never one to stand still, Intel have designed a new core which forms the heart of the Pentium 4.
In a slight break from tradition Intel haven't named their new core architecture numerically, so instead of P7 being the successor to the P6 core we now have the NetBurst architecture. It isn't difficult to see from some of Intel's more recent advertising campaigns that the Internet has become a focus for promoting their chips, and with their 'interesting' claims that Intel CPUs help to enrich the web experience it is easy to see why they came up with the name NetBurst. So how do the P6 and NetBurst designs differ, and how come the Pentium 4 was introduced at an incredible 1.4GHz? To answer both questions we must delve into the very heart of the CPU and take a look at the pipelines that make up the actual processing part of the chip. Chip pipelines are divided into stages where certain operations are carried out, and in conventional x86 style chips there is an order that has to be followed: Fetch, Decode, Execute. These three steps must be carried out to do any actual processing, and each stage of the pipeline handles part of one of the three. The longer the pipeline, the more finely the work is divided, so less happens per clock tick, as each individual pipeline stage requires one clock cycle to complete (and potentially longer depending on the instruction and the status of other parts of the chip). It is therefore easier to increase the clock speed with a longer pipeline, due to the reduced amount of processing that is going on at each stage. Now in the case of the Pentium III the pipeline is 10 stages long, whereas in the Pentium 4 it has been increased to a whopping 20 stages. This quite drastic architectural change has allowed the P4 to be initially clocked at the 1.4GHz level while the Pentium III seems to be stuck at the 1GHz mark.
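The trade-off described above can be put into rough numbers. The sketch below is a toy model rather than a description of either chip: it assumes an ideal rate of one instruction per clock, and that every mispredicted branch costs a full pipeline's worth of cycles. The function name and the branch figures (20% of instructions being branches, 10% of those mispredicted) are illustrative assumptions, not measurements; only the stage counts and clock speeds come from the text.

```python
def effective_mips(clock_mhz, pipeline_stages,
                   branch_fraction=0.20, mispredict_rate=0.10):
    """Toy estimate of millions of instructions completed per second.

    Assumptions (not from Intel): one instruction per clock at best, and a
    mispredicted branch costs `pipeline_stages` cycles to flush and refill.
    """
    cycles_per_instruction = 1.0 + branch_fraction * mispredict_rate * pipeline_stages
    return clock_mhz / cycles_per_instruction


for name, clock_mhz, stages in [
    ("Pentium III @ 1.0GHz", 1000, 10),
    ("Pentium 4   @ 1.0GHz", 1000, 20),  # hypothetical downclocked part
    ("Pentium 4   @ 1.4GHz", 1400, 20),
]:
    print(f"{name}: {effective_mips(clock_mhz, stages):.0f} MIPS (toy model)")
```

Under these assumptions the 20 stage design loses at equal clock speed but pulls ahead once it is clocked 40% higher, which is exactly the bet a longer pipeline makes.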
With this new longer pipeline the P4 is technically slower than a Pentium III at the same clock speed, since each stage does less work and a mispredicted branch leaves more stages to flush and refill, and some initial tests with downclocked P4s and overclocked P3s have borne this out. However, as with all things there are other reasons why the Pentium III is capable of making the P4 look a little lacklustre at times. One of them is the all-important x87 Floating Point Unit (FPU).
Floating maths point?
The FPU became something of a buzzword when comparing the gaming performance of Pentium / Pentium II chips to the equivalents from AMD and Cyrix, as at the time the Intel FPU was by far the most efficient and fastest, while the K6 offering from AMD came up somewhat wanting. With the arrival of the Athlon the tables turned a little in AMD's favour and so FPU performance was no longer such an important issue, as both Intel and AMD CPUs carried extremely powerful units. With the advent of the P4 though it appears that FPU performance has reared its ugly head again. In making the chip it seems Intel has made some cutbacks, and one of these is the x87 FPU. Instead of being a dual super-pipelined monster it has been reduced to a single, less efficient pipeline, which cripples its ability to do x87 floating point maths. Before you all throw your arms up in the air and proclaim Intel's latest offspring useless though, one has to look at why the FPU has been cut back so much…
AMD's solution to the weaker FPU on their K6 chips was 3DNow!, an instruction set extension that was designed to enhance floating point maths performance by applying the same instruction to several data items at once rather than to a single data item at a time, in a similar manner to Intel's under-performing MMX. This 'single instruction multiple data' (SIMD) processing method works extremely well when large data sets need to have the same instructions carried out on them - in the case of 3DNow! it was extremely good at doing geometry transforms for games, something that GPUs now take care of. Intel responded in the Pentium III with SSE, which built on MMX by providing its own dedicated registers for these instructions rather than sharing the existing FPU registers and switching between modes when necessary, thereby making such instructions much faster to use. The new instructions added with SSE work on 128bit registers holding four 32bit floating point values at a time, which in theory would significantly speed up any program needing to perform a lot of repetitive floating point maths. Now with the Pentium 4 Intel has added another 144 instructions to create SSE2, which provides even more processing ability by extending those 128bit registers to 64bit double precision floating point and to integer data. It also offers the prospect of much faster floating point calculations than the old x87 FPU, which is why Intel has cut down on the x87 FPU and is hoping that the market will start to compile software to take advantage of these new instructions.
As a last point, before we take a look at the actual performance of this new behemoth, there have been some changes to the cache architecture on the chip. Level 1 cache has been reduced to a meagre 8KB for data storage (as opposed to 16KB for data and 16KB for instruction caching on the Pentium II / III) and a 12K micro-op instruction cache. The data cache has been reduced to theoretically allow for lower latency, as it can now be accessed in two clock cycles as opposed to the three clock cycles required on the Pentium III, while the micro-op cache is designed to store a potential 12,000 decoded instructions, referred to by Intel as "micro-ops". This provides the potential benefit that instructions can be loaded much faster without the need to decode them again, thereby helping to remove the slow decode phase from the fetch, decode, execute cycle. The level 2 cache has thankfully been left at 256KB, although had there been room on the chip it would have been nice to see more!
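To make the SIMD idea behind 3DNow!, SSE and SSE2 a little more concrete, here is a minimal sketch that simply counts how many add instructions would be issued for a given register width. It is plain Python standing in for the hardware, not real SSE2 code (which would be written with compiler intrinsics or assembly), and the function name `packed_add` and its parameters are made up for illustration.

```python
def packed_add(a, b, register_bits=128, element_bits=32):
    """Add two equal-length lists, one register's worth of elements per step.

    Returns (results, instruction_count). This only models how many
    instructions are issued; it is not real SSE/SSE2 code.
    """
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    lanes = register_bits // element_bits  # elements handled per instruction
    results = []
    instruction_count = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        results.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instruction_count += 1  # one packed add covers the whole chunk
    return results, instruction_count


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.5] * 8

_, scalar_ops = packed_add(xs, ys, register_bits=32)     # one 32bit value per step
_, packed_single = packed_add(xs, ys)                    # four 32bit values per step
_, packed_double = packed_add(xs, ys, element_bits=64)   # two 64bit values per step
print(scalar_ops, packed_single, packed_double)          # prints "8 2 4"
```

The same eight additions take eight steps one value at a time, two steps with four 32bit values packed into a 128bit register, and four steps with two 64bit values per register. The catch is that the software has to be compiled to use the packed instructions in the first place.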
Where's my backup?
The Pentium 4 is a new chip with a new architecture and a new interface. The next obvious question is where is the new chipset? Enter the i850. Intel have abandoned their 'old' North/South bridge design in favour of a new Hub system which is designed to provide more system bandwidth between components, while also offering better connectivity between system devices. The i850 chipset is the latest offering to make use of this 'accelerated hub architecture'. Now while the chips are known as MCHs (Memory Controller Hubs), ICHs (I/O Controller Hubs) and FWHs (FirmWare Hubs), they do essentially work in the same way as the old north/south bridge design. The chipset supports AGP 4x (with fast writes), a quad pumped 100MHz front side bus, a dual channel Rambus memory interface, Ultra ATA/100, 4 USB root hub ports and the ubiquitous PCI interface. As I'm sure you'll agree most of these are common to the everyday chipsets that we know and love, with the exception of the quad pumped front side bus and the dual channel Rambus interface. These two features are what really help the Pentium 4's performance take off. System bandwidth has become a key concern recently, and with AGP 4x requiring 1.06GB/sec, the PCI bus dragging a maximum 132MB/sec and other system overheads, it is plain to see that 100MHz memory interfaces can't cope and 133MHz memory systems are only just able to keep up with the pace.
A Change of Pace
To help alleviate this Intel teamed up with Rambus Inc. to provide the next generation in memory technology. Rambus is technically sound, although the trade-off for its higher transfer rates is greatly increased latency, but it has fallen down due to its high costs and the serious problems that occurred when trying to interface it with the Pentium III. Once these problems had been overcome it became very clear that the Pentium III wasn't actually taking much advantage of the increased bandwidth, and so the high price couldn't be justified by a corresponding performance increase. However, the Pentium 4 is extremely bandwidth hungry due to its increased clock speed and need for data, and so Intel have turned to Rambus once again, but with a subtle difference. The front side bus runs at a nominal 100MHz, but using DDR-like signalling and other advanced techniques they have pushed the effective rate to four times this (similar to AGP 4x). This offers a theoretical 3.2GB/sec transfer rate. A single Rambus channel is currently only capable of transferring 1.6GB/sec, so in order to match this Intel have used a dual channel system where both channels can supply the data bus simultaneously, thereby providing the required 3.2GB/sec (a system first employed with the i840 chipset). This monstrous bandwidth allows the system to take full advantage of the maximum transfer rates of the other peripheral buses, which should seriously enhance the performance of any bandwidth hungry components like hard drives and graphics cards.
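For anyone who wants to check the sums, the peak figures quoted above (in gigabytes per second) fall straight out of clock speed, transfers per clock and bus width. The short sketch below redoes the arithmetic; note that the bus widths (a 64bit front side bus and 16bit RDRAM channels) and the 400MHz double-pumped clock of PC800 RDRAM are assumptions brought in from outside the article, and the function name is purely illustrative.

```python
def peak_bandwidth_gb_s(clock_mhz, transfers_per_clock, bus_width_bits, channels=1):
    """Theoretical peak transfer rate in gigabytes per second (1GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    bytes_per_second = clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer
    return bytes_per_second * channels / 1e9


# Bus widths and the RDRAM clock are assumptions, not figures from the article.
front_side_bus = peak_bandwidth_gb_s(100, 4, 64)              # quad pumped, 64 bits wide
rdram_single = peak_bandwidth_gb_s(400, 2, 16)                # PC800, both clock edges
rdram_dual = peak_bandwidth_gb_s(400, 2, 16, channels=2)      # two channels together

print(f"Front side bus:      {front_side_bus:.1f} GB/sec")
print(f"RDRAM, one channel:  {rdram_single:.1f} GB/sec")
print(f"RDRAM, two channels: {rdram_dual:.1f} GB/sec")
```

These are theoretical peaks only; real transfers will come in lower once latency and protocol overheads are taken into account.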
Looking at the charts and graphs it is easy to see that the picture isn't necessarily what one would expect from the Pentium 4 though. The 3DMark 2000 numbers show that while the Pentium 4 is faster than the Pentium III, it is not really as fast as one would expect from a CPU which is running at nearly twice the clock speed of the venerable P3-800 used.
The Quake 3 numbers certainly show the potential of the Pentium 4 for gaming, as the results are nearly twice those of the Pentium III, and for any games based on the Quake 3 engine it could well be the processor to own. Next up we used SiSoft's SANDRA benchmark. First the Pentium III -
Now, the Pentium 4 -
SiSoft's SANDRA shows the Pentium 4 shining through, but in a very different way - it extols the virtues of Rambus, with memory bandwidth numbers revealing 1.4GB/sec transfer rates, and certainly makes SSE2 look like it could be a great technology, one very much capable of replacing old style x87 instructions. Unfortunately SANDRA also shows that the FPU on the Pentium 4 is quite a poor performer in relative terms, which doesn't bode too well for performance in older non-SSE2 enabled apps (basically everything you can find on the shelves today).
The Pentium 4 is certainly a step forward and most probably one in the right direction too; it's just a shame that it couldn't live up to all of the expectations placed on it. The new SSE2 instruction set promises to be a great addition, and something that Intel seem finally to have got right in terms of features and performance. The trouble is that currently only the Intel C++ compiler supports these features, and so until Microsoft release an SSE2 optimised compiler most software and games will continue to utilise older MMX, SSE and x87 FPU instructions. This will certainly not help the Pentium 4 perform well and will therefore make it look more like an overpriced turkey than the newest chip on the block. Despite these concerns regarding the performance of the Pentium 4 one has to remember that in the original switch from 486 technology to Pentium (P5 core) technology there were also some serious performance issues. But once the compilers had been redesigned to take advantage of the P5 architecture the Pentium really took off, and I think that anyone would have had a hard time calling the Pentium slower than the 486.
Price is another huge concern for the Pentium 4. Currently the only chipset to use is the i850 and it only supports the RDRAM memory interface. Rambus is extremely expensive, and thanks to the dual channel system the chipset requires that this memory is installed in pairs! Salvation should come shortly though, with the potential release of a chipset supporting DDR SDRAM from either Intel or VIA. When this happens the cost of building a Pentium 4 system will fall, potentially making it more attractive to a wider market.
Whatever happens it seems that Intel is pretty much committed to the Pentium 4, and with their bulging marketing muscle they are likely to sell quite a few of the little blighters. I just hope that software starts to take advantage of its features, as I for one can't wait to see what it can really do.