Monday, May 31, 2010


  • The PC construction


    The PC consists of a central unit (referred to as the computer) and various peripherals. The computer is a box, which contains most of the working electronics. It is connected with cables to the peripherals.
    On these pages, I will show you the computer and its components. Here is a picture of the computer:
    Here is a list of the PC components. Read it and ask yourself what the words mean. Do you recognize all these components? They will be covered in the following pages.
    Components in the central unit (the computer):

  • The motherboard: CPU, RAM, cache, ROM chips with BIOS and start-up programs. Chip sets (controllers). Ports, buses and expansion slots.

  • Drives: Hard disk(s), floppy drive(s), CD-ROM, etc.

  • Expansion cards: Graphics card (video adapter), network controller, SCSI controller. Sound card, video and TV card. Internal modem and ISDN card.

    Peripherals:

  • Keyboard and mouse

  • Joystick

  • Monitor

  • Printer

  • Scanner

  • Loudspeakers

  • External drives

  • External tape station

  • External modem

    So, how are the components connected? What are their functions, and how are they tied together to form a PC? That is the subject of Click and Learn. So, please continue reading...

    The von Neumann Model of the PC


    Computers have their roots 300 years back in history. Mathematicians and philosophers like Pascal, Leibniz, Babbage and Boole laid the foundation with their theoretical works. Only in the second half of the 20th century was electronic science sufficiently developed to make practical use of their theories.
    The modern PC has roots that go back to the USA in the 1940s. Among the many scientists, I like to remember John von Neumann (1903-57). He was a mathematician, born in Hungary. We can still use his computer design today. He broke computer hardware down into five primary parts:

  • CPU

  • Input

  • Output

  • Working memory

  • Permanent memory
    Actually, von Neumann was the first to design a computer with a working memory (what we today call RAM). If we apply his model to current PCs, it will look like this:
    All these subjects will be covered.

    Data exchange - the motherboard


    The ROM chips contain instructions, which are specific for that particular motherboard. Those programs and instructions will remain in the PC throughout its life; usually they are not altered.
    Primarily the ROM code holds start-up instructions. In fact there are several different programs inside the start-up instructions, but for most users, they are all woven together. You can differentiate between:

  • POST (Power On Self Test)

  • The Setup instructions, which connect with the CMOS instructions

  • BIOS instructions, which connect with the various hardware peripherals

  • The Boot instructions, which call the operating system (DOS, OS/2, or Windows)
    All these instructions are in ROM chips, and they are activated one by one during start-up. Let us look at each part.


    The suppliers of system software


    All PCs have instructions in ROM chips on the motherboard. The ROM chips are supplied by specialty software manufacturers, who make BIOS chips. The primary suppliers are:

  • Phoenix

  • AMI (American Megatrends)

  • Award
    You can read the name of your BIOS chip during start-up. You can also see the chip on the system board. Here is a picture (slightly blurred) of an Award ROM chip:
    Here is an AMI chip with BIOS and start-up instructions:

  • The CPU’s immediate surroundings

    In this part of this guide, we dug down into the inner workings of the CPU. We will let it rest in peace now, and concentrate on the processor’s immediate surroundings. That is, the RAM and the chipset – or more precisely, the north bridge.
    In the first section of the guide I introduced the chipset, including the north bridge (see, for example, Fig. 46 on page 19), which connects the CPU to the PC’s memory, the RAM.

    The pathway to RAM

    The most important data path on the motherboard runs between the CPU and the RAM. Data is constantly pumped back and forth between the two, and this bus therefore often comes under focus when new generations of CPUs, chipsets and motherboards are released.
    The RAM sends and receives data on a bus, and this work involves a clock frequency. This means that all RAM has a speed, just like a CPU does. Unfortunately, RAM is much slower than the CPU, and the buses on the motherboard have to make allowance for this fact.

    The XT architecture

    In the original PC design (the IBM XT), the CPU, RAM and I/O devices (which we will come to later) were connected on one and the same bus, and everything ran synchronously (at a common speed). The CPU decided which clock frequency the other devices had to work at:
    Fig. 118. In the original PC architecture, there was only one bus with one speed.
    The problem with this system was that the three devices were “locked to each other”; they were forced to work at the lowest common clock frequency. It was a natural architecture in the first PC’s, where the speed was very slow.

    The first division of the bus

    In 1987, Compaq hit on the idea of separating the system bus from the I/O bus, so that the two buses could work at different clock frequencies. By letting the CPU and RAM work on their own bus, independent of the I/O devices, their speeds could be increased.
    In Fig. 119, the CPU and RAM are connected to a common bus, called the system bus, where in reality the CPU’s clock frequency determines the working speed. Thus the RAM has the same speed as the CPU; for example, 12, 16 or 25 MHz.
    Fig. 119. With this architecture, the I/O bus is separate from the system bus (80386).
    The I/O devices (graphics card, hard disk, etc.) were separated from the system bus and placed on a separate low speed bus. This was because they couldn’t keep up with the clock frequencies of the new CPU versions.
    The connection between the two buses is managed by a controller, which functions as a “bridge” between the two paths. This was the forerunner of the multibus architecture which all motherboards use today.

    Clock doubling

    With the introduction of the 80486, the CPU clock frequency could be increased so much that the RAM could no longer keep up. Intel therefore began to use clock doubling in the 80486 processor.
    The RAM available at the time couldn’t keep up with the 66 MHz speed at which an 80486 could work. The solution was to give the CPU two working speeds.

  • An external clock frequency

  • An internal clock frequency
    Inside the processor, the clock frequency of the system bus is multiplied by a factor of 2, doubling the working speed.
    Fig. 120. The bus system for an 80486 processor.
    But this system places heavy demands on the RAM, because when the CPU internally processes twice as much data, it of course has to be “fed” more often. The problem is that the RAM only works half as fast as the CPU.
    For precisely this reason, the 486 was given a built-in L1 cache, to reduce the imbalance between the slow RAM and the fast processor. The cache doesn’t improve the bandwidth (the RAM doesn’t work any faster), but it ensures greater efficiency in the transfer of data to the CPU, so that it gets the right data supplied at the right time.
    Clock doubling made it possible for Intel to develop processors with higher and higher clock frequencies. At the time the Pentium was introduced, new RAM modules became available, and the system bus was increased to 66 MHz. In the case of the Pentium II and III, the system bus was increased to 100 and 133 MHz, with the internal clock frequency set to a multiple of these.
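    The arithmetic behind clock doubling can be sketched in a few lines of Python. This is only an illustration: the function name is my own, and the factor of 5 in the second example is an assumed multiplier, not a figure from this guide. The first example is the 80486 case, where a 33 MHz bus doubled gives the 66 MHz mentioned above.

```python
def internal_clock_mhz(external_mhz: int, multiplier: int) -> int:
    """Internal CPU clock = external (system bus) clock x multiplier."""
    return external_mhz * multiplier


# 80486 with clock doubling: 33 MHz system bus, factor 2
print(internal_clock_mhz(33, 2))    # 66

# Hypothetical later CPU on a 100 MHz system bus with an assumed factor of 5
print(internal_clock_mhz(100, 5))   # 500
```

    The RAM only ever sees the external frequency; the multiplier affects the work done inside the processor.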
    Fig. 121. The bus system for a Pentium III processor.

  • Chapter 15. Evolution of the Pentium 4
    As was mentioned earlier, the older P6 architecture was released back in 1995. Up to 2002, the Pentium III processors were sold alongside the Pentium 4. That means, in practice, that Intel’s sixth CPU generation has lasted 7 years.
    Similarly, we may expect this seventh generation Pentium 4 to dominate the market for a number of years. The processors may still be called Pentium 4, but they come in a lot of varieties.
    A major modification comes with the version using 0.065 micron process technology. It will allow higher clock frequencies, but there will also be a number of other improvements.
    Hyper-Threading Technology is a very exciting structure, which can be briefly outlined as follows: In order to exploit the powerful pipeline in the Pentium 4, it has been permitted to process two threads at the same time. Threads are series of software instructions. Normal processors can only process one thread at a time.
    In servers, where several processors are installed in the same motherboard (MP systems), several threads can be processed at the same time. However, this requires that the programs be set up to exploit the MP system, as discussed on page 31.
    The new thing is that a single Pentium 4 can logically function as if there were physically two processors in the PC. The processor core (with its long pipelines) is simply so powerful that it can, in many cases, act as two processors. It’s a bit like one person being able to carry on two independent telephone conversations at the same time.
    Fig. 110. The Pentium 4 is ready for MP functions.
    Hyper-Threading works very well in Intel’s Prescott versions of the Pentium 4. You gain performance when you run more than one task at a time. If you have two programs working simultaneously, both putting heavy pressure on the CPU, you will benefit from this technology. But you need an MP-compatible operating system (like Windows XP Professional) to benefit from it.
    The next step in this evolution is the production of dual-core processors. AMD produces Opteron chips which hold two processors in one chip. Intel is working on dual-core versions of the Pentium 4 (with the codename “Smithfield”). These chips will find use in servers and high-performance PCs. A dual-core Pentium 4 with Hyper-Threading enabled will in fact operate as a virtual quad-core processor.
    Fig. 111. A dual-core processor with Hyper-Threading operates as a virtual quad-processor.
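    The counting behind the “virtual quad-core” description is simple enough to write down. The sketch below is an illustration only; the function name is my own. The last line asks Python how many logical processors the operating system reports on the machine running the script, which will of course vary from PC to PC.

```python
import os


def logical_processors(cores: int, threads_per_core: int) -> int:
    """Number of processors the operating system sees."""
    return cores * threads_per_core


print(logical_processors(1, 2))  # one core with Hyper-Threading: 2
print(logical_processors(2, 2))  # dual core with Hyper-Threading: 4

# What the operating system reports on this particular machine
print(os.cpu_count())
```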
    Intel also produces EE-versions of the Pentium 4. EE is for Extreme Edition, and these processors are extremely speedy versions carrying 2 MB of L2 cache. 
    In late 2004 Intel changed the socket design of the Pentium 4. The new processors have no “pins”; they connect directly to the socket using little contacts in the processor surface.
    Fig. 112. The LGA 775 socket for Pentium 4.

    Athlon

    The last processor I will discuss is the popular Athlon and Athlon 64 processor series (or K7 and K8).
    It was a big effort on the part of the relatively small manufacturer, AMD, when they challenged the giant Intel with a completely new processor design.
    The first models were released in 1999, at a time when Intel was the completely dominant supplier of PC processors. AMD set their sights high – they wanted to make a better processor than the Pentium II, and yet cheaper at the same time. There was a fierce battle between AMD and Intel between 1999 and 2001, and one would have to say that AMD was the victor. They certainly took a large part of the market from Intel.
    The original 1999 Athlon was very powerfully equipped with pipelines and computing units:

  •  Three instruction decoders which translated X86 program CISC instructions into the more efficient RISC instructions (ROP’s) – 9 of which could be executed at the same time.

  •  Could handle up to 72 instructions (ROP out of order) at the same time (the Pentium III could manage 40, the K6-2 only 24).

  •  Very strong FPU performance, with three simultaneous instructions.
    All in all, the Athlon was in a class above the Pentium II and III in those years. Since Athlon processors were sold at competitive prices, they were incredibly successful. AMD also launched the Duron line of processors, as the counterpart to Intel’s Celeron, and was just as successful with it.
    Fig. 113. Athlon was a huge success for AMD. During 2001-2002, the Athlon XP was in strong competition with the Pentium 4.
     

    Athlon XP versus Pentium 4

    The Athlon processor came in various versions. It started as a Slot A module (see Fig. 107 on page 42). It was then moved to Socket A, when the L2 cache was integrated.
    In 2001, a new Athlon XP version was released, which included improvements like a new Hardware Auto Data Prefetch Unit and a bigger Translation Look-aside Buffer. The Athlon XP was much less advanced than the Pentium 4 but quite superior at clock frequencies below 2000 MHz. A 1667 MHz version of the Athlon XP was sold as 2000+. This rating indicates that the processor performs at least as well as a 2000 MHz Pentium 4.
    Later we saw Athlons in other versions. The latest was based on a new core called “Barton”. It was introduced in 2003 with an L2 cache of 512 KB. AMD tried to sell the 2166 MHz version under the brand 3000+. It did not work. A Pentium 4 running at 3000 MHz had no problems outperforming the Athlon.
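    To see how much AMD stretched the model ratings, one can divide the rating by the real clock frequency. The small Python sketch below does this for the two models mentioned above; the function name is my own, and the ratio is just an illustration, not an official AMD measure.

```python
def rating_ratio(model_rating: int, clock_mhz: int) -> float:
    """How many 'rating points' AMD claimed per real MHz."""
    return model_rating / clock_mhz


models = [
    ("Athlon XP 2000+", 2000, 1667),
    ("Athlon XP 3000+ (Barton)", 3000, 2166),
]

for name, rating, clock in models:
    print(f"{name}: {rating_ratio(rating, clock):.2f} rating points per MHz")
# Athlon XP 2000+: 1.20 rating points per MHz
# Athlon XP 3000+ (Barton): 1.39 rating points per MHz
```

    The 3000+ claims considerably more per clock tick than the 2000+ did, which fits the observation that the rating no longer held up against a real 3000 MHz Pentium 4.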

    Opteron/ Athlon64

    AMD’s 8th generation CPU was released in 2003. It is based on a completely new core called Hammer.
    The new series of 64-bit processors is called Athlon 64, Athlon 64 FX and Opteron. These CPUs have a new design in two areas:

  •  The memory controller is integrated in the CPU. Traditionally this function has been housed in the north bridge, but now it is placed inside the processor.

  •  AMD introduces a completely new 64-bit set of instructions.
    Moving the memory controller into the CPU is a great innovation. It gives a much more efficient communication between CPU and RAM (which has to be ECC DDR SDRAM – 72 bit modules with error correction).
    Every time the CPU has to fetch data from normal RAM, it has to first send a request to the chipset’s controller. It has to then wait for the controller to fetch the desired data – and that can take a long time, resulting in wasted clock ticks and reduced CPU efficiency. By building the memory controller directly into the CPU, this waste is reduced. The CPU is given much more direct access to RAM. And that should reduce latency time and increase the effective bandwidth.
    The Athlon 64 processors are designed for 64-bit applications. These should be more powerful than the existing 32-bit software. We will probably see plenty of new 64-bit software in the future, since Intel is releasing 64-bit processors compatible with the Athlon 64 series.
    Fig. 114. In the Athlon 64 the memory controller is located inside the processor. Hence, the RAM modules interface directly with the CPU.
    Overall, the Athlon 64 is an updated Athlon processor with integrated north bridge and 64-bit instructions. Other new features are:

  •  Support for SSE2 instructions and 16 registers for this.

  •  Dual channel interface to DDR RAM giving a 128 bit memory bus, although the discount version Athlon 64 keeps the 64 bit bus.

  •  Communication to and from the south bridge via a new HyperTransport bus, operating with high-speed serial transfer.

  •  New sockets of 754 and 940 pins.
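    The difference between the 64-bit and the 128-bit memory bus can be put into numbers with a rough peak-bandwidth calculation. This is a sketch under a stated assumption: I use DDR400 modules (400 million transfers per second), a figure which is not given in this guide. Real throughput will be lower than these peak values.

```python
DDR_MEGA_TRANSFERS = 400  # assumed DDR400 modules; not specified in the text


def peak_bandwidth_mb_s(bus_width_bits: int, mega_transfers_per_s: int) -> int:
    """Peak bandwidth in MB/s = bytes per transfer x million transfers per second."""
    return (bus_width_bits // 8) * mega_transfers_per_s


print(peak_bandwidth_mb_s(64, DDR_MEGA_TRANSFERS))   # 3200 (single channel)
print(peak_bandwidth_mb_s(128, DDR_MEGA_TRANSFERS))  # 6400 (dual channel)
```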

    A complete line of chips

    AMD expects to use the K8 core in all types of processors:

    The Opteron is the most expensive and advanced version, to be used in multi-processor servers. The models are called 200, 400 and 800, and they use 2, 4 or 8 CPUs on the same motherboard – without use of a north bridge.
    All processors share a common memory of up to 64 GB. Each Opteron has three HyperTransport I/O channels, each of which can move 6.4 GB/second.
    The Athlon 64 FX is an Opteron to be used in single-processor configurations, high-end PCs and workstations. There is a dual RAM interface, but only one HyperTransport link.
    The Athlon 64 is the discount version with reduced performance and lower prices. It has only a 64-bit RAM interface and a smaller L2 cache.
    Fig. 115. Three versions of the latest AMD processor.

    Historical overview

    I will close off this review with a graphical summary of a number of different CPU’s from the last 25 years. The division into generations is not always crystal clear, but I have tried to present things in a straightforward and reasonably accurate way:
    Fig. 116. There are scores of different processors. A selection of them is shown here, divided into generations.
    But what is the most powerful CPU in the world? IBM’s Power4 must be a strong contender. It is a monster made up of 8 integrated 64-bit processor cores. It has to be installed in a 5,200 pin socket, uses 500 watts of power (there are 680 million transistors), and connects to a 32 MB L3 cache, which it controls itself. Good night to Pentium.

  • Chapter 14. Examples of CPU’s

    In this chapter I will briefly describe the important CPU’s which have been on the market, starting from the PC’s early childhood and up until today.
    One could argue that the obsolete and discontinued models no longer have any practical significance. This is true to some extent; but the old processors form part of the “family tree”, and there are still legacies from their architectures in our modern CPU’s, because the development has been evolutionary. Each new processor extended and built “on top of” an existing architecture.
    Fig.  98. The evolutionary development spirals ever outwards.
    There is therefore value (one way or another) in knowing about the development from one generation of CPU’s to the next. If nothing else, it may give us a feeling for what we can expect from the future.

    16 bits – the 8086, 8088 and 80286

    The first PC’s were 16-bit machines. This meant that they could basically only work with text. They were tied to DOS, and could normally only manage one program at a time.
    But the original 8086 processor was still “too good” to be used in standard office PC’s. The Intel 8088 discount model was therefore introduced, in which the bus between the CPU and RAM was halved in width (to 8 bits), making production of the motherboard much cheaper. 8088 machines typically had 256 KB, 512 KB or 1 MB of RAM. But that was adequate for the programs at the time.
    The Intel 80286 (introduced in 1982, and used in PCs from 1984) was the first step towards faster and more powerful CPU’s. The 286 was much more efficient; it simply performed much more work per clock tick than the 8086/8088 did. Another new feature was protected mode (16-bit on the 286, extended to 32 bits with the later 80386) – a new way of working which made the processor much more capable than under real mode, which the 8086/8088 processor forced programs to work in:

  • Access to all system memory – even beyond the 1MB limit which applied to real mode.

  • Access to multitasking, which means that the operating system can run several programs at the same time.

  • The possibility of virtual memory, which means that the hard disk can be used to emulate extra RAM, when necessary, via a swap file.

  • 32 bit access to RAM and 32 bit drivers for I/O devices (from the 80386 onwards).
    Protected mode paved the way for the change from DOS to Windows, which only came in the 1990’s.
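    The 1 MB limit of real mode follows from the way the 8086 builds its addresses. The sketch below is general 8086 background rather than something this guide spells out: a 16-bit segment value is multiplied by 16 and added to a 16-bit offset, and the result is cut off at 20 address lines. The function and constant names are my own.

```python
ADDRESS_LINES_8086 = 20  # the 8086/8088 has 20 address lines


def real_mode_address(segment: int, offset: int) -> int:
    """Physical address on an 8086: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & ((1 << ADDRESS_LINES_8086) - 1)


print(2 ** ADDRESS_LINES_8086)                 # 1048576 bytes = 1 MB
print(hex(real_mode_address(0x1234, 0x0010)))  # 0x12350
print(hex(real_mode_address(0xFFFF, 0x0010)))  # 0x0 (wraps around at 1 MB)
```

    The last line shows why real mode programs cannot reach beyond 1 MB: the address simply wraps back to zero.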
    Fig.  99. Bottom: an Intel 8086, the first 16-bit processor. Top: the incredibly popular 8-bit processor, the Zilog Z80, which the 8086 and its successors outcompeted.

    32 bits – the 80386 and 486

    The Intel 80386 was the first 32-bit CPU. The 386 has 32-bit long registers and a 32-bit data bus, both internally and externally. But for a traditional DOS based PC, it didn’t bring about any great revolution. A good 286 ran nearly as fast as the first 386’s – under DOS anyway, since DOS doesn’t exploit the 32-bit architecture.
    The 80386SX became the most popular chip – a discount edition of the 386DX. The SX had a 16-bit external data bus (as opposed to the DX’s 32-bit bus), and that made it possible to build cheap PC’s.
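    The price of a narrower external data bus is that each value needs more trips across it. The following Python sketch only counts bus transfers; it ignores caches, wait states and everything else that affects real performance, and the function name is my own.

```python
import math


def bus_transfers(value_bits: int, bus_width_bits: int) -> int:
    """How many bus transfers are needed to move one value of the given size."""
    return math.ceil(value_bits / bus_width_bits)


print(bus_transfers(32, 32))  # 386DX, 32-bit value over a 32-bit bus: 1
print(bus_transfers(32, 16))  # 386SX, 32-bit value over a 16-bit bus: 2
print(bus_transfers(16, 8))   # 8088, 16-bit value over an 8-bit bus: 2
```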
    Fig.  100. Discount prices in October 1990 – but only with a b/w monitor.

    The fourth generation

    The fourth generation of Intel’s CPU’s was called the 80486. It featured a better implementation of the x86 instructions – which executed faster, in a more RISC-like manner. The 486 was also the first CPU with built-in L1 cache. The result was that the 486 worked roughly twice as fast as its predecessor – for the same clock frequency.
    With the 80486 we gained a built-in FPU. Then Intel did a marketing trick of the type we would be better off without. In order to be able to market a cheap edition of the 486, they hit on the idea of disabling the FPU function in some of the chips. These were then sold under the name, 80486SX. It was ridiculous – the processors had a built-in FPU; it had just been switched off in order to be able to segment the market.
    Fig.  101. Two 486’s from two different manufacturers.
    But the 486 was a good processor, and it had a long life under DOS, Windows 3.11 and Windows 95. New editions were released with higher clock frequencies, as they hit on the idea of doubling the internal clock frequency in relation to the external (see the discussion later in the guide). These double-clocked processors were given the name, 80486DX2.
    A very popular model in this series had an external clock frequency of 33 MHz (in relation to RAM), while working at 66MHz internally. This principle (double-clocking) has been employed in one way or another in all later generations of CPU’s. AMD, IBM, Texas Instruments and Cyrix also produced a number of 80486 compatible CPU’s.

    Pentium

    In 1993 came the big change to a new architecture. Intel’s Pentium was the first fifth-generation CPU. As with the earlier jumps to the next generation, the first versions weren’t especially fast. This was particularly true of the very first Pentium 60 MHz, which ran on 5 volts. They got burning hot – people said you could fry an egg on them. But the Pentium quickly benefited from new process technology, and by using clock doubling, the clock frequencies soon skyrocketed.
    Basically, the major innovation was a superscalar architecture. This meant that the Pentium could process several instructions at the same time (using several pipelines). At the same time, the RAM bus width was increased from 32 to 64 bits.
    Fig.  102. The Pentium processor could be viewed as two 80486’s built into one chip.
    Throughout the 1990’s, AMD gained attention with its K5 and K6 processors, which were basically cheap (and fairly poor) copies of the Pentium. It wasn’t until the K6-2 (which included the very successful 3DNow! extensions), that AMD showed the signs of independence which have since led to excellent processors like the AthlonXP.
    Fig.  103. One of the earlier AMD processors. Today you’d hesitate to trust it to run a coffee machine…
    In 1997, the Pentium MMX followed (with the model name P55), introducing the MMX instructions already mentioned. At the same time, the L1 cache was doubled and the clock frequency was raised.
    Fig.  104. The Pentium MMX. On the left, the die can be seen in the middle.

    Pentium II with new cache

    After the Pentium came the Pentium II. But Intel had already launched the Pentium Pro in 1995, which was the first CPU in the 6th generation. The Pentium Pro was primarily used in servers, but its architecture was re-used in the popular Pentium II, Celeron and Pentium III models, during 1997-2001.
    The Pentium II initially represented a technological step backwards. The Pentium Pro used an integrated L2 cache. That was very advanced at the time, but Intel chose to place the cache outside the actual Pentium II chip, to make production cheaper.
    Fig.  105. L2 cache running at half CPU speed in the Pentium II.
    The Level 2 cache was placed beside the CPU on a circuit board, an SEC module (e.g. see Fig. 71, on page 28).  The module was installed in a long Slot 1 socket on the motherboard. Fig. 106 shows the module with a cooling element attached.  The CPU is sitting in the middle (under the fan). The L2 cache is in two chips, one on each side of the processor.
    Fig.  106. Pentium II processor module mounted on its edge in the motherboard’s Slot 1 socket (1997-1998).
    The disadvantage of this system was that the L2 cache became markedly slower than it would have been if it was integrated into the CPU. The L2 cache typically ran at half the CPU’s clock frequency. AMD used the same system in their first Athlons. For these the socket was called, Slot A (see Fig. 107).
    At some point, Intel decided to launch a discount edition of the Pentium II – the Celeron processor. In the early versions, the L2 cache was simply scrapped from the module. That led to quite poor performance, but provided an opportunity for overclocking.
    Overclocking means pushing a CPU to work at a higher frequency than it is designed to work at. It was a very popular sport, especially early on, and the results were good.
    Fig.  107. One of the first AMD Athlon processors, mounted in a Slot A socket. See the large cooling element.
    One of the problems of overclocking a Pentium II was that the cache chips couldn’t keep up with the high speeds. Since these Celerons didn’t have any L2 cache, they could be seriously overclocked (with the right cooling).
    Fig.  108. Extreme CPU cooling using a complete refrigerator built into the PC cabinet. With equipment like this, CPU’s can be pushed up to very high clock frequencies (See Kryotech.com and Asetek.com).
    Intel later decided to integrate the L2 cache into the processor. That happened in new versions of the Celeron in 1998 and new versions of the Pentium III in 1999. The socket design was also changed so that the processors could be mounted directly on the motherboard, in a socket called socket 370. Similarly, AMD introduced their socket A.

    Pentium 4 – long in the pipe

    The Pentium III was really just (yet) another edition of the Pentium II, which again was a new version of the Pentium Pro. All three processors built upon the same core architecture (Intel P6).
    It wasn’t until the Pentium 4 came along that we got a completely new processor from Intel. The core (P7) had a completely different design:

  • The L1 cache contained decoded instructions.

  • The pipeline had been doubled to 20 stages (in later versions increased to 31 stages).

  • The integer calculation units (ALU’s) had been double-clocked so that they can perform two micro operations per clock tick.

  • Furthermore, the system bus, which connects the CPU to the north bridge, had been quad-pumped, so that it transfers four data packets per clock tick. That is equivalent to 4 x 100 MHz and 4 x 133 MHz in the earliest versions of the Pentium 4. In later versions the bus was pumped up to 4 x 200 MHz, and an update with 4 x 266 MHz is scheduled for 2005.

  • The processor was Hyper Threading-enabled, meaning that it under certain circumstances may operate as two individual CPUs.
    All of these factors are described elsewhere in the guide. The important thing to understand, is that the Pentium 4 represents a completely new processor architecture.
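    The quad-pumped bus in the list above lends itself to a quick calculation. The sketch below multiplies the base clock by four and, assuming a 64-bit wide bus (a width which is not stated in this guide), turns that into a rough peak bandwidth. The names are my own, and the results are theoretical peak values only.

```python
BUS_WIDTH_BITS = 64  # assumed bus width; not stated in the text


def effective_mhz(base_clock_mhz: int, transfers_per_tick: int = 4) -> int:
    """Effective transfer rate of a bus moving several packets per clock tick."""
    return base_clock_mhz * transfers_per_tick


def peak_mb_per_s(base_clock_mhz: int) -> int:
    """Rough peak bandwidth in MB/s for the assumed bus width."""
    return effective_mhz(base_clock_mhz) * BUS_WIDTH_BITS // 8


for base in (100, 133, 200, 266):
    print(f"4 x {base} MHz = {effective_mhz(base)} MHz effective, "
          f"about {peak_mb_per_s(base)} MB/s peak")
```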
    Fig.  109. The four big changes seen in the Pentium 4.

  • 3D graphics

    Much of the development in CPU’s has been driven by 3D games. These formidable games (like Quake and others) place incredible demands on CPU’s in terms of computing power.  When these programs draw people and landscapes which can change in 3-dimensional space, the shapes are constructed from tiny polygons (normally triangles or rectangles).
    Fig.  95. The images in popular games like Sims are constructed from hundreds of polygons.
    A character in a PC game might be built using 1500 such polygons. Each time the picture changes, these polygons have to be drawn again in a new position. That means that every corner (vertex) of every polygon has to be re-calculated.
    In order to calculate the positions of the polygons, floating point numbers have to be used (integer calculations are not nearly accurate enough). These numbers are called single-precision floating points and are 32 bits long. There are also 64-bit numbers, called double-precision floating points, which can be used for even more demanding calculations.
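    The two sizes are easy to inspect with Python’s standard struct module. The sketch below is an illustration with a helper name of my own: it shows the width of each format and what happens to the value 0.1 when it is squeezed into 32 bits and read back.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a number in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


print(struct.calcsize("<f") * 8)   # 32 bits, single precision
print(struct.calcsize("<d") * 8)   # 64 bits, double precision

print(round_trip(0.1, "<f") == 0.1)  # False: some precision is lost in 32 bits
print(round_trip(0.1, "<d") == 0.1)  # True: Python floats are already 64-bit
```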
    When the shapes in a 3D landscape move, a so-called matrix multiplication has to be done to calculate the new vertices. For just one shape, made up of, say, 1000 polygons, up to 84,000 multiplications have to be performed on pairs of 32-bit floating point numbers. And this has to happen for each new position the shape has to occupy. There might be 75 new positions per second. This is quite heavy computation, which the traditional PC is not very good at. The national treasury’s biggest spreadsheet is child’s play compared to a game like Quake, in terms of the computing power required.
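    A minimal sketch of such a matrix multiplication is shown below, in plain Python. It moves one triangle using a 4 x 4 matrix and then counts multiplications for a whole shape. The matrix, the triangle and the function names are my own illustrative choices. Note that this plain transform gives 48,000 multiplications for 1000 triangles; the figure of up to 84,000 quoted above presumably includes further work per corner which is not itemised here.

```python
def transform(vertex, matrix):
    """Multiply a 4x4 matrix by a corner given as (x, y, z, 1)."""
    return tuple(
        sum(matrix[row][col] * vertex[col] for col in range(4))
        for row in range(4)
    )


# Illustrative matrix: move everything by (2, 0, -1)
MOVE = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0],
    [0.0, 0.0, 0.0, 1.0],
]

triangle = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0)]
moved = [transform(corner, MOVE) for corner in triangle]
print(moved[0])  # (2.0, 0.0, -1.0, 1.0)

MULTIPLICATIONS_PER_CORNER = 16  # 4 rows x 4 columns


def multiplications(polygons: int, corners_per_polygon: int = 3) -> int:
    """Multiplications needed to transform every corner of every polygon once."""
    return polygons * corners_per_polygon * MULTIPLICATIONS_PER_CORNER


print(multiplications(1000))       # 48000 per new position
print(multiplications(1000) * 75)  # 3600000 per second at 75 positions per second
```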
    The CPU can be left gasping for breath when it has to work with 3D movements across the screen. What can we do to help it? There are several options:

  • Generally faster CPUs. The higher the clock frequency, the faster the traditional FPU performance will become.

  • Improvements to the CPU’s FPU, using more pipelines and other forms of acceleration. We see this in each new generation of CPU’s.

  • New instructions for more efficient 3D calculations.
    We have seen that clock frequencies are constantly increasing in the new generations of CPU. But the FPU’s themselves have also been greatly enhanced in the latest generations of CPU’s. The Athlon, especially, is far more powerfully equipped in this area compared to its predecessors.
    The last method has also been shown to be very effective. CPU’s have simply been given new registers and new instructions which programmers can use.

    MMX instructions

    The first initiative was called MMX (multimedia extension), and came out with the Pentium MMX processor in 1997. The processor had built-in “MMX instructions” and “MMX registers”.
    The previous editions of the Pentium (like the other 32 bit processors) had two types of register: One for 32-bit integers, and one for 80-bit floating point numbers. With MMX we saw the introduction of a special 64-bit integer register which works in league with the new MMX instructions. The idea was (and is) that multimedia programs should exploit the MMX instructions. Programs have to be “written for” MMX, in order to utilise the new system.
    MMX is an extension to the existing instruction set (IA32). There are 57 new instructions which MMX compatible processors understand, and which require new programs in order to be exploited.
    Many programs were rewritten to work both with and without MMX (see Fig 96). Thus these programs could continue to run on older processors, without MMX, where they just ran slower.
    MMX was a limited success. There is a weakness in the design in that programs either work with MMX, or with the FPU, and not both at the same time – as the two instruction sets share the same registers. But MMX laid the foundation for other multimedia extensions which have been much more effective.
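    The idea of packing several small integers into one 64-bit register can be imitated in Python. The sketch below is a simulation only, with names of my own, and it does not use any real MMX instructions: it packs eight 8-bit values into one 64-bit number and adds two such numbers value by value, with each value wrapping around at 8 bits.

```python
LANES = 8          # eight values per 64-bit register
LANE_BITS = 8      # each value is 8 bits wide
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack eight small integers into one 64-bit number."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit number back into its eight 8-bit values."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Add value by value; each result wraps at 8 bits without disturbing its neighbours."""
    return pack([x + y for x, y in zip(unpack(a), unpack(b))])


a = pack([1, 2, 3, 4, 5, 6, 7, 250])
b = pack([10, 10, 10, 10, 10, 10, 10, 10])
print(unpack(packed_add(a, b)))  # [11, 12, 13, 14, 15, 16, 17, 4]
```

    In the real processor the eight additions happen in a single instruction; here the loop only imitates the result.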
    Fig  96. This drawing program (Painter) supports MMX, as do all modern programs.

    3DNow!

    In the summer of 1998, AMD introduced a collection of CPU instructions which improved 3D processing. These were 21 new SIMD (Single Instruction Multiple Data) instructions. The new instructions could process several chunks of data with one instruction. The new instructions were marketed under the name, 3DNow!. They particularly improved the processing of the 32-bit floating point numbers used so extensively in 3D games.
    Fig.  97. 3DNow! became the successor to MMX.
    3DNow! was a big success. The instructions were quickly integrated into Windows, into various games (and other programs) and into hardware manufacturers’ driver programs.

    SSE

    After AMD’s success with 3DNow!, Intel had to come back with something else. Their answer, in January 1999, was SSE (Streaming SIMD Extensions), which are another way to improve 3D performance. SSE was introduced with the Pentium III.
    In principle, SSE is significantly more powerful than 3DNow! The following changes were made in the CPU:

  • 8 new 128-bit registers, which can contain four 32-bit numbers at a time.

  • 50 new SIMD instructions which make it possible to do advanced calculations on several floating point numbers with just one instruction.

  • 12 New Media Instructions, designed, for example, for the encoding and decoding of MPEG-2 video streams (in DVD).

  • 8 new Streaming Memory instructions to improve the interaction between L2 cache and RAM.
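    To make the idea of a 128-bit register holding four 32-bit numbers a little more concrete, here is a minimal sketch in Python. It only imitates the data layout and the "one instruction, several numbers" principle; the function names and the values are my own illustration and have nothing to do with how the real SSE instructions are encoded.

```python
import struct

def pack_register(values):
    """Pack four numbers into 16 bytes, like one 128-bit register
    holding four 32-bit floating point numbers."""
    if len(values) != 4:
        raise ValueError("a 128-bit register holds exactly four 32-bit numbers")
    return struct.pack("<4f", *values)

def simd_add(reg_a, reg_b):
    """Add all four lanes of two packed registers in a single call.
    This stands in for one SIMD instruction (illustrative only)."""
    a = struct.unpack("<4f", reg_a)
    b = struct.unpack("<4f", reg_b)
    return struct.pack("<4f", *(x + y for x, y in zip(a, b)))

reg_a = pack_register([1.0, 2.0, 3.0, 4.0])
reg_b = pack_register([0.5, 0.5, 0.5, 0.5])
result = struct.unpack("<4f", simd_add(reg_a, reg_b))

print(len(reg_a) * 8)  # width of the packed register in bits
print(result)          # the four sums, produced by one call
```

    Without SIMD, the same work would take four separate additions, one for each pair of numbers.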
    SSE also quickly became a success. Programs like Photoshop were released in new SSE optimised versions, and the results were convincing. Very processor-intensive programs involving sound, images and video, and in the whole area of multimedia, run much more smoothly when using SSE.
    Since SSE was such a clear success, AMD took on board the technology. A large part of SSE was built into the AthlonXP and Duron processors. This was very good for software developers (and hence for us users), since all software can continue to be developed for one instruction set, common to both AMD and Intel.

    SSE2 and SSE3

    With the Pentium 4, SSE was extended to use even more powerful techniques. SSE2 contains 144 new instructions, including 128-bit SIMD integer operations and 128-bit SIMD double-precision floating-point operations.
    SSE2 can reduce the number of instructions which have to be executed by the CPU in order to perform a certain task, and can thus increase the efficiency of the processor. Intel mentions video, speech recognition, image/photo processing, encryption and financial/scientific programs as the areas which will benefit greatly from SSE2. But as with MMX, 3DNow! and SSE, the programs have to be rewritten before the new instructions can be exploited.
    SSE2 was adopted by the competition, AMD, in the Athlon 64 processors. Here AMD even doubled the number of SSE2 registers compared to the Pentium 4. Most recently, Intel has introduced 13 new instructions in SSE3, which it uses in the Prescott version of the Pentium 4.
    We are now going to leave the discussion of instructions. I hope this examination has given you some insight into the CPU’s work of executing programs.


  • Chapter 12. Data and instructions

    Now it’s time to look more closely at the work of the CPU. After all, what does it actually do?

    Instructions and data

    Our CPU processes instructions and data. It receives orders from the software. The CPU is fed a gentle stream of binary data via the RAM.
    These instructions can also be called program code. They include the commands which you constantly – via user programs – send to your PC using your keyboard and mouse. Commands to print, save, open, etc.
    Data is typically user data. Think about that email you are writing. The actual contents (the text, the letters) are user data. But when you and your software say “send”, you are sending program code (instructions) to the processor:
    Fig.  80. The instructions process the user data.

    Instructions and compatibility

    Instructions are binary code which the CPU can understand. Binary code (machine code) is the mechanism by which PC programs communicate with the processor.
    All processors, whether they are in PC’s or other types of computers, work with a particular instruction set. These instructions are the language that the CPU understands, and thus all programs have to communicate using these instructions. Here is a simplified example of some “machine code”, shown in assembler notation, which is the readable form of the instructions the processor understands:
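    proc near       ; start of a procedure (its name is left out in this simplified listing)
    mov AX,01       ; put the value 1 in register AX
    mov BX,01       ; put the value 1 in register BX
    inc AX          ; increase AX by 1, so AX is now 2
    add BX,AX       ; add AX to BX, so BX is now 3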
    You can no doubt see that it wouldn’t be much fun to have to use these kinds of instructions in order to write a program. That is why people use programming tools. Programs are written in a programming language (like Visual Basic or C++). But these program lines have to be translated into machine code (they have to be compiled) before they can run on a PC. The compiled program file contains instructions which can be understood by the particular processor (or processor family) the program has been “coded” for:
    Fig.  81. The program code produced has to match the CPU’s instruction set. Otherwise it cannot be run.
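    If you want to follow what the small listing above actually does, the sketch below steps through it in Python. It is a toy interpreter of my own that only knows the three instructions mov, inc and add and the two registers AX and BX; the "proc near" line is left out since it is a directive rather than an instruction. A real processor does of course not work on text lines like this.

```python
def run(program):
    """Execute a tiny subset of x86-style assembler on two 16-bit registers."""
    regs = {"AX": 0, "BX": 0}

    def value(operand):
        # An operand is either a register name or a literal number.
        return regs[operand] if operand in regs else int(operand)

    for line in program:
        op, _, rest = line.partition(" ")
        args = [a.strip() for a in rest.split(",")]
        if op == "mov":
            regs[args[0]] = value(args[1]) & 0xFFFF
        elif op == "inc":
            regs[args[0]] = (regs[args[0]] + 1) & 0xFFFF
        elif op == "add":
            regs[args[0]] = (regs[args[0]] + value(args[1])) & 0xFFFF
        else:
            raise ValueError(f"unsupported instruction: {line}")
    return regs

listing = ["mov AX,01", "mov BX,01", "inc AX", "add BX,AX"]
final = run(listing)
print(final)  # AX ends up as 2 and BX as 3
```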
    The processors from AMD and Intel which we have been focusing on in this guide, are compatible, in that they understand the same instructions.
    There can be big differences in the way two processors, such as the Pentium and Pentium 4, process the instructions internally. But externally – from the programmer’s perspective – they all basically function the same way. All the processors in the PC family (regardless of manufacturer) can execute the same instructions and hence the same programs.
    And that’s precisely the advantage of the PC: Regardless of which PC you have, it can run the Windows programs you want to use.
    Fig.  82. The x86 instruction set is common to all PC’s.
    As the years have passed, changes have been made in the instruction set along the way. A PC with a Pentium 4 processor from 2002 can handle very different applications to those which an IBM XT with an 8088 processor from 1985 can. But on the other hand, you can expect all the programs which could run on the 8088, to still run on a Pentium 4 and on an Athlon 64. The software is backwards compatible.
    The entire software industry built up around the PC is based on the common x86 instruction set, which goes back to the earliest PC’s. Extensions have been made, but the original instruction set from 1979 is still being used.

    x86 and CISC

    People sometimes differentiate between RISC and CISC based CPU’s. The (x86) instruction set of the original Intel 8086 processor is of the CISC type, which stands for Complex Instruction Set Computer.
    That means that the instructions are quite diverse and complex. The individual instructions vary in length from 8 to 120 bits. The instruction set was designed for the 8086 processor, with just 29,000 transistors. The opposite of CISC is RISC.
    RISC stands for Reduced Instruction Set Computer, which is fundamentally a completely different type of instruction set to CISC. RISC instructions can all have the same length (e.g. 32 bits). They can therefore be executed much faster than CISC instructions. Modern CPU’s like the AthlonXP and Pentium 4 are based on a mixture of RISC and CISC.
    Fig.  83. PC’s running Windows still work with the old fashioned CISC instructions.
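    One reason why fixed-length instructions are easier to handle can be sketched in a few lines of Python. With a fixed length, the processor knows in advance where every instruction starts; with variable lengths, it has to look at each instruction before it can find the next one. The encoding used here (the first byte states the length) is a toy of my own and is not how x86 instructions are really encoded.

```python
def split_fixed(code, width=4):
    """Fixed-length instructions: every boundary is known in advance."""
    return [code[i:i + width] for i in range(0, len(code), width)]

def split_variable(code):
    """Toy variable-length encoding (an assumption for illustration):
    the first byte of each instruction states its total length in bytes."""
    instructions = []
    position = 0
    while position < len(code):
        length = code[position]
        if length == 0:
            raise ValueError("an instruction cannot have length zero")
        instructions.append(code[position:position + length])
        position += length  # the next start is only known after reading this one
    return instructions

fixed = split_fixed(bytes(range(8)))
variable = split_variable(bytes([1, 3, 0xAA, 0xBB, 2, 0xCC]))

print([len(i) for i in fixed])
print([len(i) for i in variable])
```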
    In order to maintain compatibility with the older DOS/Windows programs, the later CPU’s still understand CISC instructions. They are just converted to shorter, more RISC-like, sub-operations (called micro-ops), before being executed. Most CISC instructions can be converted into 2-3 micro-ops.
    Fig.  84. The CISC instructions are decoded before being executed in a modern processor. This preserves compatibility with older software.
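    The conversion into micro-ops can be illustrated with a small sketch. Here a toy instruction which adds a register to a value in memory is split into three simpler steps (load, add, store), while a plain register addition stays as one step. The micro-op names and the tuple format are hypothetical; the real micro-ops are internal to each processor design.

```python
def decode(instruction):
    """Split one toy CISC-style instruction into simpler micro-ops."""
    op, dest, src = instruction
    if dest.startswith("["):
        # The destination is a memory location: fetch it, work on it, write it back.
        address = dest.strip("[]")
        return [
            ("load", "tmp", address),
            (op, "tmp", src),
            ("store", address, "tmp"),
        ]
    # A register-to-register operation is already simple enough.
    return [(op, dest, src)]

memory_add = decode(("add", "[total]", "AX"))
register_add = decode(("add", "BX", "AX"))

for micro_op in memory_add:
    print(micro_op)
```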

    Extensions to the instruction set

    For each new generation of CPU’s, the original instruction set has been extended. The 80386 processor added 26 new instructions, the 80486 added six, and the Pentium added eight new instructions.
    At the same time, execution of the instructions was made more efficient. For example, it took an 80386 processor six clock ticks to add one number to a running summation. This task could be done in the 80486 (see page 40), in just two clock ticks, due to more efficient decoding of the instructions.
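    A quick calculation shows what the difference between six and two clock ticks means in practice. The clock frequency of 33 MHz and the one million additions are just assumptions I have picked for the example.

```python
def seconds_for_additions(count, ticks_per_addition, clock_hz):
    """Time needed when every addition costs a fixed number of clock ticks."""
    return count * ticks_per_addition / clock_hz

CLOCK_HZ = 33_000_000   # assumed: both processors running at 33 MHz
ADDITIONS = 1_000_000   # assumed workload

time_386 = seconds_for_additions(ADDITIONS, 6, CLOCK_HZ)
time_486 = seconds_for_additions(ADDITIONS, 2, CLOCK_HZ)

print(f"80386: {time_386:.3f} s, 80486: {time_486:.3f} s")
print(f"speed-up: about {time_386 / time_486:.1f} times")
```

    At the same clock frequency, the 80486 gets through this particular task about three times as fast.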
    These changes have meant that certain programs require at least a 386 or a Pentium processor in order to run. This is true, for example, of all Windows programs. Since then, the MMX and SSE extensions have followed, which are completely new instruction sets which will be discussed later in the guide. They can make certain parts of program execution much more efficient.
    Another innovation is the 64-bit extension, which both AMD and Intel use in their top processors. Normally the PC operates in 32-bit mode, but one way to improve performance is to use a 64-bit mode. This requires new software, which is not available yet.

    9. Inside the CPU

    Instructions have to be decoded, and not least, executed, in the CPU. I won’t go into details on this subject; it is much too complicated. But I will describe a few factors which relate to the execution of instructions. My description has been extremely simplified, but it is relevant to the understanding of the microprocessor. This chapter is probably the most complicated one in the guide – you have been warned! It’s about:

  • Pipelines

  • Execution units
    If we continue to focus on speeding up the processor’s work, this optimisation must also apply to the instructions – the quicker we can shove them through the processor, the more work it can get done.

    Pipelines

    As mentioned before, instructions are sent from the software and are broken down into micro-ops (smaller sub-operations) in the CPU. This decomposition and execution takes place in a pipeline.
    The pipeline is like a reverse assembly line. The CPU’s instructions are broken apart (decoded) at the start of the pipeline. They are converted into small sub-operations (micro-ops), which can then be processed one at a time in the rest of the pipeline:
    Fig.  85. First the CISC instructions are decoded and converted into more digestible micro instructions. Then these are processed. It all takes place in the pipeline.
    The pipeline is made up of a number of stages. Older processors have only a few stages, while the newer ones have many (from 10 to 31). At each stage “something” is done with the instruction, and each stage requires one clock tick from the processor.
    Fig.  86. The pipeline is an assembly line (shown here with 9 stages), where each clock tick leads to the execution of a sub-instruction.
    Modern CPU’s have more than one pipeline, and can thus process several instructions at the same time. For example, the Pentium 4 and AthlonXP can decode about 2.5 instructions per clock tick.
    The first Pentium 4 has several very long pipelines, allowing the processor to hold up to 126 instructions in total, which are all being processed at the same time, but at different stages of execution (see Fig. 88). It is thus possible to get the CPU to perform more work by letting several pipelines work in parallel:
    Fig.  87. Having two pipelines allows twice as many instructions to be executed within the same number of clock ticks.
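    The effect of the assembly line can be put into numbers with a very idealised model: every stage takes one clock tick, the instructions do not depend on each other, and the pipeline never has to wait for data. Under those assumptions (which a real processor does not meet), the sketch below counts the clock ticks needed with one and with two pipelines of 9 stages.

```python
import math

def ticks_needed(instructions, stages, pipelines=1):
    """Clock ticks for an idealised pipeline: the first instruction needs
    `stages` ticks, and after that one instruction per pipeline finishes
    on every tick."""
    if instructions == 0:
        return 0
    per_pipeline = math.ceil(instructions / pipelines)
    return stages + per_pipeline - 1

STAGES = 9

without_pipelining = 100 * STAGES              # one instruction at a time
one_pipeline = ticks_needed(100, STAGES)
two_pipelines = ticks_needed(200, STAGES, pipelines=2)

print(without_pipelining, one_pipeline, two_pipelines)
```

    In this model, two pipelines get through 200 instructions in the same number of clock ticks that one pipeline needs for 100.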

    CPU                                   Instructions executed at the same time
    AMD K6-II                             24
    Intel Pentium III                     40
    AMD Athlon                            72
    Intel Pentium 4 (first generation)    126

    Fig.  88. By making use of more, and longer, pipelines, processors can execute more instructions at the same time.

    The problems of having more pipelines

    One might imagine that the engineers at Intel and AMD could just make even more parallel pipelines in the one CPU. Perhaps performance could be doubled? Unfortunately it is not that easy.
    It is not possible to feed a large number of pipelines with data. The memory system is just not powerful enough. Even with the existing pipelines, a fairly large number of clock ticks are wasted. The processor core is simply not utilised efficiently enough, because data cannot be brought to it quickly enough.
    Another problem of having several pipelines arises when the processor can decode several instructions in parallel – each in its own pipeline. It is impossible to avoid the wrong instruction occasionally being read in (out of sequence). This is called misprediction and results in a number of wasted clock ticks, since another instruction has to be fetched and run through the “assembly line”.
    Intel has tried to tackle this problem using a Branch Prediction Unit, which constantly attempts to guess the correct instruction sequence.
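    How can a processor "guess"? One classic textbook method is a small counter which remembers how a branch went the last few times. The sketch below shows such a two-bit counter in Python. I am not claiming that this is how Intel's Branch Prediction Unit is actually built; it is only meant to show the principle.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: 0 and 1 predict 'not taken',
    2 and 3 predict 'taken'."""

    def __init__(self):
        self.counter = 2  # start out weakly predicting 'taken'

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Move one step towards what really happened, staying within 0..3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

def count_mispredictions(outcomes):
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop that runs 10 times, three times over: the branch back to the
# start of the loop is taken 9 times and then not taken once.
loop_branch = ([True] * 9 + [False]) * 3
misses = count_mispredictions(loop_branch)
print(misses, "mispredictions out of", len(loop_branch))
```

    The counter only guesses wrong when each loop ends, so most of the time the pipeline is kept full of the right instructions.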

    Length of the pipe

    The number of “stations” (stages) in the pipeline varies from processor to processor. For example, in the Pentium II and III there are 10 stages, while there are up to 31 in the Pentium 4.
    In the Athlon, the ALU pipelines have 10 stages, while the FPU/MMX/SSE pipelines have 15.
    The longer the pipeline, the higher the processor’s clock frequency can be. This is because in the longer pipelines, the instructions are cut into more (and hence smaller) sub-instructions which can be executed more quickly.

    CPU                     Number of pipeline stages    Maximum clock frequency
    Pentium                 5                            300 MHz
    Motorola G4             4                            500 MHz
    Motorola G4e            7                            1000 MHz
    Pentium II and III      12                           1400 MHz
    Athlon XP               10/15                        2500 MHz
    Athlon 64               12/17                        >3000 MHz
    Pentium 4               20                           >3000 MHz
    Pentium 4 “Prescott”    31                           >5000 MHz

    Fig.  89. Higher clock frequencies require long “assembly lines” (pipelines).
    Note that the two AMD processors have different pipeline lengths for integer and floating point instructions. One can also measure a processor’s efficiency by looking at the IPC number (Instructions Per Clock), and AMD’s Athlon XP is well ahead of the Pentium 4 in this regard. AMD’s Athlon XP processors are actually much faster than the Pentium 4’s at equivalent clock frequencies.
    The same is even more true of the Motorola G4 processors used, for example, in Macintosh computers. The G4 only has a 4-stage pipeline, and can therefore, in principle, offer the same performance as a Pentium 4, with only half the clock frequency or less. The only problem is, the clock frequency can’t be raised very much with such a short pipeline. Intel have therefore chosen to future-proof the Pentium 4 by using a very long pipeline.
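    The relationship between IPC and clock frequency is simple arithmetic: the work done per second is roughly the number of instructions per clock tick multiplied by the number of clock ticks per second. The IPC values and clock frequencies below are made-up round numbers, chosen only to show the principle, and are not measurements of any real processor.

```python
def instructions_per_second(ipc, clock_hz):
    """Rough throughput: instructions per clock tick times ticks per second."""
    return ipc * clock_hz

# Hypothetical processors (the numbers are assumptions for illustration):
short_pipeline = instructions_per_second(ipc=2.0, clock_hz=1_000_000_000)
long_pipeline = instructions_per_second(ipc=1.0, clock_hz=2_000_000_000)

print(short_pipeline == long_pipeline)
```

    A processor with twice the IPC delivers the same throughput at half the clock frequency, which is the trade-off described above.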

    Execution units

    What is it that actually happens in the pipeline? This is where we find the so-called execution units. And we must distinguish between two types of unit:

  • ALU (Arithmetic and Logic Unit)

  • FPU (Floating Point Unit)
    If the processor has a brain, it is the ALU unit. It is the calculating device that does operations on whole numbers (integers). The computer’s work with ordinary text, for example, is looked after by the ALU.
    The ALU is good at working with whole numbers. When it comes to decimal numbers and especially numbers with many decimal places (real numbers as they are called in mathematics), the ALU chokes, and can take a very long time to process the operations. That is why an FPU is used to relieve the load. An FPU is a number cruncher, specially designed for floating point operations.
    There are typically several ALU’s and FPU’s in the same processor. The CPU also has other operation units, for example, the LSU (Load/Store Unit).

    An example sequence

    Look again at Fig. 73 on page 29. You can see that the processor core is right beside the L1 cache. Imagine that an instruction has to be processed:

  • The processor core fetches a long and complex x86 instruction from the L1 instruction cache.

  • The instruction is sent into the pipeline where it is broken down into smaller units.

  • If it is an integer operation, it is sent to an ALU, while floating point operations are sent to an FPU.

  • After processing, the data is sent back to the L1 cache.
    This description applies to the working cycle in, for example, the Pentium III and Athlon. As a diagram it might look like this:
    Fig.  90. The passage of instructions through the pipeline.
    But the way the relationship between the pipeline and the execution units is designed differs greatly from processor to processor. So this entire examination should be taken as a general introduction and nothing more.
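    The four steps in the example sequence can be sketched like this in Python. The two caches are plain Python lists and dictionaries, the decoding is reduced to a table lookup, and the unit names are only labels; all of it is my own simplification rather than a model of any particular processor.

```python
import operator

OPERATIONS = {"add": operator.add, "mul": operator.mul}

def choose_unit(a, b):
    """Integer work goes to an ALU, floating point work to an FPU."""
    return "FPU" if isinstance(a, float) or isinstance(b, float) else "ALU"

def process(instruction_cache, data_cache):
    """Fetch each instruction, decode it, send it to a unit and store the result."""
    log = []
    for op, dest, a, b in instruction_cache:   # 1. fetch from the instruction cache
        function = OPERATIONS[op]              # 2. decode (here just a lookup)
        unit = choose_unit(a, b)               # 3. pick ALU or FPU
        data_cache[dest] = function(a, b)      # 4. result goes back to the data cache
        log.append((dest, unit))
    return log

l1_instructions = [("add", "x", 2, 3), ("mul", "y", 1.5, 2.0)]
l1_data = {}
log = process(l1_instructions, l1_data)

print(log)
print(l1_data)
```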

    Pipelines in the Pentium 4

    In the Pentium 4, the instruction cache has been placed between the “Instruction fetch/Translate” unit (in Fig. 90) and the ALU/FPU. Here the instruction cache (Execution Trace Cache) doesn’t store the actual instructions, but rather the “half-digested” micro-ops.
    Fig.  91. In the Pentium 4, the instruction cache stores decoded micro instructions.
    The actual pipeline in the Pentium 4 is longer than in other CPU’s; it has 20 stages. The disadvantage of the long pipeline is that it takes more clock ticks to get an instruction through it. 20 stages require 20 clock ticks, and that reduces the CPU’s efficiency. This was very clear when the Pentium 4 was released; all tests showed that it was much slower than other processors with the same clock frequency.
    At the same time, the cost of reading the wrong instruction (misprediction) is much greater – it takes a lot of clock ticks to fill up the long assembly line again.
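    The cost of a long pipeline can be estimated with a rough model. Assume that the processor ideally finishes one instruction per clock tick, and that every misprediction costs as many clock ticks as there are stages in the pipeline, since the whole assembly line has to be refilled. Both that assumption and the misprediction rate of 2 percent are simplifications I have chosen for the example.

```python
def average_ticks_per_instruction(stages, mispredict_rate):
    """Idealised model: one tick per instruction, plus a refill of
    `stages` ticks for every mispredicted instruction."""
    return 1 + mispredict_rate * stages

MISPREDICT_RATE = 0.02  # assumed: 2 out of every 100 instructions

ten_stage = average_ticks_per_instruction(10, MISPREDICT_RATE)
twenty_stage = average_ticks_per_instruction(20, MISPREDICT_RATE)

print(f"10 stages: {ten_stage:.2f} ticks per instruction")
print(f"20 stages: {twenty_stage:.2f} ticks per instruction")
```

    With the same misprediction rate, the longer pipeline spends more clock ticks per instruction, so it needs a higher clock frequency just to keep up.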
    The Pentium 4’s architecture must therefore be seen from a longer-term perspective. Intel expects to be able to scale up the design to work at clock frequencies of up to 5-10 GHz. In the “Prescott” version of Pentium 4, the pipeline was increased further to 31 stages.
    AMD’s 32-bit Athlon line can barely make it much above a clock frequency of 2 GHz, because of the short pipeline. In comparison, the Pentium 4 is almost “light years” ahead.
  •