x86 (also known as 80x86
The term is not synonymous with IBM PC compatibility, as this implies a multitude of other computer hardware. and general-purpose computers used x86 chips before the PC-compatible market started, some of them before the IBM PC (1981) debut.
, most desktop computer and laptop computers sold are based on the x86 architecture family, while mobile categories such as or tablet computer are dominated by ARM. At the high end, x86 continues to dominate computation-intensive workstation and cloud computing segments.
A few years after the introduction of the 8086 and 8088, Intel added some complexity to its naming scheme and terminology as the "iAPX" of the ambitious but ill-fated Intel iAPX 432 processor was tried on the more successful 8086 family of chips, applied as a kind of system-level prefix. An 8086 system, including such as 8087 and 8089, and simpler Intel-specific system chips, was thereby described as an iAPX 86 system. There were also terms iRMX (for operating systems), iSBC (for single-board computers), and iSBX (for multimodule boards based on the 8086 architecture), all together under the heading Microsystem 80. However, this naming scheme was quite temporary, lasting for a few years during the early 1980s.
Although the 8086 was primarily developed for embedded systems and small multi-user or single-user computers, largely as a response to the successful 8080-compatible Zilog Z80, the x86 line soon grew in features and processing power. Today, x86 is ubiquitous in both stationary and portable personal computers, and is also used in midrange computers, , servers, and most new supercomputer computer cluster of the TOP500 list. A large amount of software, including a large list of are using x86-based hardware.
Modern x86 is relatively uncommon in , however; small low power applications (using tiny batteries), and low-cost microprocessor markets, such as and toys, lack significant x86 presence. Simple 8- and 16-bit based architectures are common here, as well as simpler RISC architectures like ARM and RISC-V, although the x86-compatible VIA C7, VIA Nano, AMD's Geode, Athlon Neo and Intel Atom are examples of 32- and 64-bit designs used in some relatively low-power and low-cost segments.
There have been several attempts, including by Intel, to end the market dominance of the "inelegant" x86 architecture designed directly from the first simple 8-bit microprocessors. Examples of this are the iAPX 432 (a project originally named the Intel 8800), the Intel 960, Intel 860 and the Intel/Hewlett-Packard Itanium architecture. However, the continuous refinement of x86 microarchitectures, circuitry and semiconductor manufacturing would make it hard to replace x86 in many segments. AMD's 64-bit extension of x86 (which Intel eventually responded to with a compatible design) and the scalability of x86 chips in the form of modern multi-core CPUs, is underlining x86 as an example of how continuous refinement of established industry standards can resist the competition from completely new architectures.
For some advanced features, x86 may require a license from Intel, though some do not need it; x86-64 may require an additional license from AMD. The Pentium Pro processor (and NetBurst) has been on the market for more than 21 years and so cannot be subject to patent claims. The i686 subset of the x86 architecture is therefore fully open. The Opteron 1000 series processors have been on the market for more than 21 years and so cannot be subject to patent claims. The AMD K8 subset of the x86 architecture is therefore fully open.
+Chronology of x86 processors | |||||
16-bit ISA, IBM PC (8088), IBM PC/XT (8088) | |||||
8086-2 ISA, embedded (80186/80188) | |||||
protected mode, IBM PC/XT 286, IBM PC/AT | |||||
32-bit ISA, paging, IBM PS/2 | |||||
pipelining, on-die x87 FPU (486DX), on-die CPU cache | |||||
Superscalar, 64-bit databus, faster FPU, MMX (Pentium MMX), APIC, SMP | |||||
Discrete microarchitecture (Micro-operation translation) | |||||
dynamic execution | |||||
μ-op translation, conditional move instructions, dynamic execution, speculative execution, 3-way x86 superscalar, superscalar FPU, PAE, on-chip L2 cache | |||||
on-package (Pentium II) or on-die (Celeron) L2 Cache, SSE (Pentium III), Slot 1, Socket 370 or Slot 2 (Xeon) | |||||
3DNow!, 3-level cache system (K6-III) | |||||
MMX+, 3DNow!+, double-pumped bus, Slot A or Socket A | |||||
CMS powered x86 platform processor, VLIW-128 core, on-die memory controller, on-die PCI bridge logic | |||||
SSE2, Hyper-Threading (Northwood), NetBurst, quad-pumped bus, Trace Cache, Socket 478 | |||||
Micro-op fusion, XD bit (Dothan) (Intel Core "Yonah") | |||||
CMS 6.0.4, VLIW-256, NX bit, Hyper Transport | |||||
64-bit EPIC architecture, 128-bit VLIW instruction bundle, on-die hardware IA-32 H/W enabling x86 OSes & x86 applications (early generations), software IA-32 EL enabling x86 applications (Itanium 2), Itanium register files are remapped to x86 registers | |||||
x86-64 is the 64-bit extended architecture of x86, its Legacy Mode preserves the entire and unaltered x86 architecture. The native architecture of x86-64 processors: residing in the 64-bit Mode, lacks of access mode in segmentation, presenting 64-bit architectural-permit linear address space; an adapted IA-32 architecture residing in the Compatibility Mode alongside 64-bit Mode is provided to support most x86 applications | |||||
AMD64 (except some Sempron processors presented as purely x86 processors), on-die memory controller, HyperTransport, on-die dual-core (X2), AMD-V (Athlon 64 Orleans), Socket 754/939/940 or AM2 | |||||
EM64T (enabled on selected models of Pentium 4 and Celeron D), SSE3, 2nd gen. NetBurst pipelining, dual-core (on-die: Pentium D 8xx, on-chip: Pentium D 9xx), Intel VT (Pentium 4 6x2), socket LGA 775 | |||||
Intel 64 (<<== EM64T), SSSE3 (65 nm), wide dynamic execution, μ-op fusion, macro-op fusion in 16-bit and 32-bit mode, on-chip quad-core(Core 2 Quad), Smart Shared L2 Cache (Intel Core 2 "Merom") | |||||
Monolithic quad-core (X4)/triple-core (X3), SSE4a, Rapid Virtualization Indexing (RVI), HyperTransport 3, AM2+ or AM3 | |||||
SSE4.1 | |||||
netbook or low power smart device processor, P54C core reused | |||||
QuickPath, on-chip GMCH (Clarkdale), SSE4.2, Extended Page Tables (EPT) for virtualization, macro-op fusion in 64-bit mode, (Intel Xeon "Bloomfield" with Nehalem microarchitecture) | |||||
hardware-based encryption; adaptive power management | |||||
octa-core, CMT(Clustered Multi-Thread), FMA, OpenCL, AM3+ | |||||
on-die GPGPU, PCI Express 2.0, Socket FM1 | |||||
low power smart device APU | |||||
Internal Ring connection, decoded μ-op cache, LGA 1155 socket | |||||
AVX, Bulldozer based APU, Socket FM2 or Socket FM2+ | |||||
PCI-E add-on card coprocessor for XEON based system, Manycore Chip, In-order P54C, very wide VPU (512-bit SSE), LRBni instructions (8× 64-bit) | |||||
SoC, game console and low power smart device processor | |||||
SoC, low/ultra-low power smart device processor | |||||
AVX2, FMA3, TSX, BMI1, and BMI2 instructions, LGA 1150 socket | |||||
SoC, on-chip Broadwell-U PCH-LP (Multi-chip module) | |||||
AVX-512 (restricted to Cannon Lake-U and workstation/server variants of Skylake) | |||||
Manycore CPU and coprocessor for Xeon systems, Airmont (Atom) based core | |||||
Integrated FCH on die, SoC, AM4 socket | |||||
AMD's implementation of SMT, on-chip multiple dies | |||||
Zhaoxin's first brand new x86-64 architecture | |||||
Intel's first implementation of AVX-512 for the consumer segment. Addition of Vector Neural Network Instructions (VNNI) | |||||
2019 | AMD AMD Matisse | 48-bit | Multiple Chip Module design with I/O die separate from CPU die(s), Support for PCIe Gen4 | ||
Intel Willow Cove (Tiger Lake-Y/U/H) | Dual ring interconnect architecture, updated Gaussian Neural Accelerator (GNA2), new AVX-512 Vector Intersection Instructions, addition of Control-Flow Enforcement Technology (CET) | ||||
Hybrid design with performance (Golden Cove) and efficiency cores (Gracemont), support for PCIe Gen5 and DDR5, updated Gaussian Neural Accelerator (GNA3). AVX-512 not officially supported | |||||
2022 | AMD AMD Vermeer (5800X3D) | 48-bit | X3D chips have an additional 64MB 3D vertically stacked L3 cache (3D V-Cache) for up to 96MB L3 Cache | ||
2022 | AMD AMD Raphael | AMD's first implementation of AVX-512 for the consumer segment, iGPU now standard on Ryzen CPU's with 2 RDNA 2 compute cores |
Such x86 implementations were seldom simple copies but often employed different internal microarchitectures and different solutions at the electronic and physical levels. Quite naturally, early compatible microprocessors were 16-bit, while 32-bit designs were developed much later. For the personal computer market, real quantities started to appear around 1990 with i386 and i486 compatible processors, often named similarly to Intel's original chips.
After the fully pipelined i486, in 1993 Intel introduced the Pentium brand name (which, unlike numbers, could be ) for their new set of superscalar x86 designs. With the x86 naming scheme now legally cleared, other x86 vendors had to choose different names for their x86-compatible products, and initially some chose to continue with variations of the numbering scheme: IBM partnered with Cyrix to produce the 5x86 and then the very efficient 6x86 (M1) and 6x86MX (MII) lines of Cyrix designs, which were the first x86 microprocessors implementing register renaming to enable speculative execution.
AMD meanwhile designed and manufactured the advanced but delayed 5k86 (K5), which, internally, was closely based on AMD's earlier 29K RISC design; similar to NexGen's Nx586, it used a strategy such that dedicated pipeline stages decode x86 instructions into uniform and easily handled , a method that has remained the basis for most x86 designs to this day.
Some early versions of these microprocessors had heat dissipation problems. The 6x86 was also affected by a few minor compatibility problems, the Nx586 lacked a floating-point unit (FPU) and (the then crucial) pin-compatibility, while the K5 had somewhat disappointing performance when it was (eventually) introduced.
Customer ignorance of alternatives to the Pentium series further contributed to these designs being comparatively unsuccessful, despite the fact that the K5 had very good Pentium compatibility and the 6x86 was significantly faster than the Pentium on integer code. AMD later managed to grow into a serious contender with the K6 set of processors, which gave way to the very successful Athlon and Opteron.
There were also other contenders, such as Centaur Technology (formerly IDT), Rise Technology, and Transmeta. VIA Technologies' energy efficient C3 and C7 processors, which were designed by the Centaur company, were sold for many years following their release in 2005. Centaur's 2008 design, the VIA Nano, was their first processor with superscalar and speculative execution. It was introduced at about the same time (in 2008) as Intel introduced the Intel Atom, its first "in-order" processor after the P5 Pentium.
Many additions and extensions have been added to the original x86 instruction set over the years, almost consistently with full backward compatibility. The architecture family has been implemented in processors from Intel, Cyrix, AMD, VIA Technologies and many other companies; there are also open implementations, such as the Zet SoC platform (currently inactive). Nevertheless, of those, only Intel, AMD, VIA Technologies, and DM&P Electronics hold x86 architectural licenses, and from these, only the first two actively produce modern 64-bit designs, leading to what has been called a "duopoly" of Intel and AMD in x86 processors.
However, in 2014 the Shanghai-based Chinese company Zhaoxin, a joint venture between a Chinese company and VIA Technologies, began designing VIA based x86 processors for desktops and laptops. The release of its newest "7" family of x86 processors (e.g. KX-7000), which are not quite as fast as AMD or Intel chips but are still state of the art, had been planned for 2021; as of March 2022 the release had not taken place, however.
In 1999–2003, AMD extended this 32-bit architecture to 64 bits and referred to it as x86-64 in early documents and later as AMD64. Intel soon adopted AMD's architectural extensions under the name IA-32e, later using the name EM64T and finally using Intel 64. Microsoft and Sun Microsystems/Oracle also use term "x64", while many Linux distributions, and the also use the "amd64" term. Microsoft Windows, for example, designates its 32-bit versions as "x86" and 64-bit versions as "x64", while installation files of 64-bit Windows versions are required to be placed into a directory called "AMD64".
The draft specification received multiple updates, reaching version 1.2 by June 2024. It was eventually abandoned as of December 2024, following the formation of the x86 Ecosystem Advisory Group by Intel and AMD.
A processor implementing this proposal would have lacked support for legacy mode, started execution directly in long mode and provided a way to switch to 5-level paging without going through the unpaged mode.
The new architecture would have removed support for 16-bit and 32-bit operating systems. 32-bit code would have only been supported for user applications running in ring 3, and would have used the same simplified segmentation as long mode.
Specific removed features would have included:
To further conserve encoding space, most registers are expressed in using three or four bits, the latter via an opcode prefix in 64-bit mode, while at most one operand to an instruction can be a memory location. However, this memory operand may also be the destination (or a combined source and destination), while the other operand, the source, can be either register or immediate. Among other factors, this contributes to a code size that rivals eight-bit machines and enables efficient use of instruction cache memory. The relatively small number of general registers (also inherited from its 8-bit ancestors) has made register-relative addressing (using small immediate offsets) an important method of accessing operands, especially on the stack. Much work has therefore been invested in making such accesses as fast as register accesses—i.e., a one cycle instruction throughput, in most circumstances where the accessed data is available in the top-level cache.
The presence of wide SIMD registers means that existing x86 processors can load or store up to 128 bits of memory data in a single instruction and also perform bitwise operations (although not integer arithmetic) on full 128-bits quantities in parallel. Intel's Sandy Bridge processors added the Advanced Vector Extensions (AVX) instructions, widening the SIMD registers to 256 bits. The Intel Initial Many Core Instructions implemented by the Knights Corner Xeon Phi processors, and the AVX-512 instructions implemented by the Knights Landing Xeon Phi processors and by Skylake-X processors, use 512-bit wide SIMD registers.
When introduced, in the mid-1990s, this method was sometimes referred to as a "RISC core" or as "RISC translation", partly for marketing reasons, but also because these micro-operations share some properties with certain types of RISC instructions. However, traditional microcode (used since the 1950s) also inherently shares many of the same properties; the new method differs mainly in that the translation to micro-operations now occurs asynchronously. Not having to synchronize the execution units with the decode steps opens up possibilities for more analysis of the (buffered) code stream, and therefore permits detection of operations that can be performed in parallel, simultaneously feeding more than one execution unit.
The latest processors also do the opposite when appropriate; they combine certain x86 sequences (such as a compare followed by a conditional jump) into a more complex micro-op which fits the execution model better and thus can be executed faster or with fewer machine resources involved.
Another way to try to improve performance is to cache the decoded micro-operations, so the processor can directly access the decoded micro-operations from a special cache, instead of decoding them again. Intel followed this approach with the Execution Trace Cache feature in their NetBurst microarchitecture (for Pentium 4 processors) and later in the Decoded Stream Buffer (for Core-branded processors since Sandy Bridge).
Transmeta used a completely different method in their Transmeta Crusoe x86 compatible CPUs. They used just-in-time translation to convert x86 instructions to the CPU's native VLIW instruction set. Transmeta argued that their approach allows for more power efficient designs since the CPU can forgo the complicated decode step of more traditional x86 implementations.
\mathtt{CS}: \\ \mathtt{DS}: \\ \mathtt{SS}: \\ \mathtt{ES}:\end{matrix}\ \ \begin{matrix} \\
\begin{bmatrix} \mathtt{BX} \\ \mathtt{BP} \end{bmatrix} + \begin{bmatrix} \mathtt{SI} \\ \mathtt{DI} \end{bmatrix} \\ \\\end{matrix} + \rm displacement
Addressing modes for 32-bit x86 processor modes can be summarized by the formula:
\mathtt{CS}: \\ \mathtt{DS}: \\ \mathtt{SS}: \\ \mathtt{ES}: \\ \mathtt{FS}: \\ \mathtt{GS}:\end{matrix}\ \ \begin{bmatrix}
\mathtt{EAX} \\ \mathtt{EBX} \\ \mathtt{ECX} \\ \mathtt{EDX} \\ \mathtt{ESP} \\ \mathtt{EBP} \\ \mathtt{ESI} \\ \mathtt{EDI}\end{bmatrix} + \begin{pmatrix}\\
\begin{bmatrix} \mathtt{EAX} \\ \mathtt{EBX} \\ \mathtt{ECX} \\ \mathtt{EDX} \\ \mathtt{EBP} \\ \mathtt{ESI} \\ \mathtt{EDI} \end{bmatrix} * \begin{bmatrix} 1 \\ 2 \\ 4 \\ 8 \end{bmatrix} \\ \\\end{pmatrix} + \rm displacement
Addressing modes for the 64-bit processor mode can be summarized by the formula:
\begin{matrix} \mathtt{FS}: \\ \mathtt{GS}: \end{matrix}\ \ \begin{bmatrix} \vdots \\ \mathtt{GPR} \\ \vdots \end{bmatrix} + \begin{pmatrix} \\ \begin{bmatrix} \vdots \\ \mathtt{GPR} \\ \vdots \\ \end{bmatrix} * \begin{bmatrix} 1\\2\\4\\8 \end{bmatrix} \\ \\ \end{pmatrix} \\ \\ \hline \\ \begin{matrix} \mathtt{RIP} \end{matrix} \\ \\\end{Bmatrix} + \rm displacement
Instruction relative addressing in 64-bit code (RIP + displacement, where RIP is the instruction pointer register) simplifies the implementation of position-independent code (as used in shared libraries in some operating systems).
The 8086 had of eight-bit (or alternatively ) I/O space, and a (one segment) stack in memory supported by computer hardware. Only words (two bytes) can be pushed to the stack. The stack grows toward numerically lower addresses, with pointing to the most recently pushed item. There are 256 , which can be invoked by both hardware and software. The interrupts can cascade, using the stack to store the Return statement.
One of four possible 'segment registers' (CS, DS, SS and ES) is used to form a memory address. In the original 8086 / 8088 / 80186 / 80188 every address was built from a segment register and one of the general purpose registers. For example ds:si is the notation for an address formed as 16 to allow 20-bit addressing rather than 16 bits, although this changed in later processors. At that time only certain combinations were supported.
The FLAGS register contains flags such as carry flag, overflow flag and zero flag. Finally, the instruction pointer (IP) points to the next instruction that will be fetched from memory and then executed; this register cannot be directly accessed (read or written) by a program.
The Intel 80186 and 80188 are essentially an upgraded 8086 or 8088 CPU, respectively, with on-chip peripherals added, and they have the same CPU registers as the 8086 and 8088 (in addition to interface registers for the peripherals).
The 8086, 8088, 80186, and 80188 can use an optional floating-point coprocessor, the 8087. The 8087 appears to the programmer as part of the CPU and adds eight 80-bit wide registers, st(0) to st(7), each of which can hold numeric data in one of seven formats: 32-, 64-, or 80-bit floating point, 16-, 32-, or 64-bit (binary) integer, and 80-bit packed decimal integer. It also has its own 16-bit status register accessible through the instruction, and it is common to simply use some of its bits for branching by copying it into the normal FLAGS.
In the Intel 80286, to support protected mode, three special registers hold descriptor table addresses (GDTR, LDTR, IDTR), and a fourth task register (TR) is used for task switching. The 80287 is the floating-point coprocessor for the 80286 and has the same registers as the 8087 with the same data formats.
Two new segment registers (FS and GS) were added. With a greater number of registers, instructions and operands, the machine code format was expanded. To provide backward compatibility, segments with executable code can be marked as containing either 16-bit or 32-bit instructions. Special prefixes allow inclusion of 32-bit instructions in a 16-bit segment or vice versa.
The 80386 had an optional floating-point coprocessor, the 80387; it had eight 80-bit wide registers: st(0) to st(7), like the 8087 and 80287. The 80386 could also use an 80287 coprocessor. With the 80486 and all subsequent x86 models, the floating-point processing unit (FPU) is integrated on-chip.
The Pentium MMX added eight 64-bit MMX integer vector registers (MM0 to MM7, which share lower bits with the 80-bit-wide FPU stack). With the Pentium III, Intel added a 32-bit Streaming SIMD Extensions (SSE) control/status register (MXCSR) and eight 128-bit SSE floating-point registers (XMM0 to XMM7).
32-bit x86 processors (starting with the 80386) also include various special/miscellaneous registers such as (CR0 through 4, CR8 for 64-bit only), (DR0 through 3, plus 6 and 7), (TR3 through 7; 80486 only), and model-specific registers (MSRs, appearing with the Pentium).
AVX-512 has eight extra 64-bit mask registers K0–K7 for selecting elements in a vector register. Depending on the vector register and element widths, only a subset of bits of the mask register may be used by a given instruction.
Segment registers:
No particular purposes were envisioned for the other 8 registers available only in 64-bit mode.
Some instructions compile and execute more efficiently when using these registers for their designed purpose. For example, using AL as an accumulator and adding an immediate byte value to it produces the efficient add to AL opcode of 04h, whilst using the BL register produces the generic and longer add to register opcode of 80C3h. Another example is double precision division and multiplication that works specifically with the AX and DX registers.
Modern compilers benefited from the introduction of the sib byte ( scale-index-base byte) that allows registers to be treated uniformly (minicomputer-like). However, using the sib byte universally is non-optimal, as it produces longer encodings than only using it selectively when necessary. (The main benefit of the sib byte is the orthogonality and more powerful addressing modes it provides, which make it possible to save instructions and the use of registers for address calculations such as scaling an index.) Some special instructions lost priority in the hardware design and became slower than equivalent small code sequences. A notable example is the LODSW instruction.
+ General Purpose Registers (A, B, C and D) ! style="width:50pt;" | 64 ! style="width:50pt;" | 56 ! style="width:50pt;" | 48 ! style="width:50pt;" | 40 ! style="width:50pt;" | 32 ! style="width:50pt;" | 24 ! style="width:50pt;" | 16 ! style="width:50pt;" | 8 |
R?X | ||||||||
E?X | ||||||||
?X | ||||||||
?H | ?L |
+ 64-bit mode-only General Purpose Registers (R8, R9, R10, R11, R12, R13, R14, R15) ! style="width:50pt;" | 64 ! style="width:50pt;" | 56 ! style="width:50pt;" | 48 ! style="width:50pt;" | 40 ! style="width:50pt;" | 32 ! style="width:50pt;" | 24 ! style="width:50pt;" | 16 ! style="width:50pt;" | 8 |
? | ||||||||
?D | ||||||||
?W | ||||||||
?B |
+ Segment Registers (C, D, S, E, F and G) ! style="width:50pt;" | 16 ! style="width:50pt;" | 8 |
?S |
+ Pointer Registers (S and B) ! style="width:50pt;" | 64 ! style="width:50pt;" | 56 ! style="width:50pt;" | 48 ! style="width:50pt;" | 40 ! style="width:50pt;" | 32 ! style="width:50pt;" | 24 ! style="width:50pt;" | 16 ! style="width:50pt;" | 8 |
R?P | ||||||||
E?P | ||||||||
?P | ||||||||
?PL |
+ Index Registers (S and D) ! style="width:50pt;" | 64 ! style="width:50pt;" | 56 ! style="width:50pt;" | 48 ! style="width:50pt;" | 40 ! style="width:50pt;" | 32 ! style="width:50pt;" | 24 ! style="width:50pt;" | 16 ! style="width:50pt;" | 8 |
R?I | ||||||||
E?I | ||||||||
?I | ||||||||
?IL |
+ Instruction Pointer Register (I) ! style="width:50pt;" | 64 ! style="width:50pt;" | 56 ! style="width:50pt;" | 48 ! style="width:50pt;" | 40 ! style="width:50pt;" | 32 ! style="width:50pt;" | 24 ! style="width:50pt;" | 16 ! style="width:50pt;" | 8 |
RIP | ||||||||
EIP | ||||||||
IP |
In order to use more than 64 KB of memory, the segment registers must be used. This created great complications for compiler implementors who introduced odd pointer modes such as "near", "far" and "huge" to leverage the implicit nature of segmented architecture to different degrees, with some pointers containing 16-bit offsets within implied segments and other pointers containing segment addresses and offsets within segments. It is technically possible to use up to 256 KB of memory for code and data, with up to 64 KB for code, by setting all four segment registers once and then only using 16-bit offsets (optionally with default-segment override prefixes) to address memory, but this puts substantial restrictions on the way data can be addressed and memory operands can be combined, and it violates the architectural intent of the Intel designers, which is for separate data items (e.g. arrays, structures, code units) to be contained in separate segments and addressed by their own segment addresses, in new programs that are not ported from earlier 8-bit processors with 16-bit address spaces.
Each time a segment register is loaded in protected mode, the 80286 must read a 6-byte segment descriptor from memory into a set of hidden internal registers. Thus, loading segment registers is much slower in protected mode than in real mode, and changing segments very frequently is to be avoided. Actual memory operations using protected mode segments are not slowed much because the 80286 and later have hardware to check the offset against the segment limit in parallel with instruction execution.
The Intel 80386 extended offsets and also the segment limit field in each segment descriptor to 32 bits, enabling a segment to span the entire memory space. It also introduced support in protected mode for paging, a mechanism making it possible to use paged virtual memory (with 4 KB page size). Paging allows the CPU to map any page of the virtual memory space to any page of the physical memory space. To do this, it uses additional mapping tables in memory called page tables. Protected mode on the 80386 can operate with paging either enabled or disabled; the segmentation mechanism is always active and generates virtual addresses that are then mapped by the paging mechanism if it is enabled. The segmentation mechanism can also be effectively disabled by setting all segments to have a base address of 0 and size limit equal to the whole address space; this also requires a minimally-sized segment descriptor table of only four descriptors (since the FS and GS segments need not be used).
Paging is used extensively by modern multitasking operating systems. Linux, 386BSD and Windows NT were developed for the 386 because it was the first Intel architecture CPU to support paging and 32-bit segment offsets. The 386 architecture became the basis of all further development in the x86 series.
x86 processors that support protected mode boot into real mode for backward compatibility with the older 8086 class of processors. Upon power-on (a.k.a. booting), the processor initializes in real mode, and then begins executing instructions. Operating system boot code, which might be stored in read-only memory, may place the processor into the protected mode to enable paging and other features. Conversely, segment arithmetic, a common practice in real mode code, is not allowed in protected mode.
In 1999, AMD published a (nearly) complete specification for a 64-bit extension of the x86 architecture which they called x86-64 with claimed intentions to produce. That design is currently used in almost all x86 processors, with some exceptions intended for .
Mass-produced x86-64 chips for the general market were available four years later, in 2003, after the time was spent for working prototypes to be tested and refined; about the same time, the initial name x86-64 was changed to AMD64. The success of the AMD64 line of processors coupled with lukewarm reception of the IA-64 architecture forced Intel to release its own implementation of the AMD64 instruction set. Intel had previously implemented support for AMD64 but opted not to enable it in hopes that AMD would not bring AMD64 to market before Itanium's new IA-64 instruction set was widely adopted. It branded its implementation of AMD64 as EM64T, and later rebranded it Intel 64.
In its literature and product version names, Microsoft and Sun refer to AMD64/Intel 64 collectively as x64 in the Windows and Solaris operating systems. Linux distributions refer to it either as "x86-64", its variant "x86_64", or "amd64". BSD systems use "amd64" while macOS uses "x86_64".
Long mode is mostly an extension of the 32-bit instruction set, but unlike the 16–to–32-bit transition, many instructions were dropped in the 64-bit mode. This does not affect actual binary backward compatibility (which would execute legacy code in other modes that retain support for those instructions), but it changes the way assembler and compilers for new code have to work.
This was the first time that a major extension of the x86 architecture was initiated and originated by a manufacturer other than Intel. It was also the first time that Intel accepted technology of this nature from an outside source.
Each x87 register, known as ST(0) through ST(7), is 80 bits wide and stores numbers in the IEEE floating-point standard double extended precision format. These registers are organized as a stack with ST(0) as the top. This was done in order to conserve opcode space, and the registers are therefore randomly accessible only for either operand in a register-to-register instruction; ST0 must always be one of the two operands, either the source or the destination, regardless of whether the other operand is ST(x) or a memory operand. However, random access to the stack registers can be obtained through an instruction which exchanges any specified ST(x) with ST(0).
The operations include arithmetic and transcendental functions, including trigonometric and exponential functions, and instructions that load common constants (such as 0; 1; e, the base of the natural logarithm; log2(10); and log10(2)) into one of the stack registers. While the integer ability is often overlooked, the x87 can operate on larger integers with a single instruction than the 8086, 80286, 80386, or any x86 CPU without to 64-bit extensions can, and repeated integer calculations even on small values (e.g., 16-bit) can be accelerated by executing integer instructions on the x86 CPU and the x87 in parallel. (The x86 CPU keeps running while the x87 coprocessor calculates, and the x87 sets a signal to the x86 when it is finished or interrupts the x86 if it needs attention because of an error.)
MMX added 8 new registers to the architecture, known as MM0 through MM7 (henceforth referred to as MMn). In reality, these new registers were just aliases for the existing x87 FPU stack registers. Hence, anything that was done to the floating-point stack would also affect the MMX registers. Unlike the FP stack, these MMn registers were fixed, not relative, and therefore they were randomly accessible. The instruction set did not adopt the stack-like semantics so that existing operating systems could still correctly save and restore the register state when multitasking without modifications.
Each of the MMn registers are 64-bit integers. However, one of the main concepts of the MMX instruction set is the concept of packed data types, which means instead of using the whole register for a single 64-bit integer (quadword), one may use it to contain two 32-bit integers (doubleword), four 16-bit integers (word) or eight 8-bit integers (byte). Given that the MMX's 64-bit MMn registers are aliased to the FPU stack and each of the floating-point registers are 80 bits wide, the upper 16 bits of the floating-point registers are unused in MMX. These bits are set to all ones by any MMX instruction, which correspond to the floating-point representation of or infinities.
3DNow! was designed to be the natural evolution of MMX from integers to floating point. As such, it uses exactly the same register naming convention as MMX, that is MM0 through MM7. The only difference is that instead of packing integers into these registers, two single-precision floating-point numbers are packed into each register. The advantage of aliasing the FPU registers is that the same instruction and data structures used to save the state of the FPU registers can also be used to save 3DNow! register states. Thus no special modifications are required to be made to operating systems which would otherwise not know about them.
SSE discarded all legacy connections to the FPU stack. This also meant that this instruction set discarded all legacy connections to previous generations of SIMD instruction sets like MMX. But it freed the designers up, allowing them to use larger registers, not limited by the size of the FPU registers. The designers created eight 128-bit registers, named XMM0 through XMM7. (In AMD64, the number of SSE XMM registers has been increased from 8 to 16.) However, the downside was that operating systems had to have an awareness of this new set of instructions in order to be able to save their register states. So Intel created a slightly modified version of Protected mode, called Enhanced mode which enables the usage of SSE instructions, whereas they stay disabled in regular Protected mode. An OS that is aware of SSE will activate Enhanced mode, whereas an unaware OS will only enter into traditional Protected mode.
SSE is a SIMD instruction set that works only on floating-point values, like 3DNow!. However, unlike 3DNow! it severs all legacy connection to the FPU stack. Because it has larger registers than 3DNow!, SSE can pack twice the number of single precision floats into its registers. The original SSE was limited to only single-precision numbers, like 3DNow!. The SSE2 introduced the capability to pack double precision numbers too, which 3DNow! had no possibility of doing since a double precision number is 64-bit in size which would be the full size of a single 3DNow! MMn register. At 128 bits, the SSE XMMn registers could pack two double precision floats into one register. Thus SSE2 is much more suitable for scientific calculations than either SSE1 or 3DNow!, which were limited to only single precision. SSE3 does not introduce any additional registers.
In 2001, Intel attempted to introduce a non-x86 64-bit architecture named IA-64 in its Itanium processor, initially aiming for the high-performance computing market, hoping that it would eventually replace the 32-bit x86. While IA-64 was incompatible with x86, the Itanium processor did provide Emulator abilities for translating x86 instructions into IA-64, but this affected the performance of x86 programs so badly that it was rarely, if ever, actually useful to the users: programmers should rewrite x86 programs for the IA-64 architecture or their performance on Itanium would be orders of magnitude worse than on a true x86 processor. The market rejected the Itanium processor since it broke backward compatibility and preferred to continue using x86 chips, and very few programs were rewritten for IA-64.
AMD decided to take another path toward 64-bit memory addressing, making sure backward compatibility would not suffer. In April 2003, AMD released the first x86 processor with 64-bit general-purpose registers, the Opteron, capable of addressing much more than 4 Gigabyte of virtual memory using the new x86-64 extension (also known as AMD64 or x64). The 64-bit extensions to the x86 architecture were enabled only in the newly introduced long mode, therefore 32-bit and 16-bit applications and operating systems could simply continue using an AMD64 processor in protected or other modes, without even the slightest sacrifice of performance and with full compatibility back to the original instructions of the 16-bit Intel 8086. The market responded positively, adopting the 64-bit AMD processors for both high-performance applications and business or home computers.
Seeing the market rejecting the incompatible Itanium processor and Microsoft supporting AMD64, Intel had to respond and introduced its own x86-64 processor, the Prescott Pentium 4, in July 2004. As a result, the Itanium processor with its IA-64 instruction set is rarely used and x86, through its x86-64 incarnation, is still the dominant CPU architecture in non-embedded computers.
x86-64 also introduced the NX bit, which offers some protection against security bugs caused by .
As a result of AMD's 64-bit contribution to the x86 lineage and its subsequent acceptance by Intel, the 64-bit RISC architectures ceased to be a threat to the x86 ecosystem and almost disappeared from the workstation market. x86-64 began to be utilized in powerful (in its AMD Opteron and Intel Xeon incarnations), a market which was previously the natural habitat for 64-bit RISC designs (such as the IBM Power microprocessors or SPARC processors). The great leap toward 64-bit computing and the maintenance of backward compatibility with 32-bit and 16-bit software enabled the x86 architecture to become an extremely flexible platform today, with x86 chips being utilized from small low-power systems (for example, Intel Quark and Intel Atom) to fast gaming desktop computers (for example, Intel Core i7 and AMD FX/Ryzen), and even dominate large supercomputing computer cluster, effectively leaving only the ARM 32-bit and 64-bit RISC architecture as a competitor in the smartphone and tablet computer market.
The introduction of the AMD-V and Intel VT-x instruction sets in 2005 allowed x86 processors to meet the Popek and Goldberg virtualization requirements.
AVX-512 features yet another expansion to 32 512-bit ZMM registers and a new EVEX scheme. Unlike its predecessors featuring a monolithic extension, it is divided into many subsets that specific models of CPUs can choose to implement.
According to the architecture specification, the main features of APX are:
Extended GPRs for general purpose instructions are encoded using a 2-byte REX prefix prefix, while new instructions and extended operands for existing AVX/AVX2/AVX-512 instructions are encoded with an extended EVEX prefix which has four variants used for different groups of instructions.
|
|