In computing, especially digital signal processing, the multiply–accumulate operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC, or MAC unit); the operation itself is also often called a MAC or a MAC operation. The MAC operation modifies an accumulator a:

$a \leftarrow a + (b \times c)$
When done with floating-point numbers, it might be performed with two roundings (typical in many DSPs) or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC).
Modern computers may contain a dedicated MAC, consisting of a multiplier implemented in combinational logic followed by an adder and an accumulator register that stores the result. The output of the register is fed back to one input of the adder, so that on each clock cycle, the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers. The first processors to be equipped with MAC units were digital signal processors, but the technique is now also common in general-purpose processors.
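In software terms, the operation performed each cycle corresponds to the accumulation loop below, a minimal C sketch (the function name and signature are illustrative, not from any particular library):

```c
#include <stddef.h>

/* Software analogue of a MAC unit: each iteration multiplies one pair of
 * operands and adds the product to a running accumulator, mirroring what
 * the hardware does once per clock cycle. */
double mac_accumulate(const double *b, const double *c, size_t n) {
    double a = 0.0;                /* the accumulator register */
    for (size_t i = 0; i < n; i++)
        a += b[i] * c[i];          /* a <- a + (b * c) */
    return a;
}
```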
In floating-point arithmetic
When done with integers, the operation is typically exact (computed modulo some power of two). However, floating-point numbers have only a certain amount of mathematical precision. That is, digital floating-point arithmetic is generally not associative or distributive. (See Floating point#Accuracy problems.)
Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add). IEEE 754-2008 specifies that it must be performed with one rounding, yielding a more accurate result.
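For example, the following C sketch (the values are arbitrary illustrations) shows that merely regrouping a sum of doubles changes the result, which is why the number and placement of roundings matter:

```c
#include <stdio.h>

int main(void) {
    double big = 1e16, small = 1.0;

    /* Floating-point addition is not associative: each intermediate sum
     * is rounded to 53 significant bits, so grouping changes the result. */
    double left  = (big + -big) + small;   /* prints 1 */
    double right = big + (-big + small);   /* prints 0: "small" is lost */

    printf("%g %g\n", left, right);
    return 0;
}
```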
Fused multiply–add
A fused multiply–add (sometimes known as FMA or fmadd) is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a + b×c to its full precision before rounding the final result down to N significant bits.
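As a concrete illustration, the following C program (the input value is an arbitrary choice) uses the C99 fma() function to recover the rounding error of a product, something the unfused sequence cannot do because the product has already been rounded:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 1.0 + ldexp(1.0, -30); /* 1 + 2^-30; its exact square needs 61 bits */
    double p = x * x;                 /* product rounded to 53 significant bits */

    /* Unfused multiply-add of x*x and -p: the product is rounded before
     * the addition, so the difference is exactly zero. */
    double unfused = p + (-p);

    /* Fused multiply-add: x*x is kept to full precision inside fma(), so
     * the product's rounding error (2^-60) survives into the result. */
    double fused = fma(x, x, -p);

    printf("unfused: %g\n", unfused); /* 0 */
    printf("fused:   %g\n", fused);   /* about 8.7e-19 */
    return 0;
}
```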
A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:

Dot product

Matrix multiplication

Polynomial evaluation (e.g., with Horner's rule; see the sketch after this list)

Newton's method for evaluating functions.

Convolutions and artificial neural networks
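As one example from the list above, Horner's rule maps directly onto a chain of fused multiply–adds. The sketch below is a plain C illustration (the function name and argument layout are assumptions, not a standard interface); each coefficient costs one fma() call and each intermediate step is rounded only once:

```c
#include <math.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by Horner's rule,
 * using one fused multiply-add per coefficient (requires n >= 1). */
double horner_fma(const double *c, size_t n, double x) {
    double r = c[n - 1];
    for (size_t i = n - 1; i-- > 0; )
        r = fma(r, x, c[i]);       /* r <- r*x + c[i], single rounding */
    return r;
}
```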
Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can give problems if used unthinkingly. If x² − y² is evaluated as ((x×x) − y×y) using fused multiply–add, then the result may be negative even when x = y, due to the first multiplication discarding low significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated.
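A minimal C illustration of Kahan's point (the particular value of x is an arbitrary one chosen so that y×y rounds upward):

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* x == y, so x*x - y*y "should" be zero. */
    double x = 1.0 + 50000000.0 * DBL_EPSILON;
    double y = x;

    /* y*y is rounded (here, upward) before the fused step, while x*x is
     * carried at full precision inside fma(), so the difference comes
     * out slightly negative. */
    double d = fma(x, x, -(y * y));

    printf("d = %g\n", d);             /* a tiny negative number */
    printf("sqrt(d) = %g\n", sqrt(d)); /* NaN: the follow-on error */
    return 0;
}
```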
When implemented inside a microprocessor, an FMA can actually be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.
A useful benefit of including this instruction is that it allows an efficient software implementation of division (see division algorithm) and square root (see methods of computing square roots) operations, thus eliminating the need for dedicated hardware for those operations.
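The sketch below illustrates the idea for division only. It is not the algorithm of any particular processor: the linear initial estimate, the iteration count, and the omission of the final correction step are simplifying assumptions.

```c
#include <math.h>

/* Software division built from fused multiply-adds: refine a crude
 * reciprocal estimate with Newton-Raphson steps, then multiply by the
 * dividend. Assumes b is a positive, finite, normal number. */
double fma_divide(double a, double b) {
    int e;
    double m = frexp(b, &e);                /* b = m * 2^e with m in [0.5, 1) */
    double r = ldexp(48.0/17.0 - 32.0/17.0 * m, -e);  /* rough 1/b estimate */

    /* Each step roughly doubles the number of correct bits. The residual
     * 1 - b*r is formed with a single rounding by fma(), which is what
     * keeps the refinement accurate. */
    for (int i = 0; i < 4; i++) {
        double err = fma(-b, r, 1.0);       /* err = 1 - b*r */
        r = fma(r, err, r);                 /* r <- r + r*err */
    }
    return a * r;    /* a correctly rounded quotient needs one more correction */
}
```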
Dot product instruction
Some machines combine multiple fused multiply–add operations into a single step, e.g. performing a four-element dot product on two 128-bit SIMD registers a0×b0 + a1×b1 + a2×b2 + a3×b3 with single-cycle throughput.
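In scalar terms, such an instruction computes the equivalent of the following C model (the function is a description of the semantics, not a real intrinsic; whether each lane's product is fused with the additions or rounded separately depends on the machine):

```c
#include <math.h>

/* Model of a four-element dot-product step over two registers of packed
 * single-precision values. */
float dot4(const float a[4], const float b[4]) {
    float acc = 0.0f;
    for (int i = 0; i < 4; i++)
        acc = fmaf(a[i], b[i], acc);   /* a0*b0 + a1*b1 + a2*b2 + a3*b3 */
    return acc;
}
```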
Support
The FMA operation is included in IEEE 754-2008.
The DEC VAX's POLY instruction is used for evaluating polynomials with Horner's rule using a succession of multiply and add steps. Instruction descriptions do not specify whether the multiply and add are performed using a single fma step. This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977.
The 1999 standard of the C programming language supports the FMA operation through the fma standard math library function, and standard pragmas controlling optimizations based on FMA.
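For illustration, the program below calls fma() explicitly and uses the standard FP_CONTRACT pragma to keep the compiler from contracting the plain expression on its own (the chosen operands are arbitrary, and compiler support for the pragma varies):

```c
#include <math.h>
#include <stdio.h>

/* Forbid automatic contraction of a*b + c into a fused operation, so the
 * two forms below really differ. */
#pragma STDC FP_CONTRACT OFF

int main(void) {
    double a = 0.1, b = 10.0, c = -1.0;
    printf("two roundings: %.17g\n", a * b + c);    /* typically 0 */
    printf("one rounding:  %.17g\n", fma(a, b, c)); /* about 5.55e-17 */
    return 0;
}
```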
The fused multiply–add operation was introduced as multiply–add fused in the IBM POWER1 (1990) processor, but has been added to numerous other processors since then:

Hewlett-Packard PA-8000 (1996) and above

Hitachi SuperH SH-4 (1998)

SCE/Toshiba Emotion Engine (1999)

Intel Itanium (2001)

STI Cell (2006)

Fujitsu SPARC64 VI (2007) and above

(MIPS-compatible) Loongson-2F (2008)

Elbrus-8SV (2018)

x86 processors with FMA3 and/or FMA4 instruction set

AMD Bulldozer (2011, FMA4 only)

AMD Steamroller (2014)

AMD Excavator (2015)

AMD Zen (2017, FMA3 only)

Intel Haswell (2013, FMA3 only)

Intel Skylake (2015, FMA3 only)

ARM processors with VFPv4 and/or NEONv2:

ARM Cortex-M4F (2010)

ARM Cortex-A5 (2012)

ARM Cortex-A7 (2013)

ARM Cortex-A15 (2012)

Qualcomm Krait (2012)

Apple A6 (2012)

All ARMv8 processors

GPUs and GPGPU boards:

Advanced Micro Devices GPUs (2009) and newer

TeraScale 2 "Evergreen"-series based

Graphics Core Next-based

Nvidia GPUs (2010) and newer

Fermi-based (2010)

Kepler-based (2012)

Maxwell-based (2014)

Pascal-based (2016)

Volta-based (2017)

Intel GPUs since Sandy Bridge

Intel MIC (2012)

ARM Mali-T600 Series (2012) and above