X86 Assembly/AVX, AVX2, FMA3, FMA4

Prerequisites: X86 Assembly/SSE.

Example FMA4 program

The following program demonstrates the FMA4 instruction vfmaddps, which, when used with the 256-bit ymm registers, performs eight single-precision floating-point multiply-add operations in a single instruction.

.data
        #          2^-1  2^-2  2^-3   2^-4    2^-5     2^-6      2^-7       2^-8
        v1: .float 0.50, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625
        v2: .float 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0
        v3: .float 512.0, 1024.0, 2048.0, 8192.0, 16384.0, 32768.0, 65536.0, 131072.0
        v4: .float 0,0,0,0,0,0,0,0
.text
.globl _start
        _start:
        vmovups v1,%ymm0
        vmovups v2,%ymm1
        vmovups v3,%ymm2
        #        addend + multiplicand1   * multiplicand2   = destination
        vfmaddps %ymm0,   %ymm1,            %ymm2,            %ymm3
        vmovups %ymm3, v4

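If you set a debugger breakpoint just after the final vmovups instruction, you can use GDB to inspect the result in v4. Note that the program has no exit sequence, so it should not be allowed to run past that point. Before reading on, look at the program and try to spot any problems.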

Spoiler alert: dumping the result vector v4 in binary shows that precision has been lost.

(gdb) x/8t &v4
0x80490fc <v4>:    01000100100000000001000000000000	01000101100000000000001000000000	01000110100000000000000001000000	01001000000000000000000000000100
0x804910c <v4+16>: 01001001000000000000000000000000	01001010000000000000000000000000	01001011000000000000000000000000	01001100000000000000000000000000
(gdb)

Comparing the elements at v4+12 and v4+16, one can see that the addend became too small and was lost. The addend was only halved between these two elements (from 0.0625 to 0.03125), so why is it gone? The reason is that the exponent of the result changed as well: the product grew from 131072 (2^17) to 524288 (2^19). Relative to the mantissa, the addend therefore moved three bit positions to the right instead of one, which puts it one position below the least significant mantissa bit, where a 32-bit single-precision float cannot represent it. The data loss is also visible when dumping the floats in their base-10 representations, but one must be careful, because the base-10 representation isn't always faithful: GDB prints the element at v4+12 as 131072.062 in the dump below, although the stored value is exactly 131072.0625.
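The effect described above can be checked numerically. The following is a minimal C sketch and not part of the original program. The helper names ulp_of and addend_survives are made up for illustration, and the code assumes that float is an IEEE 754 32-bit type and that the default round-to-nearest mode is in effect. ulp_of returns the distance from a positive float to the next larger one, and addend_survives reports whether an addend changes the value it is added to.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Distance from a positive, finite float to the next larger float,
   i.e. the weight of its least significant mantissa bit.
   ASSUMPTION: float is an IEEE 754 binary32 value. */
float ulp_of(float x)
{
    uint32_t bits;
    float next;

    assert(x > 0.0f);                  /* sketch handles positive inputs only */
    memcpy(&bits, &x, sizeof bits);    /* read the raw bit pattern */
    bits += 1;                         /* bit pattern of the next float above x */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}

/* Returns 1 if product + addend, rounded to float, differs from product,
   and 0 if the addend is rounded away. */
int addend_survives(float product, float addend)
{
    volatile float sum = product + addend;  /* volatile forces rounding to float */
    return sum != product;
}
```

For the two elements discussed above, ulp_of(131072.0f) is 0.015625 (2^-6), so an addend of 0.0625 fits into the mantissa. ulp_of(524288.0f) is 0.0625 (2^-4), and the addend 0.03125 is exactly half of that, so it is rounded away.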

(gdb) x/8f &v4
0x80490fc <v4>:    1024.5	4096.25	16384.125	131072.062
0x804910c <v4+16>: 524288	2097152	8388608		33554432
(gdb)
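FMA4 is implemented only on certain AMD processors, so it can be handy to reproduce the arithmetic without the instruction. The following C sketch models the eight lanes with ordinary scalar code. The function name fmadd8_model is hypothetical, and the model forms each product and sum in double before rounding to float once. That matches a true fused multiply-add only when the double computation is exact, which is the case for the power-of-two inputs used here but not in general.

```c
#include <stddef.h>

/* Same input data as the assembly listing. */
const float v1[8] = {0.50f, 0.25f, 0.125f, 0.0625f,
                     0.03125f, 0.015625f, 0.0078125f, 0.00390625f};
const float v2[8] = {2.0f, 4.0f, 8.0f, 16.0f, 32.0f, 64.0f, 128.0f, 256.0f};
const float v3[8] = {512.0f, 1024.0f, 2048.0f, 8192.0f,
                     16384.0f, 32768.0f, 65536.0f, 131072.0f};

/* Scalar model of an 8-lane multiply-add: dst[i] = a[i] * b[i] + c[i].
   ASSUMPTION: the double intermediate is exact for the inputs above,
   so the conversion to float is the only rounding step. */
void fmadd8_model(const float *a, const float *b, const float *c, float *dst)
{
    for (size_t i = 0; i < 8; i++) {
        double exact = (double)a[i] * (double)b[i] + (double)c[i];
        dst[i] = (float)exact;   /* single rounding to single precision */
    }
}
```

Calling fmadd8_model(v2, v3, v1, v4) on an eight-element float array v4 should give the same values as the GDB dump. Printing the fourth element with the printf format "%.4f" shows 131072.0625, the digit that the GDB output cuts off.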

Resources
