Floating Point

Storage overview

We can think of a floating point number as having the form:

\[\mbox{significand} \times 10^\mbox{exponent}\]

Most computers follow the IEEE 754 standard for floating point, and we commonly work with 32-bit and 64-bit floating point numbers (single and double precision). These bits are split between the signifcand and exponent as well as a single bit for the sign:


Fig. 12 (source: wikipedia)

Since the number is stored in binary, we can think about expanding a number in powers of 2:

\[0.1 \sim (1 + 1 \cdot 2^{-1} + 0 \cdot 2^{-2} + 0 \cdot 2^{-3} + 1 \cdot 2^{-4} + 1 \cdot 2^{-5} + \ldots) \times 2^{-4}\]

In fact, 0.1 cannot be exactly represented in floating point:

Listing 52 simple_roundoff.cpp
#include <iostream>
#include <iomanip>

int main() {

    double a = 0.1;

    std::cout << std::setprecision(19) << a << std::endl;



With 52 bits for the significand, the smallest number compared to 1 we can represent is

\[2^{-52} \approx 2.22\times 10^{-16}\]

but the IEEE 754 format always expresses the significant such that the first bit is 1 (by adjusting the exponent) and then doesn’t need to store that 1, giving us an extra bit of precision, so the machine epsilon is

\[2^{-53} \approx 1.11\times 10^{-16}\]

We already saw how to access the limits of the data type via std::numeric_limits. When we looked at machine epsilon, we saw that for a double it was about \(1.1\times 10^{-16}\).

Note that this is a relative error, so for a number like 1000 we could only add 1.1e-13 to it before it became indistinguishable from 1000.

\[\mbox{relative roundoff error} = \frac{|\mbox{true number} - \mbox{computer representation} |} {|\mbox{true number}|} \le \epsilon\]

Note that there are varying definitions of machine epsilon which differ by a factor of 2.


Now consider the exponent, we use 11 bits to store it in double precision. Two are reserved for special numbers, so out of the 2048 possible exponent values, one is 0, and 2 are reserved, leaving 2045 to split between positive and negative exponents. These are set as:

\[2^{-1022} \mbox{ to } 2^{1023}\]

converting to base 10, this is

\[\sim 10^{-308} \mbox{ to } \sim 10^{308}\]

Reporting values

We can use std::numeric_limits<double> to query these floating point properties:

Listing 53 limits.cpp
#include <iostream>
#include <limits>

int main() {

    std::cout << "maximum double = " << std::numeric_limits<double>::max() << std::endl;
    std::cout << "maximum double base-10 exponent = " << std::numeric_limits<double>::max_exponent10 << std::endl;
    std::cout << "smallest (abs) double = " << std::numeric_limits<double>::min() << std::endl;
    std::cout << "minimim double base-10 exponent = " << std::numeric_limits<double>::min_exponent10 << std::endl;
    std::cout << "machine epsilon (double) = " << std::numeric_limits<double>::epsilon() << std::endl;


Roundoff vs. truncation error

Consider the Taylor expansion of \(f(x)\) about some point \(x_0\):

\[f(x) = f(x_0 + \Delta x) = f(x_0) + \left . \frac{df}{dx} \right |_{x_0} \Delta x + \mathcal{O}(\Delta x^2)\]

where \(\Delta x = x - x_0\)

We can solve for the derivative to find an approximation for the first derivative:

\[\left . \frac{df}{dx} \right |_{x_0} = \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x} + \mathcal{O}(\Delta x)\]

This shows that this approximation for the derivative is first-order accurate in \(\Delta x\) – that is the truncation error of the approximation.

We can see the relative size of roundoff and truncation error by using this approximation to compute a derivative for different values of \(\Delta x\):

Listing 54 truncation_vs_roundoff.cpp
#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>
#include <vector>

double f(double x) {
    return std::sin(x);

double dfdx_true(double x) {
    return std::cos(x);

struct point {
    double dx;
    double err;

int main() {

    double dx = 0.1;
    double x0 = 1.0;

    std::vector<point> data;

    while (dx > std::numeric_limits<double>::epsilon()) {

        point p;
        p.dx = dx;

        double dfdx_approx = (f(x0 + dx) - f(x0)) / dx;
        double err = std::abs(dfdx_approx - dfdx_true(x0));

        p.err = err;


        dx /= 2.0;

    std::cout << std::setprecision(8) << std::scientific;
    for (auto p : data) {
        std::cout << std::setw(10) << p.dx << std::setw(15) << p.err << std::endl;

It is easier to see the behavior if we make a plot of the output:


Let’s discuss the trends:

  • Starting with the largest value of \(\Delta x\), as we make \(\Delta x\) smaller, we see that the error decreases. This is following the expected behavior of the truncation error derived above.

  • Once our \(\Delta x\) becomes really small, roundoff error starts to dominate. In effect, we are seeing that:

    \[(x_0 + \Delta x) - x_0 \ne 0\]

    because of roundoff error.

  • The minimum error here is around \(\sqrt{\epsilon}\), where \(\epsilon\) is machine epsilon.

Testing for equality

Because of roundoff error, we should never exactly compare two floating point numbers, but instead ask they they agree within some tolerance, e.g., test equality as:

\[| x - y | < \epsilon\]

For example:

Listing 55 comparing.cpp
#include <iostream>

int main() {

    double h{0.01};
    double sum{0.0};

    for (int n = 0; n < 100; ++n) {
        sum += h;

    std::cout << "sum == 1:        " << (sum == 1.0) << std::endl;
    std::cout << "|sum - 1| < tol: " << (std::abs(sum - 1.0) < 1.e-10) << std::endl;


Minimizing roundoff

Consider subtracting the square of two numbers – taking the difference of two very close-in-value numbers is a prime place where roundoff can come into play.

Instead of doing:

\[x^2 - y^2\]

we can instead do:

\[(x - y)(x + y)\]

by factoring this, we are subtracting more reasonably sized numbers, reducing the roundoff.

We can see this directly by doing this with single precision (float) and comparing to an answer computed via double precious (double)

Here’s an example:

Listing 56 subtraction.cpp
#include <iostream>

int main() {

    float x{1.000001e15};
    float y{1.0000e15};
    std::cout << "x^2 - y^2 :      " << x*x - y*y << std::endl;
    std::cout << "(x - y)(x + y) : " << (x - y) * (x + y) << std::endl;

    double m{1.000001e15};
    double n{1.0000e15};

    std::cout << "double precision value: " << (m - n) * (m + n) << std::endl;


As another example, consider computing 1:

\[\sqrt{x + 1} - \sqrt{x}\]

We can alternately rewrite this to avoid the subtraction of two close numbers:

\[\sqrt{x + 1} - \sqrt{x} = (\sqrt{x + 1} - \sqrt{x}) \left ( \frac{\sqrt{x+1} + \sqrt{x}}{\sqrt{x+1} + \sqrt{x}} \right ) = \frac{1}{\sqrt{x+1} + \sqrt{x}}\]

Again we’ll compare a single-precision calculation using each of these methods to a double precision “correct answer”. To ensure that we use the single-precision version of the std::sqrt() function, we will use single precision literal suffix, e.g., 1.0f tells the compiler that this is a single-precision constant.

Listing 57 squareroots.cpp
#include <iostream>
#include <iomanip>

#include <cmath>

int main() {

    float x{201010.0f};
    std::cout << std::setprecision(10);

    float r1 = std::sqrt(x + 1.0f) - std::sqrt(x);

    float r2 = 1.0f / (std::sqrt(x + 1.0f) + std::sqrt(x));

    std::cout << "r1 = " << r1 << " r2 = " << r2 << std::endl;
    double xd{201010.0};

    double d = std::sqrt(xd + 1.0) - std::sqrt(xd);

    std::cout << "d = " << d << std::endl;

Notice that we get several more significant digits correct when we compute it with the second formulation compared to the original form.

Summation algorithms

Summing a sequence of numbers is a common place where roundoff error comes into play, especially if the numbers all vary in magnitude and you do not attempt to add them in a sorted fashion. There are a number of different summation algorithms that keep track of the loss due to roundoff and compensate the sum, for example the Kahan summation algorithm.

Special numbers

IEEE 754 defines a few special quantities:

  • NaN (not a number) is the result of 0.0/0.0 or std::sqrt(-1.0)

  • Inf (infinity) is the result of 1.0/0.0

  • -0 is a valid number and the standard says that -0 is equivalent to 0

Trapping floating point exceptions

What happens when we do something bad? Consider this example:

Listing 58 undefined.cpp
#include <iostream>
#include <cmath>

double trouble(double x) {
    return std::sqrt(x);

int main() {

    double x{-1};

    double y = trouble(x);
    for (int i = 0; i < 10; ++i) {
        y += std::pow(x, i);

    std::cout << y << std::endl;


Here, we pass -1 to trouble() which then takes the square root of it – this results in a NaN. But if we run the code, it goes merrily about its way, using that result in the later computations.

Unix uses signals to indicate that a problem has happened during the code execution. If a program created a signal handler then that signal can be trapped and any desired action can be taken.


This example was only tested on a Linux machine with GCC. Other OSes or compilers might have slightly different headers or functionality.

There are a few parts to trapping a floating point exception (FPE). First we need to enable exception trapping via:


That catches 3 different types of floating point exceptions – invalid, divide-by-zero, and overflows.

Next we need to add a handler to deal with the exception:

signal(SIGFPE, fpe_handler);

Here, SIGFPE is the standard name for a floating point exception, and fpe_handler is the name of a function that will be called when we detect a SIGFPE.

In our handler, we use the Linux backtrace() function to access the stack of our program execution. This is really a C-function, so we need to use C-style arrays here.

Here’s the new version of our code:

Listing 59 undefined_trap.cpp
#include <iostream>
#include <cmath>
#include <csignal>
#include <cfenv>
#include <execinfo.h>

void fpe_handler(int s) {
    std::cout << "floating point exception, signal " << s << std::endl;
    const int nbuf = 64;                                                                                                  
    void *bt_buffer[nbuf];                                                                                                
    int nentries = backtrace(bt_buffer, nbuf); 
    char **strings = backtrace_symbols(bt_buffer, nentries);

    for (int i = 0; i < nentries; ++i) {
        std::cout << i << ": " << strings[i] << std::endl;


double trouble(double x) {
    return std::sqrt(x);

int main() {


    signal(SIGFPE, fpe_handler);

    double x{-1};

    double y = trouble(x);
    for (int i = 0; i < 10; ++i) {
        y += std::pow(x, i);

    std::cout << y << std::endl;


When we compile the code, we want to add the -g option to store the symbols in the code – this allows us to understand where problems arise:

g++ -g -o undefined_trap undefined_trap.cpp

Now when we run this, the program aborts and we see:

floating point exception, signal 8
0: ./undefined_trap() [0x401261]
1: /lib64/libc.so.6(+0x42750) [0x7f3dc35dc750]
2: /lib64/libm.so.6(+0x1435c) [0x7f3dc37d335c]
3: ./undefined_trap() [0x4012ff]
4: ./undefined_trap() [0x401347]
5: /lib64/libc.so.6(+0x2d560) [0x7f3dc35c7560]
6: /lib64/libc.so.6(__libc_start_main+0x7c) [0x7f3dc35c760c]
7: ./undefined_trap() [0x401145]
Aborted (core dumped)

This is the call stack for our program. In the brackets are the address in the program where the execution was when the FPE occurred. These are ordered such that the calling function is below the function where the execution is. So it usually is best to look at the addresses near the top.

We can turn those into line numbers using addr2line:

addr2line  -e undefined_trap 0x4012ff



and that line is precisely where the sqrt() is!

We can get slightly nicer output (including the function name) by doing:

addr2line  -C -f -i -p -e undefined_trap 0x4012ff

which gives:

trouble(double) at /home/zingale/classes/phy504/examples/floating_point/undefined_trap.cpp:23


On the MathLab machines, the stack trace seems to include an offset, like:

floating point exception, signal 8
0: ./undefined_trap(+0xc03) [0x561d8799dc03]
1: /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10) [0x7f5461e3df10]
2: /lib/x86_64-linux-gnu/libm.so.6(+0x11397) [0x7f5462201397]
3: ./undefined_trap(+0xccf) [0x561d8799dccf]
4: ./undefined_trap(+0xd21) [0x561d8799dd21]
5: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f5461e20c87]
6: ./undefined_trap(+0xaaa) [0x561d8799daaa]
Aborted (core dumped)

and we need to use that offset instead with addr2line, like:

addr2line -a -f -e ./undefined_trap +0xcd5

this example is based on Yakowitz & Szidarovszky