Fixed Point Mathematics

Overview

Fixed-point mathematics is a method for representing fractional numbers on a binary computer architecture. It allows the storage of numbers with a fractional part, similar to float and double, but with the benefit of typically requiring less computation time. The trade-off is lower precision and flexibility.

A standard notation for fixed-point numbers is to represent the type by fp(x, y), where x is the number of bits to the left of the binary point (the base-2 equivalent of the decimal point), and y is the number of bits to the right of it.

If you think about it, normal integers are just a special case of a fixed-point number where the binary point is to the right of the least-significant bit. A 32-bit unsigned integer (uint32) would be fp(32, 0). It is common to use the bit-width of the architecture as the default fixed-point length, as the architecture has native support for manipulating variables of that size (commonly 32-bit on the more powerful embedded microcontrollers).

Unfortunately, low-level languages like C and C++ do not have native support for fixed-point mathematics (although there are many third-party libraries out there!). C++ has an advantage over C in that it supports operator overloading, so a fixed-point library can let you multiply or divide two fixed-point numbers with the usual ‘*’ and ‘/’ operators, just like native number types (in C you would have to use functions or macros).

Notation

Q is a number format used to describe fixed-point numbers. It uses the form:

$$ Qi.f $$

where:
\(i\) = number of integer bits
\(f\) = number of fractional bits

For example, Q24.8 would represent a 32-bit fixed-point number with 24 integer bits and 8 fractional bits.

The Range Of Fixed-Point Numbers

The range of an unsigned fixed-point number with \(i\) bits for the integer part and \(f\) bits for the fractional part is:

$$ 0 \textrm{ to } (2^i - 1) + 2^{-f} \times (2^{f} - 1) $$

For example, an 8-bit fixed-point number with 5 bits for the integer and 3 bits for the fractional part (Q5.3) would have a range from 0 to 31.875.
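The range formula can be checked with a few lines of C++. This is only a sketch: maxUnsignedValue is a hypothetical helper name (not from any library), and it uses the fact that the formula above simplifies to \(2^i - 2^{-f}\).

```cpp
#include <cassert>
#include <cmath>

// Largest value an unsigned Qi.f number can hold (all bits set).
// Hypothetical helper for illustration only.
double maxUnsignedValue(int intBits, int fracBits) {
    // Keep the total width within a double's 53-bit mantissa so the
    // result is exact.
    assert(intBits >= 0 && fracBits >= 0 && intBits + fracBits <= 53);
    // (2^i - 1) + 2^-f * (2^f - 1) simplifies to 2^i - 2^-f.
    return std::ldexp(1.0, intBits) - std::ldexp(1.0, -fracBits);
}
```

For Q5.3 this gives 32 - 0.125 = 31.875, matching the example above.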

The Precision Of Fixed-Point Numbers

The precision of a fixed-point number is determined solely by the number of fractional bits. The precision is equal to:

$$ 2^{-f} $$

For example, the precision of a \(Q3.5\) fixed-point number would be \(2^{-5} = 0.03125\). The precision of a \(Q8.0\) number (no fractional bits) would be \( 2^{-0} = 1\), as expected.
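To see what this precision means in practice, the sketch below computes the step size for a given number of fractional bits and snaps a value to the nearest representable step. Both function names are made up for this example, and rounding to nearest is an assumption (a real implementation might truncate instead).

```cpp
#include <cassert>
#include <cmath>

// Step size between adjacent values with `fracBits` fractional bits: 2^-f.
double resolution(int fracBits) {
    assert(fracBits >= 0);
    return std::ldexp(1.0, -fracBits);
}

// Nearest value to x that is a whole multiple of the step size.
// Range limits (the number of integer bits) are ignored here.
double quantize(double x, int fracBits) {
    assert(fracBits >= 0);
    // Scale up by 2^f, round to an integer, scale back down.
    return std::ldexp(std::round(std::ldexp(x, fracBits)), -fracBits);
}
```

With 5 fractional bits, 0.1 cannot be stored exactly; the nearest step is 3/32 = 0.09375. When rounding to nearest, the error is never more than half the step size.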

Converting To Fixed-Point

Integer To Fixed Point Number

Converting from an integer to a fixed-point number is easy. All you need to do is left-shift the integer by the number of fractional bits.

The rawVal_ class variable is how the fixed-point number is stored in memory.
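A minimal sketch of this conversion is shown below. Only the name rawVal_ comes from the text above; the class name Fp32, the Q24.8 format and the helper functions are assumptions made for illustration, not the API of any particular library. It multiplies by \(2^f\) rather than shifting, which gives the same result but is also well defined for negative integers in older C++ standards.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical signed Q24.8 type, for illustration only.
class Fp32 {
public:
    static constexpr int kFracBits = 8;  // assumed number of fractional bits

    // Integer -> fixed point.
    explicit Fp32(int32_t i) : rawVal_(fromInt(i)) {}

    // The underlying stored value.
    int32_t raw() const { return rawVal_; }

    // Fixed point -> integer, truncating toward zero.
    int32_t toInt() const { return rawVal_ / (int32_t{1} << kFracBits); }

private:
    static int32_t fromInt(int32_t i) {
        // The integer must fit in the 24 integer bits (including sign).
        assert(i >= -(int32_t{1} << (31 - kFracBits)) &&
               i < (int32_t{1} << (31 - kFracBits)));
        // Equivalent to i << kFracBits.
        return i * (int32_t{1} << kFracBits);
    }

    int32_t rawVal_;  // how the fixed-point number is stored in memory
};
```

With 8 fractional bits, the integer 3 is stored as the raw value 768 (3 × 256).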

Double To Fixed Point Number

Converting from a double to a fixed-point number is not much harder! We just need to multiply the double by a scaling factor, where the scaling factor is defined as 1 << (number of fractional bits), i.e. 2 raised to the power of the number of fractional bits.
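As a rough sketch, assuming a signed Q24.8 format stored in an int32_t (the function names and the constant kFracBits are made up for this example):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kFracBits = 8;  // assumed Q24.8 format

// Double -> raw fixed-point value: multiply by the scaling factor 2^f.
// Rounding to nearest with std::lround is a choice made for this sketch;
// a plain cast would truncate toward zero instead.
int32_t doubleToRaw(double d) {
    const double scaled = d * (1 << kFracBits);
    assert(scaled >= INT32_MIN && scaled <= INT32_MAX);  // must fit in 32 bits
    return static_cast<int32_t>(std::lround(scaled));
}

// Raw fixed-point value -> double: divide by the same scaling factor.
double rawToDouble(int32_t raw) {
    return static_cast<double>(raw) / (1 << kFracBits);
}
```

Here 1.5 becomes the raw value 384 (1.5 × 256). A value such as 0.1 cannot be stored exactly: it becomes 26, which converts back to 0.1015625.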

 

Adding/Subtracting Fixed-Point Numbers

If the numbers have the same precision (the same number of fractional bits), they can be added or subtracted directly without any manipulation. Be wary of overflow though! To guarantee that an addition cannot overflow, the resultant fixed-point number has to have one more integer bit than the inputs.

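If the numbers have different precision, they must be converted to the same precision before adding. Either number can be converted to the precision of the other by bit shifting, but you must be aware that information could be lost in the process: shifting right discards the least-significant fractional bits, while shifting left can push the most-significant integer bits out of the variable.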
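The sketch below adds an unsigned Q24.8 number to an unsigned Q16.16 number and returns the result as Q24.8. The function name and the choice of formats are assumptions for illustration. The Q16.16 operand is shifted right to match, so its 8 lowest fractional bits are thrown away.

```cpp
#include <cassert>
#include <cstdint>

// Add a Q24.8 value and a Q16.16 value (both unsigned, raw representation),
// returning a Q24.8 result. Hypothetical helper for illustration only.
uint32_t addQ24_8AndQ16_16(uint32_t aQ24_8, uint32_t bQ16_16) {
    // Bring b down to 8 fractional bits; the lowest 8 bits are lost.
    const uint32_t bQ24_8 = bQ16_16 >> (16 - 8);
    // Sum in 64 bits so that an overflow of the 32-bit result is detectable.
    const uint64_t sum = static_cast<uint64_t>(aQ24_8) + bQ24_8;
    assert(sum <= UINT32_MAX);
    return static_cast<uint32_t>(sum);
}
```

For example, 1.5 in Q24.8 (raw 384) plus 2.25 in Q16.16 (raw 147456) gives raw 960, which is 3.75 in Q24.8. Adding the smallest Q16.16 value (raw 1) changes nothing, because it is below the precision of Q24.8.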

Multiplying Fixed-Point Numbers

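Multiplying two fixed-point numbers requires a standard integer multiplication followed by a division by the scaling factor (which happens to be an easy bit shift in code). This is because the product of two numbers with \(f\) fractional bits each has \(2f\) fractional bits, and has to be scaled back down to \(f\). The only exception is when one of the numbers has no fractional part (e.g. a standard int32_t, but no one really treats that as a “fixed-point” number anyway), in which case no division is needed.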

Because the intermediate result of multiplying the two numbers together is larger than the end result, care has to be taken to make sure that it does not overflow. One way to do this is to cast the inputs to a data type twice as wide (in terms of bits) before multiplying. If using int32_t fixed-point numbers, this is possible with a cast to int64_t, which most compilers support (including those used for embedded targets, such as GCC).
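A minimal sketch of this approach for two signed Q24.8 numbers is shown below. The function name and kFracBits are assumptions for this example. It divides by \(2^f\) instead of right-shifting so that negative products behave the same on every compiler (the division truncates toward zero).

```cpp
#include <cassert>
#include <cstdint>

constexpr int kFracBits = 8;  // assumed Q24.8 format

// Multiply two raw Q24.8 values, returning a raw Q24.8 result.
int32_t mulQ24_8(int32_t a, int32_t b) {
    // Widen before multiplying: the 64-bit product cannot overflow and
    // has 2 * kFracBits fractional bits.
    const int64_t wide = static_cast<int64_t>(a) * b;
    // Scale back down to kFracBits fractional bits.
    const int64_t scaled = wide / (int64_t{1} << kFracBits);
    // The final result must still fit in 32 bits.
    assert(scaled >= INT32_MIN && scaled <= INT32_MAX);
    return static_cast<int32_t>(scaled);
}
```

For example, 1.5 × 2.25 is 384 × 576 = 221184 in raw form, which scales down to 864, i.e. 3.375. Multiplying 1000.0 by 1000.0 (raw 256000 each) produces an intermediate value far too large for an int32_t, yet the final result (raw 256000000) fits comfortably.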

Embedded C++ Fixed-Point Library

I have written an embedded C++ fixed-point library. It is freely available for download from GitHub here.

External Resources

See Fixed-Point Representation and Fractional Math by Erick L. Oberstar.

Tonc: Fixed Point Numbers And LUTs is a good tutorial on fixed-point numbers and how they are implemented in software and hardware.

From The Book of Hook, An Introduction To Fixed Point Math is another great resource.

Posted: October 23rd, 2012 at 5:16 pm
Last Updated on: January 11th, 2018 at 8:14 am