| .. _blog-04-fixed-point: |
| |
| ======================================================================== |
| Pigweed Blog #4: Fixed-point arithmetic as a replacement for soft floats |
| ======================================================================== |
| *Published on 2024-09-03 by Leonard Chan* |
| |
| Fixed-point arithmetic is an alternative to floating-point arithmetic for |
| representing fractional data values (0.5, -1.125, 10.75, etc.). Fixed-point |
| types can be represented as :ref:`scaled integers <blog-04-fixed-point-intro>`. |
| The advantage here is many |
| arithmetic operations (+, -, \*, /, etc.) can be implemented as normal integral |
| instructions. This can be useful for embedded systems or other applications |
| where floating-point operations can be too expensive or not available. |
| Clang has support for fixed-point types according to `ISO TR 18037 <https://standards.iso.org/ittf/PubliclyAvailableStandards/c051126_ISO_IEC_TR_18037_2008.zip>`_. |
| I work on a Google project that uses Pigweed. |
| Recently, we replaced usage of floats in an embedded project running |
| on Cortex M0+. This MCU doesn't provide any hardware support for floating-point |
| operations, so floating-point operations are implemented in software. We’ve |
| observed a **~2x speed improvement** in |
| classification algorithms and a **small decrease in binary size** using fixed-point |
| **without sacrificing correctness**. |
| |
| All Clang users can benefit from this by adding ``-ffixed-point`` as a |
| compilation flag. All of the types and semantics described |
| in ISO 18037 are supported. |
| |
| Problems with (soft) floats |
| =========================== |
| One of the ways to represent a decimal number is to use a significand, base, |
| and exponent. This is how floating-point numbers according to IEEE 754 are |
| represented. The base is two, but the significand and exponent are adjustable. |
| |
| When doing arithmetic with floats, many operations are needed to align the |
| exponents of the operands and normalize the result. Most modern computers have |
| Floating-Point Units (`FPUs <https://en.wikipedia.org/wiki/Floating-point_unit>`_) that do this arithmetic very quickly, but not all |
| platforms (especially embedded ones) have this luxury. The alternative is to |
| emulate floating-point arithmetic in software which can be costly. |
| |
| .. _blog-04-fixed-point-intro: |
| |
| ---------------------------------------------- |
| Intro to fixed-point types and scaled integers |
| ---------------------------------------------- |
| `ISO TR 18037 <https://standards.iso.org/ittf/PubliclyAvailableStandards/c051126_ISO_IEC_TR_18037_2008.zip>`_ |
| describes 12 fixed-point types with varying scales and ranges: |
| |
| * ``unsigned short _Fract`` |
| * ``unsigned _Fract`` |
| * ``unsigned long _Fract`` |
| * ``signed short _Fract`` |
| * ``signed _Fract`` |
| * ``signed long _Fract`` |
| * ``unsigned short _Accum`` |
| * ``unsigned _Accum`` |
| * ``unsigned long _Accum`` |
| * ``signed short _Accum`` |
| * ``signed _Accum`` |
| * ``signed long _Accum`` |
| |
| The scale represents the number of fractional bits in the type. Under the hood, |
| Clang implements each of these types as integers. To get the decimal value of |
| the type, we just divide the integer by the scale. For example, on x86_64 |
| Linux, an ``unsigned _Accum`` type is represented as an unsigned 32-bit integer |
| with a scale of 16, so 16 bits are used to represent the fractional part and |
| the remaining 16 are used to represent the integral part. An ``unsigned _Accum`` |
| type with a value of ``16.625`` would be represented as ``1089536`` because |
| ``1089536 / 2**16 = 16.625``. Additionally, an ``unsigned _Accum`` type has a range |
| of ``[0, 65535.99998474121]``, where the max value is represented as ``2^32-1``. |
| The smallest possible value we can represent with a fixed-point type is |
| ``1/2^scale``. |
| |
| The width and scale of each type are platform-dependent. In Clang, different |
| targets are free to provide whatever widths |
| and scales are best-suited to their needs and capabilities. Likewise, |
| LLVM backends are free to lower the different fixed-point intrinsics to native |
| fixed-point operations. The default configuration for all targets is the |
| “typical desktop processor” implementation described in A.3 of ISO TR 18037. |
| |
| Each of these types has a saturating equivalent also (denoted by the ``_Sat`` |
| keyword). This means operations on a saturating type that would normally cause |
| overflow instead clamps to the maximal or minimal value for that type. |
| |
| Fixed-point to the rescue |
| ========================= |
| One of the main advantages of fixed-point types is their operations can be done |
| with normal integer operations. Addition and subtraction between the same fixed-point |
| types normally require just a regular integral ``add`` or ``sub`` instruction. |
| Multiplication and division can normally be done with a regular ``mul`` and ``div`` |
| instruction (plus an extra shift to account for the scale). Platforms that |
| don’t support hard floats normally need to make a call to a library function |
| that implements the floating-point equivalents of these which can be large or |
| CPU-intensive. |
| |
| Fixed-point types however are limited in their range and precision which are |
| dependent on the number of integral and fractional bits. A ``signed _Accum`` |
| (which uses 15 fractional bits, 16 integral bits, and 1 sign bit on x86_64 |
| Linux) will have a range of ``[-65536, 65535.99996948242]`` and a precision of |
| ``3.0517578125e-05``. Floats can effectively range from ``(-∞, +∞)`` and have |
| precision as small as :math:`10^{-38}`. Maintaining correctness for your program |
| largely depends on how much range and precision you intend your fractional |
| numbers to have. For addressing ranges, there are sometimes a couple of clever |
| math tricks you can do to get large values to fit within these ranges. Some of |
| these methods will be discussed later. |
| |
| .. list-table:: Fixed-Point vs Floating-Point |
| :widths: 10 45 45 |
| :header-rows: 1 |
| :stub-columns: 1 |
| |
| * - |
| - Fixed-Point |
| - Floating-Point |
| * - Range |
| - Limited by integral and fractional bits |
| - +/-∞ |
| * - Precision |
| - Limited by the fractional bits (:math:`1/2^{scale}`) |
| - :math:`10^{-38}` |
| * - Binary Operations |
| - | Typically fast |
| | Usually just involves normal integral binary ops with some shifts |
| - | *Hard floats* - Typically fast |
| | *Soft floats* - Can be very slow/complex; depends on the implementation |
| * - Correctness |
| - | Always within 1 ULP of the true result |
| | Will always produce the same result for a given op |
| - | Always within 0.5 ULP of true result |
| | May not always produce the same result for a given op due to rounding errors |
| |
| --------------------- |
| Running on Cortex-M0+ |
| --------------------- |
| The Arm Cortex-M0+ processor is a common choice for constrained embedded |
| applications because it is very energy-efficient. Because it doesn’t have an |
| FPU, many floating-point operations targeting this CPU depend on a helper |
| `library <https://github.com/ARM-software/abi-aa/blob/7c2fbbd6d32c906c709804f873b67308d1ab46e2/rtabi32/rtabi32.rst#the-standard-compiler-helper-function-library>`_ |
| to define these as runtime functions. For our application, the code |
| running on this processor runs different classification algorithms. One such |
| classifier utilizes `softmax regression <http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/>`_. |
| The algorithm is roughly: |
| |
| .. code-block:: c++ |
| |
| // This is roughly a trimmed down version of the softmax regression |
| // classifier. |
| size_t Classify(std::span<const float> sample) const { |
| // 1. Compute activations for each category. |
| // All activations are initially zero. |
| std::array<float, NumCategories> activations{}; |
| for (size_t i = 0; i < NumCategories; i++) { |
| for (size_t j = 0; j < sample.size(); j++) { |
| activations[i] += coefficients_[i][j] * sample[j]; |
| } |
| activations[i] += biases_[i]; |
| } |
| float max_activation = *std::max_element(activations.begin(), |
| activations.end()); |
| |
| // 2. Get e raised to each of these activations. |
| std::array<T, NumCategories> exp_activations; |
| for (size_t i = 0; i < NumCategories; i++) { |
| exp_activations[i] = expf(activations[i]); |
| } |
| float sum_exp_activations = std::accumulate(exp_activations.begin(), |
| exp_activations.end(), 0); |
| |
| // 3. Compute each likelihood which us exp(x[i]) / sum(x). |
| float max_likelihood = std::numeric_limits<float>::min(); |
| size_t prediction; |
| for (size_t i = 0; i < NumCategories; i++) { |
| // 0 <= result.likelihoods[i] < 1 |
| result.likelihoods[i] = exp_activations[i] / sum_exp_activations; |
| if (result.likelihoods[i] > max_likelihood) { |
| max_likelihood = result.likelihoods[i]; |
| prediction = i; // The highest likelihood is the prediction. |
| } |
| } |
| |
| return prediction; // Return the index of the highest likelihood. |
| } |
| |
| Many of these operations involve normal floating-point comparison, addition, |
| multiplication, and division. Each of these expands to a call to some |
| ``__aeabi_*`` function. This particular function is on a hot path that |
| activates, meaning soft float operations take up much computation and power. |
| For the normal binary operations, fixed-point types might be a good fit because |
| it’s very likely each of the binary operations will result in a value that |
| can fit into one of the fixed-point types. (Each of |
| the sample values is in the range ``[-256, +256]`` and there are at most 10 |
| categories. Likewise, each of the ``coefficients_`` and ``biases_`` are small values |
| ranging from roughly ``(-3, +3)``.) |
| |
| If we wanted to completely replace floats for this function with fixed-point |
| types, we’d run into two issues: |
| |
| #. There doesn’t exist an ``expf`` function that operates on fixed-point types. |
| #. ``expf(x)`` can grow very large for even small for small values of x |
| (``expf(23)`` would exceed the largest value possible for an |
| ``unsigned long _Accum``). |
| |
| Fortunately, we can refactor the code as needed and we can make changes to |
| llvm-libc, the libc implementation this device uses. |
| |
| llvm-libc and some math |
| ======================= |
| The embedded device's codebase is dependent on a handful of functions that |
| take floats: ``expf``, ``sqrtf``, and ``roundf``. |
| ``printf`` is also used for printing floats, but support for the fixed-point format |
| specifiers is needed. The project already uses llvm-libc, so we can implement |
| versions of these functions that operate on fixed-point types. |
| The llvm-libc team at Google has been very responsive |
| and eager to implement these functions for us! Michael Jones (who implemented |
| much of printf in llvm-libc) provided `printf support for each of the fixed |
| point types <https://github.com/llvm/llvm-project/pull/82707>`_. Tue Ly |
| provided implementations for `abs <https://github.com/llvm/llvm-project/pull/81823>`_, |
| `roundf <https://github.com/llvm/llvm-project/pull/81994>`_, |
| `sqrtf <https://github.com/llvm/llvm-project/pull/83042>`_, |
| `integral sqrt with a fixed-point output <http://github.com/llvm/llvm-project/issues/83924>`_, |
| and `expf <https://github.com/llvm/llvm-project/pull/84391>`_ for different |
| fixed-point types. Now we have the necessary math functions which accept |
| fixed-point types. |
| |
| With implementations provided, the next big step is avoiding overflow and |
| getting results to fit within the new types. Let’s look at one case involving |
| ``expf`` and one involving ``sqrtf``. (Tue Ly also provided these solutions.) |
| |
| The softmax algorithm above easily causes overflow with: |
| |
| .. code-block:: |
| |
| exp_activations[i] = expf(activations[i]); |
| |
| But the larger goal is to get the likelihood that a sample matches each |
| category. This is a value from [0, 1]. |
| |
| .. math:: |
| |
| likelihood(i) = \frac{e^{activations[i]}}{\sum_{k=0}^{NumCategories}{e^{activation[k]}}} = \frac{e^{activation[i]]}}{sum(e^{activation[...]}))} |
| |
| This can however be normalized by the max activation: |
| |
| .. math:: |
| |
| likelihood(i) = \frac{e^{activation[i]]}}{sum(e^{activation[...]}))} = \frac{e^{activation[i]]}}{sum(e^{activation[...]}))} * \frac{max(e^{activation[...]})}{max(e^{activation[...]})} = \frac{e^{activation[i]-max(activation[...])]}}{sum(e^{activation[...]-max(activation[...])}))} |
| |
| This makes the exponent for each component at most zero and the result of |
| ``exp(x)`` at most 1 which can definitely fit in the fixed-point types. |
| Likewise, the sum of each of these is at most ``NumCategories`` (which happens |
| to be 10 for us). For the above code, the only necessary change required is: |
| |
| .. code-block:: |
| |
| exp_activations[i] = expf(activations[i] - max_activation); |
| |
| And lucky for us, the precision for a ``signed _Accum`` type is enough to get |
| us the same results. One interesting thing is this change alone also improved |
| the performance when using floats by 11%. The theory is the ``expf`` implementation |
| has a quick switch to see if the exponents need to be adjusted for floats, and |
| skips that scaling part when unnecessary. |
| |
| The code involving ``sqrtf`` can be adjusted similarly: |
| |
| .. code-block:: c++ |
| |
| void TouchProcessor::CharacteriseRadialDeviation(Touch& touch) { |
| // Compute center of touch. |
| int32_t sum_x_w = 0, sum_y_w = 0, sum_w = 0; |
| // touch.num_position_estimates is at most 100 |
| for (size_t i = 0; i < touch.num_position_estimates; i++) { |
| sum_x_w += touch.position_estimates[i].position.x * 255; |
| sum_y_w += touch.position_estimates[i].position.y * 255; |
| sum_w += 255; |
| } |
| |
| // Cast is safe, since average can't exceed any of the individual values. |
| touch.center = Point{ |
| .x = static_cast<int16_t>(sum_x_w / sum_w), |
| .y = static_cast<int16_t>(sum_y_w / sum_w), |
| }; |
| |
| // Compute radial deviation from center. |
| float sum_d_squared_w = 0.0f; |
| for (size_t i = 0; i < touch.num_position_estimates; i++) { |
| int32_t dx = touch.position_estimates[i].position.x - touch.center.x; |
| int32_t dy = touch.position_estimates[i].position.y - touch.center.y; |
| sum_d_squared_w += static_cast<float>(dx * dx + dy * dy) * 255; |
| } |
| |
| // deviation = sqrt(sum((dx^2 + dy^2) * w) / sum(w)) |
| touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)] |
| = sqrtf(sum_d_squared_w / sum_w); |
| } |
| |
| We know beforehand the maximal values of ``dx`` and ``dy`` are +/-750, so |
| ``sum_d_squared_w`` will require as much as 35 integral bits to represent the |
| final sum. ``sum_w`` also needs 11 bits, so ``sqrtf`` would need to accept a value |
| that can be held in ~24 bits. This can definitely fit into a |
| ``signed long _Accum`` which has 32 integral bits, but ideally we’d like to |
| use a ``_Sat signed _Accum`` for consistency. Similar to before, we can |
| normalize the value by 255. That is, we can avoid multiplying by 255 when |
| calculating ``sum_x_w``, ``sum_y_w``, and ``sum_w`` since this will result in |
| the exact same ``touch.center`` results. Likewise, if we avoid the ``* 255`` when |
| calculating ``sum_d_squared_w``, the final ``sqrtf`` result will be unchanged. |
| This makes the code: |
| |
| .. code-block:: c++ |
| |
| void TouchProcessor::CharacteriseRadialDeviation(Touch& touch) { |
| // Compute center of touch. |
| int32_t sum_x_w = 0, sum_y_w = 0, sum_w = touch.num_position_estimates; |
| // touch.num_position_estimates is at most 100 |
| for (size_t i = 0; i < touch.num_position_estimates; i++) { |
| sum_x_w += touch.position_estimates[i].position.x; |
| sum_y_w += touch.position_estimates[i].position.y; |
| } |
| |
| // Cast is safe, since average can't exceed any of the individual values. |
| touch.center = Point{ |
| .x = static_cast<int16_t>(sum_x_w / sum_w), |
| .y = static_cast<int16_t>(sum_y_w / sum_w), |
| }; |
| |
| // Compute radial deviation from center. |
| float sum_d_squared_w = 0.0f; |
| for (size_t i = 0; i < touch.num_position_estimates; i++) { |
| int32_t dx = touch.position_estimates[i].position.x - touch.center.x; |
| int32_t dy = touch.position_estimates[i].position.y - touch.center.y; |
| sum_d_squared_w += static_cast<float>(dx * dx + dy * dy); |
| } |
| |
| // deviation = sqrt(sum((dx^2 + dy^2) * w) / sum(w)) |
| touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)] |
| = sqrtf(sum_d_squared_w / sum_w); |
| } |
| |
| This allows ``sum_d_squared_w`` to fit within 31 integral bits. We can go |
| further and realize ``sum_d_squared_w`` will always have an integral value and |
| replace its type with a ``uint32_t``. This will mean any fractional part from |
| ``sum_d_squared_w / sum_w`` will be discarded, but for our use case, this is |
| acceptable since the integral part of ``sum_d_squared_w / sum_w`` is so large |
| that the fractional part is negligible. ``touch.features[i]`` also isn’t |
| propagated significantly so we don’t need to worry about propagation of |
| error. With this, we can replace the ``sqrtf`` with one that accepts an integral |
| type and returns a fixed-point type: |
| |
| .. code-block:: c++ |
| |
| touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)] |
| = isqrt(sum_d_squared_w / sum_w); |
| |
| ------- |
| Results |
| ------- |
| |
| Classification runtime performance |
| ================================== |
| Before any of this effort, a single run of classifying sensor input took on |
| average **856.8 us**. |
| |
| Currently, with all floats substituted with fixed-point types, a single run is |
| **412.845673 us** which is a **~2.1x speed improvement**. |
| |
| It’s worth noting that the optimizations we described above also improved |
| performance when using floats, making a single run using floats |
| **771.475464 us** which translates to **~1.9x speed improvement**. |
| |
| Size |
| ==== |
| We’ve observed **~1.25 KiB** saved (out of a **~177 KiB** binary) when |
| switching to fixed-point. |
| |
| .. code-block:: |
| |
| $ bloaty app.elf.fixed -- app.elf.float |
| FILE SIZE VM SIZE |
| -------------- -------------- |
| +0.5% +3.06Ki [ = ] 0 .debug_line |
| +0.1% +1.89Ki [ = ] 0 .debug_str |
| +0.2% +496 [ = ] 0 .debug_abbrev |
| +0.6% +272 +0.6% +272 .bss |
| -0.3% -16 -0.3% -16 .data |
| -0.3% -232 [ = ] 0 .debug_frame |
| -1.3% -712 [ = ] 0 .debug_ranges |
| -0.3% -1.11Ki [ = ] 0 .debug_loc |
| -1.2% -1.50Ki -1.2% -1.50Ki .text |
| -1.6% -1.61Ki [ = ] 0 .symtab |
| -3.2% -3.06Ki [ = ] 0 .pw_tokenizer.entries |
| -1.2% -4.00Ki [ = ] 0 .strtab |
| -0.4% -19.6Ki [ = ] 0 .debug_info |
| -0.3% -26.1Ki -0.7% -1.25Ki TOTAL |
| |
| A large amount of this can be attributed to not needing many of the |
| ``__aeabi_*`` float functions provided by libc. All instances of floats were |
| removed so they don’t get linked into the final binary. Instead, we use the |
| fixed-point equivalents provided by llvm-libc. Much of these fixed-point |
| functions are also smaller and make fewer calls to other functions. For |
| example, the ``sqrtf`` we used makes calls to ``__ieee754_sqrtf``, |
| ``__aeabi_fcmpun``, ``__aeabi_fcmplt``, ``__errno``, and ``__aeabi_fdiv`` |
| whereas ``isqrtf_fast`` only makes calls to ``__clzsi2``, ``__aeabi_lmul``, and |
| ``__aeabi_uidiv``. It’s worth noting that individual fixed-point binary |
| operations are slightly larger since they involve a bunch of shifts or |
| saturation checks. Floats tend to make a call to the same library function, but |
| fixed-point types emit instructions at every callsite. |
| |
| .. code-block:: |
| |
| float_add: |
| push {r7, lr} |
| add r7, sp, #0 |
| bl __aeabi_fadd |
| pop {r7, pc} |
| |
| fixed_add: |
| adds r0, r0, r1 |
| bvc .LBB1_2 |
| asrs r1, r0, #31 |
| movs r0, #1 |
| lsls r0, r0, #31 |
| eors r0, r1 |
| .LBB1_2: |
| bx lr |
| |
| Although the callsite is shorter, it’s worth noting that the soft float |
| functions can’t be inlined because they tend to be very large. The |
| ``__aeabi_fadd`` in particular is **444 bytes large**. |
| |
| Correctness |
| =========== |
| We observe the exact same results for our classification algorithms when |
| using fixed-point types. We have **over 15000 classification tests** with none |
| of them producing differing results. This effectively means we get all the |
| mentioned benefits without any correctness cost. |
| |
| ----------------- |
| Using fixed-point |
| ----------------- |
| Fixed-point arithmetic can be used out-of-the-box with Clang by adding |
| ``-ffixed-point`` on the command line. All of the types and semantics described |
| in ISO 18037 are supported. ``llvm-libc`` is the only ``libc`` that supports |
| fixed-point printing and math functions at the moment. Not all libc functions |
| described in ISO 18037 are supported at the moment, but they will be added |
| eventually! Pigweed users can use these fixed-point functions by using |
| the ``//pw_libc:stdfix`` target. |
| |
| ----------- |
| Future work |
| ----------- |
| * An option we could take to reduce size (at the cost of even more performance) |
| is to potentially teach LLVM to outline fixed-point additions. This could result |
| in something similar to the float libcalls. |
| * See if we can instead use a regular ``signed _Accum`` in the codebase instead |
| of a ``_Sat signed _Accum`` to save some instructions (or even see if we can |
| use a smaller type). |
| * Investigate performance compared to hard floats. If we’re able to afford |
| using an unsaturated type, then many native floating-point ops could be |
| replaced with normal integral ops. |
| * Full libc support for the fixed-point C functions. |