blob: 21e8ae5e2bcd574f90383b28d0d50fe4d09d9486 [file] [log] [blame]
.. _docs-size-optimizations:
Size Optimizations
This page contains recommendations for optimizing the size of embedded software
including its memory and code footprints.
These recommendations are subject to change as the C++ standard and compilers
evolve, and as the authors continue to gain more knowledge and experience in
this area. If you disagree with recommendations, please discuss them with the
Pigweed team, as we're always looking to improve the guide or correct any
Compile Time Constant Expressions
The use of `constexpr <>`_
and soon with C++20
`consteval <>`_ can enable
you to evaluate the value of a function or variable more at compile-time rather
than only at run-time. This can often not only result in smaller sizes but also
often times more efficient, faster execution.
We highly encourage using this aspect of C++, however there is one caveat: be
careful in marking functions constexpr in APIs which cannot be easily changed
in the future unless you can prove that for all time and all platforms, the
computation can actually be done at compile time. This is because there is no
"mutable" escape hatch for constexpr.
See the :doc:`embedded_cpp_guide` for more detail.
The compiler implements templates by generating a separate version of the
function for each set of types it is instantiated with. This can increase code
size significantly.
Be careful when instantiating non-trivial template functions with multiple
Consider splitting templated interfaces into multiple layers so that more of the
implementation can be shared between different instantiations. A more advanced
form is to share common logic internally by using default sentinel template
argument value and ergo instantation such as ``pw::Vector``'s
``size_t kMaxSize = vector_impl::kGeneric`` or ``pw::span``'s
``size_t Extent = dynamic_extent``.
Virtual Functions
Virtual functions provide for runtime polymorphism. Unless runtime polymorphism
is required, virtual functions should be avoided. Virtual functions require a
virtual table and a pointer to it in each instance, which all increases RAM
usage and requires extra instructions at each call site. Virtual functions can
also inhibit compiler optimizations, since the compiler may not be able to tell
which functions will actually be invoked. This can prevent linker garbage
collection, resulting in unused functions being linked into a binary.
When runtime polymorphism is required, virtual functions should be considered.
C alternatives, such as a struct of function pointers, could be used instead,
but these approaches may offer no performance advantage while sacrificing
flexibility and ease of use.
Only use virtual functions when runtime polymorphism is needed. Lastly try to
avoid templated virtual interfaces which can compound the cost by instantiating
many virtual tables.
When you do use virtual functions, try to keep devirtualization in mind. You can
make it easier on the compiler and linker by declaring class definitions as
``final`` to improve the odds. This can help significantly depending on your
If you're interested in more details,
`this is an interesting deep dive <>`_.
Initialization, Constructors, Finalizers, and Destructors
Where possible consider making your constructors constexpr to reduce their
costs. This also enables global instances to be eligible for ``.data`` or if
all zeros for ``.bss`` section placement.
Static Destructors And Finalizers
For many embedded projects, cleaning up after the program is not a requirement,
meaning the exit functions including any finalizers registered through
``atexit``, ``at_quick_exit``, and static destructors can all be removed to
reduce the size.
The exact mechanics for disabling static destructors depends on your toolchain.
See the `Ignored Finalizer and Destructor Registration`_ section below for
further details regarding disabling registration of functions to be run at exit
via ``atexit`` and ``at_quick_exit``.
With modern versions of Clang you can simply use ``-fno-C++-static-destructors``
and you are done.
GCC with newlib-nano
With GCC this is more complicated. For example with GCC for ARM Cortex M devices
using ``newlib-nano`` you are forced to tackle the problem in two stages.
First, there are the destructors for the static global objects. These can be
placed in the ``.fini_array`` and ``.fini`` input sections through the use of
the ``-fno-use-cxa-atexit`` GCC flag, assuming ``newlib-nano`` was configured
with ``HAVE_INITFINI_ARAY_SUPPORT``. The two input sections can then be
explicitly discarded in the linker script through the use of the special
``/DISCARD/`` output section:
.. code-block:: text
/* The finalizers are never invoked when the target shuts down and ergo
* can be discarded. These include C++ global static destructors and C
* designated finalizers. */
Second, there are the destructors for the scoped static objects, frequently
referred to as Meyer's Singletons. With the Itanium ABI these use
``__cxa_atexit`` to register destruction on the fly. However, if
``-fno-use-cxa-atexit`` is used with GCC and ``newlib-nano`` these will appear
as ``__tcf_`` prefixed symbols, for example ``__tcf_0``.
There's `an interesting proposal (P1247R0) <>`_ to
enable ``[[no_destroy]]`` attributes to C++ which would be tempting to use here.
Alas this is not an option yet. As mentioned in the proposal one way to remove
the destructors from these scoped statics is to wrap it in a templated wrapper
which uses placement new.
.. code-block:: cpp
#include <type_traits>
template <class T>
class NoDestroy {
template <class... Ts>
NoDestroy(Ts&&... ts) {
new (&static_) T(std::forward<Ts>(ts)...);
T& get() { return reinterpret_cast<T&>(static_); }
std::aligned_storage_t<sizeof(T), alignof(T)> static_;
This can then be used as follows to instantiate scoped statics where the
destructor will never be invoked and ergo will not be linked in.
.. code-block:: cpp
Foo& GetFoo() {
static NoDestroy<Foo> foo(foo_args);
return foo.get();
Instead of directly using strings and printf, consider using
:ref:`module-pw_tokenizer` to replace strings and printf-style formatted strings
with binary tokens during compilation. This can reduce the code size, memory
usage, I/O traffic, and even CPU utilization by replacing snprintf calls with
simple tokenization code.
Be careful when using string arguments with tokenization as these still result
in a string in your binary which is appended to your token at run time.
String Formatting
The formatted output family of printf functions in ``<cstdio>`` are quite
expensive from a code size point of view and they often rely on malloc. Instead,
where tokenization cannot be used, consider using :ref:`module-pw_string`'s
Removing all printf functions often saves more than 5KiB of code size on ARM
Cortex M devices using ``newlib-nano``.
Logging & Asserting
Using tokenized backends for logging and asserting such as
:ref:`module-pw_log_tokenized` coupled with :ref:`module-pw_assert_log` can
drastically reduce the costs. However, even with this approach there remains a
callsite cost which can add up due to arguments and including metadata.
Try to avoid string arguments and reduce unnecessary extra arguments where
possible. And consider adjusting log levels to compile out debug or even info
logs as code stabilizes and matures.
Future Plans
Going forward Pigweed is evaluating extra configuration options to do things
such as dropping log arguments for certain log levels and modules to give users
finer grained control in trading off diagnostic value and the size cost.
Threading and Synchronization Cost
Lighterweight Signaling Primatives
Consider using ``pw::sync::ThreadNotification`` instead of semaphores as they
can be implemented using more efficient RTOS specific signaling primitives. For
example on FreeRTOS they can be backed by direct task notifications which are
more than 10x smaller than semaphores while also being faster.
Threads and their stack sizes
Although synchronous APIs are incredibly portable and often easier to reason
about, it is often easy to forget the large stack cost this design paradigm
comes with. We highly recommend watermarking your stacks to reduce wasted
Our snapshot integration for RTOSes such as :ref:`module-pw_thread_freertos` and
:ref:`module-pw_thread_embos` come with built in support to report stack
watermarks for threads if enabled in the kernel.
In addition, consider using asynchronous design patterns such as Active Objects
which can use :ref:`module-pw_work_queue` or similar asynchronous dispatch work
queues to effectively permit the sharing of stack allocations.
Buffer Sizing
We'd be remiss not to mention the sizing of the various buffers that may exist
in your application. You could consider watermarking them with
:ref:`module-pw_metric`. You may also be able to adjust their servicing interval
and priority, but do not forget to keep the ingress burst sizes and scheduling
jitter into account.
Standard C and C++ libraries
Toolchains are typically distributed with their preferred standard C library and
standard C++ library of choice for the target platform.
Although you do not always have a choice in what standard C library and what
standard C++ library is used or even how it's compiled, we recommend always
keeping an eye out for common sources of bloat.
The standard C library should provides the ``assert`` function or macro which
may be internally used even if your application does not invoke it directly.
Although this can be disabled through ``NDEBUG`` there typically is not a
portable way of replacing the ``assert(condition)`` implementation without
configuring and recompiling your standard C library.
However, you can consider replacing the implementation at link time with a
cheaper implementation. For example ``newlib-nano``, which comes with the
``GNU Arm Embedded Toolchain``, often has an expensive ``__assert_func``
implementation which uses ``fiprintf`` to print to ``stderr`` before invoking
``abort()``. This can be replaced with a simple ``PW_CRASH`` invocation which
can save several kilobytes in case ``fiprintf`` isn't used elsewhere.
One option to remove this bloat is to use ``--wrap`` at link time to replace
these implementations. As an example in GN you could replace it with the
following ```` file:
.. code-block:: text
# Wraps the function called by newlib's implementation of assert from stdlib.h.
# When using this, we suggest injecting :newlib_assert via pw_build_LINK_DEPS.
config("wrap_newlib_assert") {
ldflags = [ "-Wl,--wrap=__assert_func" ]
# Implements the function called by newlib's implementation of assert from
# stdlib.h which invokes __assert_func unless NDEBUG is defined.
pw_source_set("wrapped_newlib_assert") {
sources = [ "" ]
deps = [
And a ```` source file implementing the wrapped assert
.. code-block:: cpp
#include "pw_assert/check.h"
#include "pw_preprocessor/compiler.h"
// This is defined by <cassert>
extern "C" PW_NO_RETURN void __wrap___assert_func(const char*,
const char*,
const char*) {
PW_CRASH("libc assert() failure");
Ignored Finalizer and Destructor Registration
Even if no cleanup is done during shutdown for your target, shutdown functions
such as ``atexit``, ``at_quick_exit``, and ``__cxa_atexit`` can sometimes not be
linked out. This may be due to vendor code or perhaps using scoped statics, also
known as Meyer's Singletons.
The registration of these destructors and finalizers may include locks, malloc,
and more depending on your standard C library and its configuration.
One option to remove this bloat is to use ``--wrap`` at link time to replace
these implementations with ones which do nothing. As an example in GN you could
replace it with the following ```` file:
.. code-block:: text
config("wrap_atexit") {
ldflags = [
# Implements atexit, at_quick_exit, and __cxa_atexit from stdlib.h with noop
# versions for targets which do not cleanup during exit and quick_exit.
# This removes any dependencies which may exist in your existing libc.
# Although this removes the ability for things such as Meyer's Singletons,
# i.e. non-global statics, to register destruction function it does not permit
# them to be garbage collected by the linker.
pw_source_set("wrapped_noop_atexit") {
sources = [ "" ]
And a ```` source file implementing the noop functions:
.. code-block:: cpp
// These two are defined by <cstdlib>.
extern "C" int __wrap_atexit(void (*)(void)) { return 0; }
extern "C" int __wrap_at_quick_exit(void (*)(void)) { return 0; }
// This function is part of the Itanium C++ ABI, there is no header which
// provides this.
extern "C" int __wrap___cxa_atexit(void (*)(void*), void*, void*) { return 0; }
Unexpected Bloat in Disabled STL Exceptions
`manual <>`_
recommends using ``-fno-exceptions`` along with ``-fno-unwind-tables`` to
disable exceptions and any associated overhead. This should replace all throw
statements with calls to ``abort()``.
However, what we've noticed with the GCC and ``libstdc++`` is that there is a
risk that the STL will still throw exceptions when the application is compiled
with ``-fno-exceptions`` and there is no way for you to catch them. In theory,
this is not unsafe because the unhandled exception will invoke ``abort()`` via
``std::terminate()``. This can occur because the libraries such as
``libstdc++.a`` may not have been compiled with ``-fno-exceptions`` even though
your application is linked against it.
`this <>`_
for more information.
Unfortunately there can be significant overhead surrounding these throw call
sites in the ``std::__throw_*`` helper functions. These implementations such as
``std::__throw_out_of_range_fmt(const char*, ...)`` and
their snprintf and ergo malloc dependencies can very quickly add up to many
kilobytes of unnecessary overhead.
One option to remove this bloat while also making sure that the exceptions will
actually result in an effective ``abort()`` is to use ``--wrap`` at link time to
replace these implementations with ones which simply call ``PW_CRASH``.
As an example in GN you could replace it with the following ```` file,
note that the mangled names must be used:
.. code-block:: text
# Wraps the std::__throw_* functions called by GNU ISO C++ Library regardless
# of whether "-fno-exceptions" is specified.
# When using this, we suggest injecting :wrapped_libstdc++_functexcept via
# pw_build_LINK_DEPS.
config("wrap_libstdc++_functexcept") {
ldflags = [
# Implements the std::__throw_* functions called by GNU ISO C++ Library
# regardless of whether "-fno-exceptions" is specified with PW_CRASH.
pw_source_set("wrapped_libstdc++_functexcept") {
sources = [ "" ]
deps = [
And a ```` source file implementing each
wrapped and mangled ``std::__throw_*`` function:
.. code-block:: cpp
#include "pw_assert/check.h"
#include "pw_preprocessor/compiler.h"
// These are all wrapped implementations of the throw functions provided by
// libstdc++'s bits/functexcept.h which are not needed when "-fno-exceptions"
// is used.
// std::__throw_bad_exception(void)
extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_bad_exceptionv() {
// std::__throw_bad_alloc(void)
extern "C" PW_NO_RETURN void __wrap__ZSt17__throw_bad_allocv() {
// std::__throw_bad_cast(void)
extern "C" PW_NO_RETURN void __wrap__ZSt16__throw_bad_castv() {
// std::__throw_bad_typeid(void)
extern "C" PW_NO_RETURN void __wrap__ZSt18__throw_bad_typeidv() {
// std::__throw_logic_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_logic_errorPKc(const char*) {
// std::__throw_domain_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_domain_errorPKc(const char*) {
// std::__throw_invalid_argument(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_invalid_argumentPKc(
const char*) {
// std::__throw_length_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_length_errorPKc(const char*) {
// std::__throw_out_of_range(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_out_of_rangePKc(const char*) {
// std::__throw_out_of_range_fmt(const char*, ...)
extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_out_of_range_fmtPKcz(
const char*, ...) {
// std::__throw_runtime_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_runtime_errorPKc(
const char*) {
// std::__throw_range_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_range_errorPKc(const char*) {
// std::__throw_overflow_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt22__throw_overflow_errorPKc(
const char*) {
// std::__throw_underflow_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt23__throw_underflow_errorPKc(
const char*) {
// std::__throw_ios_failure(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKc(const char*) {
// std::__throw_ios_failure(const char*, int)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKci(const char*,
int) {
// std::__throw_system_error(int)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_system_errori(int) {
// std::__throw_future_error(int)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_future_errori(int) {
// std::__throw_bad_function_call(void)
extern "C" PW_NO_RETURN void __wrap__ZSt25__throw_bad_function_callv() {
Compiler and Linker Optimizations
Compiler Optimization Options
Don't forget to configure your compiler to optimize for size if needed. With
Clang this is ``-Oz`` and with GCC this can be done via ``-Os``. The GN
toolchains provided through :ref:`module-pw_toolchain` which are optimized for
size are suffixed with ``*_size_optimized``.
Garbage collect function and data sections
By default the linker will place all functions in an object within the same
linker "section" (e.g. ``.text``). With Clang and GCC you can use
``-ffunction-sections`` and ``-fdata-sections`` to use a unique "section" for
each object (e.g. ``.text.do_foo_function``). This permits you to pass
``--gc-sections`` to the linker to cull any unused sections which were not
To see what sections were garbage collected you can pass ``--print-gc-sections``
to the linker so it prints out what was removed.
The GN toolchains provided through :ref:`module-pw_toolchain` are configured to
do this by default.
Function Inlining
Don't forget to expose trivial functions such as member accessors as inline
definitions in the header. The compiler and linker can make the trade-off on
whether the function should be actually inlined or not based on your
optimization settings, however this at least gives it the option. Note that LTO
can inline functions which are not defined in headers.
We stand by the
`Google style guide <>`_
to recommend considering this for simple functions which are 10 lines or less.
Link Time Optimization (LTO)
Consider using LTO to further reduce the size of the binary if needed. This
tends to be very effective at devirtualization.
Unfortunately, the aggressive inlining done by LTO can make it much more
difficult to triage bugs. In addition, it can significantly increase the total
build time.
On GCC and Clang this can be enabled by passing ``-flto`` to both the compiler
and the linker. In addition, on GCC ``-fdevirtualize-at-ltrans`` can be
optionally used to devirtualize more aggressively.
Disabling Scoped Static Initialization Locks
C++11 requires that scoped static objects are initialized in a thread-safe
manner. This also means that scoped statics, i.e. Meyer's Singletons, be
thread-safe. Unfortunately this rarely is the case on embedded targets. For
example with GCC on an ARM Cortex M device if you test for this you will
discover that instead the program crashes as reentrant initialization is
detected through the use of guard variables.
With GCC and Clang, ``-fno-threadsafe-statics`` can be used to remove the global
lock which often does not work for embedded targets. Note that this leaves the
guard variables in place which ensure that reentrant initialization continues to
Be careful when using this option in case you are relying on threadsafe
initialization of statics and the global locks were functional for your target.
Triaging Unexpectedly Linked In Functions
Lastly as a tip if you cannot figure out why a function is being linked in you
can consider:
* Using ``--wrap`` with the linker to remove the implementation, resulting in a
link failure which typically calls out which calling function can no longer be
* With GCC, you can use ``-fcallgraph-info`` to visualize or otherwise inspect
the callgraph to figure out who is calling what.
* Sometimes symbolizing the address can resolve what a function is for. For
example if you are using ``newlib-nano`` along with ``-fno-use-cxa-atexit``,
scoped static destructors are prefixed ``__tcf_*``. To figure out object these
destructor functions are associated with, you can use ``llvm-symbolizer`` or
``addr2line`` and these will often print out the related object's name.