.. _seed-0105:
===============================================
0105: Nested Tokens and Tokenized Log Arguments
===============================================
.. seed::
:number: 105
:name: Nested Tokens and Tokenized Log Arguments
:status: Accepted
:proposal_date: 2023-07-10
:cl: 154190
:authors: Gwyneth Chen
:facilitator: Wyatt Hepler
-------
Summary
-------
This SEED describes a number of extensions to the `pw_tokenizer <https://pigweed.dev/pw_tokenizer/>`_
and `pw_log_tokenized <https://pigweed.dev/pw_log_tokenized>`_ modules to
improve support for nesting tokens and add facilities for tokenizing arguments
to logs, such as strings and enums. This SEED primarily addresses C/C++
tokenization and Python/C++ detokenization.
----------
Motivation
----------
Currently, ``pw_tokenizer`` and ``pw_log_tokenized`` enable devices with limited
memory to store long log format strings as hashed 32-bit tokens. When logs are
moved off-device, host tooling can recover the full logs using token databases
that were created when building the device image. However, logs may still have
runtime string arguments that are stored and transferred 1:1 without additional
encoding. This SEED aims to extend tokenization to these arguments to further
reduce the weight of logging for embedded applications.
The proposed changes affect both the tokenization module itself and the logging
facilities built on top of tokenization.
--------
Proposal
--------
Logging enums such as ``pw::Status`` is one common special case where
tokenization is particularly appropriate: enum values are conceptually
already tokens mapping to their names, assuming no duplicate values. Logging
enums frequently entails creating functions and string names that occupy space
exclusively for logging purposes, which this proposal seeks to mitigate.
Here, ``pw::Status::NotFound()`` is presented as an illustrative example of
the several transformations that strings undergo during tokenization and
detokenization, further complicated in the proposed design by nested tokens.
.. list-table:: Enum Tokenization/Detokenization Phases
:widths: 20 45
* - (1) Source code
- ``PW_LOG("Status: " PW_LOG_ENUM_FMT(pw::Status), status.code())``
* - (2) Token database entries (token, string, domain)
- | ``16170adf, "Status: ${pw::Status}#%08x", ""``
| ``5 , "PW_STATUS_NOT_FOUND" , "pw::Status"``
* - (3) Wire format
- ``df 0a 17 16 0a`` (5 bytes)
* - (4) Top-level detokenized and formatted
- ``"Status: ${pw::Status}#00000005"``
* - (5) Fully detokenized
- ``"Status: PW_STATUS_NOT_FOUND"``
Compared to log tokenization without nesting, string literals in token
database entries may not be identical to what is typed in source code due
to the use of macros and preprocessor string concatenation. The
detokenizer also takes an additional step to recursively detokenize any
nested tokens. In exchange for this added complexity, nested enum tokenization
gives us the readability of logging value names with zero additional
runtime space or performance cost compared to logging the integral values
directly with ``pw_log_tokenized``.
.. note::
Without nested enum token support, users can select either readability or
reduced binary and transmission size, but not easily both:
.. list-table::
:widths: 15 20 20
:header-rows: 1
* -
- Raw integers
- String names
* - (1) Source code
- ``PW_LOG("Status: %x" , status.code())``
- ``PW_LOG("Status: %s" , pw_StatusString(status))``
* - (2) Token database entries (token, string, domain)
- ``03a83461, "Status: %x", ""``
- ``069c3ef0, "Status: %s", ""``
* - (3) Wire format
- ``61 34 a8 03 0a`` (5 bytes)
- ``f0 3e 9c 06 09 4e 4f 54 5f 46 4f 55 4e 44`` (14 bytes)
* - (4) Top-level detokenized and formatted
- ``"Status: 5"``
- ``"Status: PW_STATUS_NOT_FOUND"``
* - (5) Fully detokenized
- ``"Status: 5"``
- ``"Status: PW_STATUS_NOT_FOUND"``
Tokenization (C/C++)
====================
The ``pw_log_tokenized`` module exposes a set of macros for creating and
formatting nested tokens. Within format strings in the source code, tokens
are specified using function-like PRI-style macros. These can be used to
encode static information like the token domain or a numeric base encoding
and are macro-expanded to string literals that are concatenated with the
rest of the format string during preprocessing. Since ``pw_log`` generally
uses printf syntax, only bases 8, 10, and 16 are supported for integer token
arguments via ``%[odiuxX]``.
The provided macros enforce the token specifier syntax and keep the argument
types in sync when switching between ``pw_log`` backends such as
``pw_log_basic``. The macros for basic usage are as follows:
* ``PW_LOG_TOKEN`` and ``PW_LOG_TOKEN_EXPR`` are used to tokenize string args.
* ``PW_LOG_TOKEN_FMT`` is used inside the format string to specify a token arg.
* ``PW_LOG_TOKEN_TYPE`` is used if the type of a tokenized arg needs to be
referenced, e.g. as a ``ToString`` function return type.
.. code-block:: cpp
#include "pw_log/log.h"
#include "pw_log/tokenized_args.h"
// token with default options base-16 and empty domain
// token database literal: "The sun will come out $#%08x!"
PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!", PW_LOG_TOKEN_EXPR("tomorrow"))
// after detokenization: "The sun will come out tomorrow!"
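If a tokenized string argument is produced outside of the log statement itself,
``PW_LOG_TOKEN_EXPR`` and ``PW_LOG_TOKEN_TYPE`` can be combined as in the sketch
below; the helper function is hypothetical and only illustrates how the return
type tracks the active ``pw_log`` backend.
.. code-block:: cpp

   #include "pw_log/log.h"
   #include "pw_log/tokenized_args.h"

   // Hypothetical helper: returns a token when pw_log_tokenized is the
   // backend, or a plain string with other backends such as pw_log_basic.
   PW_LOG_TOKEN_TYPE StateName(bool enabled) {
     return enabled ? PW_LOG_TOKEN_EXPR("ENABLED")
                    : PW_LOG_TOKEN_EXPR("DISABLED");
   }

   void LogState(bool enabled) {
     // token database literal: "State: $#%08x"
     PW_LOG("State: " PW_LOG_TOKEN_FMT(), StateName(enabled));
     // after detokenization: e.g. "State: ENABLED"
   }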
Additional macros are also provided specifically for enum handling. The
``TOKENIZE_ENUM`` macro creates ELF token database entries for each enum
value with the specified token domain to prevent token collision between
multiple tokenized enums. This macro is kept separate from the enum
definition to allow things like tokenizing a preexisting enum defined in an
external dependency.
.. code-block:: cpp
// enums
namespace foo {
enum class Color { kRed, kGreen, kBlue };
// syntax TBD
TOKENIZE_ENUM(
foo::Color,
kRed,
kGreen,
kBlue
)
} // namespace foo
void LogColor(foo::Color color) {
// token database literal:
// "Color: [${foo::Color}10#%010d]"
PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]", color)
// after detokenization:
// e.g. "Color: kRed"
}
.. admonition:: Nested Base64 tokens
``PW_LOG_TOKEN_FMT`` can accept 64 as the base encoding for an argument, in
which case the argument should be a pre-encoded Base64 string argument
(e.g. ``QAzF39==``). However, this should be avoided when possible to
maximize space savings. Fully-formatted Base64 including the token prefix
may also be logged with ``%s`` as before.
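The sketch below shows both options described above; the exact expansion of
``PW_LOG_TOKEN_FMT(64)`` and the example argument values are assumptions for
illustration only.
.. code-block:: cpp

   // Argument is pre-encoded Base64 token data without the "$" prefix,
   // which PW_LOG_TOKEN_FMT(64) is assumed to supply in the format string.
   PW_LOG("Nested data: " PW_LOG_TOKEN_FMT(64), "QAzF39==")

   // Fully-formatted Base64, including the "$" prefix, may still be logged
   // with a plain "%s" as before (with no space savings).
   PW_LOG("Nested data: %s", "$QAzF39==")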
Detokenization (Python)
=======================
``Detokenizer.detokenize`` in Python (``Detokenizer::Detokenize`` in C++)
will automatically recursively detokenize tokens of all known formats rather
than requiring a separate call to ``detokenize_base64`` or similar.
To support detokenizing domain-specific tokens, token databases support multiple
domains, and ``database.py create`` will build a database with tokens from all
domains by default. Specifying a domain during database creation will cause
that domain to be treated as the default.
When detokenization fails, tokens appear as-is in logs. If the detokenizer has
the ``show_errors`` option set to ``True``, error messages may be printed
inline following the raw token.
Tokens
======
Many details described here are provided via the ``PW_LOG_TOKEN_FMT`` macro, so
users should typically not need to format tokens manually. However, if
detokenization fails for any reason, tokens will appear with the following
format in the final logs and should be easily recognizable.
Nested tokens have the following structure in partially detokenized logs
(transformation stage 4):
.. code-block::
$[{DOMAIN}][BASE#]TOKEN
The ``$`` is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default ``$``
character is strongly recommended.
.. list-table:: Options
:widths: 10 30
* - ``{DOMAIN}``
- Specifies the token domain. If this option is omitted, the default
(empty) domain is assumed.
* - ``BASE#``
- Defines the numeric base encoding of the token. Accepted values are 8,
10, 16, and 64. If the hash symbol ``#`` is used without specifying a
number, the base is assumed to be 16. If the base option is omitted
entirely, the base defaults to 64 for backward compatibility. All
encodings except Base64 are case-insensitive.
This option may be expanded to support other bases in the future.
* - ``TOKEN`` (required)
- The numeric representation of the token in the given base encoding. All
encodings except Base64 are left-padded with zeroes to the maximum width
of a 32-bit integer in the given base. Base64 data may additionally encode
string arguments for the detokenized token, and therefore does not have a
maximum width. This is automatically handled by ``PW_LOG_TOKEN_FMT`` for
supported bases.
When used in conjunction with ``pw_log_tokenized``, the token prefix (including
any domain and base specifications) is tokenized as part of the log format
string and therefore incurs zero additional memory or transmission cost over
that of the original format string. Over the wire, tokens in bases 8, 10, and
16 are transmitted as varint-encoded integers up to 5 bytes in size. Base64
tokens continue to be encoded as strings.
.. warning::
Tokens generally do not have a terminating character, which is why they are
required to be formatted with a fixed width. Otherwise, a token immediately
followed by alphanumeric characters that are valid in its base encoding
would fail to detokenize correctly.
.. admonition:: Recognizing raw nested tokens in strings
When a string is fully detokenized, there should no longer be any indication
of tokenization in the final result, e.g. detokenized logs should read the
same as plain string logs. However, if nested tokens cannot be detokenized for
any reason, they will appear in their raw form as below:
.. code-block::
// Base64 token with no arguments and empty domain
$QA19pfEQ
// Base-10 token
$10#0086025943
// Base-16 token with specified domain
${foo_namespace::MyEnum}#0000001A
// Base64 token with specified domain
${bar_namespace::MyEnum}QAQQQQ==
---------------------
Problem investigation
---------------------
Complex embedded device projects are perpetually seeking more RAM. For longer,
descriptive string arguments, even just a handful can take up hundreds of bytes
that frequently exist exclusively for logging purposes and have no effect on
device behavior.
One of the most common potential use cases is for logging enum values.
Inspection of one project revealed that enums accounted for some 90% of the
string log arguments. We have encountered instances where, to save space,
developers have avoided logging descriptive names in favor of raw enum values,
forcing readers of logs to look up or memorize the meaning of each number. As
with log format strings, the set of possible string values that might appear in
the final logs is known at compile time, so these values can be extracted into
a token database.
Another major challenge is maintaining a user interface that is easy to
understand and use. The current primary interface through
``pw_log`` provides printf-style formatting, which is familiar and succinct
for basic applications.
We also have to contend with the interchangeable backends of ``pw_log``. The
``pw_log`` facade is intended as an opaque interface layer; adding syntax
specifically for tokenized logging would break this abstraction barrier. Either
this additional syntax would be ignored by other backends, or it might simply
be incompatible (e.g. logging raw integer tokens instead of strings).
Pigweed already supports one form of nested tokens via Base64 encoding. Base64
tokens begin with ``'$'``, followed by Base64-encoded data, and may be padded
with one or two trailing ``'='`` symbols. The Python
``Detokenizer.detokenize_base64`` method recursively detokenizes Base64 by
running a regex replacement on the formatted results of each iteration. Base64
is not merely a token format, however; it can encode any binary data in a text
format at the cost of reduced efficiency. Therefore, Base64 tokens may include
not only a database token that may detokenize to a format string but also
binary-encoded arguments. Other token types are not expected to include this
additional argument data.
---------------
Detailed design
---------------
Tokenization
============
``pw_tokenizer`` and ``pw_log_tokenized`` already provide much of the necessary
functionality to support tokenized arguments. The proposed API is fully
backward-compatible with non-nested tokenized logging.
Token arguments are indicated in log format strings via PRI-style macros that
are exposed by a new ``pw_log/tokenized_args.h`` header. ``PW_LOG_TOKEN_FMT``
supplies the ``$`` token prefix, brackets around the domain, the base specifier,
and the printf-style specifier including padding and width, i.e. ``%011o`` for
base-8, ``%010u`` for base-10, and ``%08X`` for base-16.
For free-standing string arguments such as those where the literals are defined
in the log statements themselves, tokenization is performed with macros from
``pw_log/tokenized_args.h``. With the tokenized logging backend, these macros
simply alias the corresponding ``PW_TOKENIZE`` macros, but they also revert to
basic string formatting for other backends. This is achieved by placing an
empty header file in the local ``public_overrides`` directory of
``pw_log_tokenized`` and checking for it in ``pw_log/tokenized_args.h`` using
the ``__has_include`` directive.
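A minimal sketch of that dispatch is shown below; the override header name and
the exact macro expansions are assumptions for illustration, not a finalized
API.
.. code-block:: cpp

   // Sketch of pw_log/tokenized_args.h. The empty override header is
   // installed by pw_log_tokenized; the name used here is hypothetical.
   #if __has_include("pw_log_backend/uses_pw_tokenizer.h")

   // Tokenized backend: alias the pw_tokenizer macros and types.
   #define PW_LOG_TOKEN_EXPR(string_literal) PW_TOKENIZE_STRING_EXPR(string_literal)
   #define PW_LOG_TOKEN_TYPE pw_tokenizer_Token
   #define PW_LOG_TOKEN_FMT() "$#%08x"

   #else

   // Any other backend: fall back to plain string formatting.
   #define PW_LOG_TOKEN_EXPR(string_literal) (string_literal)
   #define PW_LOG_TOKEN_TYPE const char*
   #define PW_LOG_TOKEN_FMT() "%s"

   #endif  // __has_include("pw_log_backend/uses_pw_tokenizer.h")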
For variable string arguments, the API is split across locations. The string
literals are tokenized wherever they are defined, and the string format macros
appear in the log format strings corresponding to those string arguments.
When tokens use non-default domains, additional work may be required to create
the domain name and store associated tokens in the ELF.
Enum Tokenization
-----------------
We use existing ``pw_tokenizer`` utilities to record the raw enum values as
tokens corresponding to their string names in the ELF. No change is required to
the backend implementation; we simply skip the token calculation step, since we
already have a value to use. Specifying a token domain is generally required to
keep multiple tokenized enums from colliding.
For ease of use, we can also provide a macro that wraps the enum value list
and encapsulates the recording of each token value-string pair in the ELF.
When actually logging the values, users pass the enum type name as the domain
to the format specifier macro ``PW_LOG_ENUM_FMT()``, and the enum values can be
passed as-is to ``PW_LOG`` (casting to integers as necessary for scoped enums).
Since integers are varint-encoded over the wire, this will only require a
single byte for most enums.
.. admonition:: Logging pw::Status
Note that while this immediately reduces transmission size, the code
space occupied by the string names in ``pw::Status::str()`` cannot be
recovered unless an entire project is converted to log ``pw::Status``
as tokens.
.. code-block:: cpp
#include "pw_log/log.h"
#include "pw_log/tokenized_args.h"
#include "pw_status/status.h"
pw::Status status = pw::Status::NotFound();
// "pw::Status: ${pw::Status}#%08d"
PW_LOG("pw::Status: " PW_LOG_TOKEN(pw::Status), status.code)
// "pw::Status: NOT_FOUND"
Since the token mapping entries in the ELF are optimized out of the final
binary, the enum domains are tokenized away as part of the log format strings,
and we don't need to store separate tokens for each enum value, this addition
to the API would provide enum value names in logs with zero additional
RAM cost. Compared to logging strings with ``ToString``-style functions, we
save space on the string names as well as the functions themselves.
Token Database
==============
Token databases will be expanded to include a column for domains, so that
multiple domains can be encompassed in a single database rather than requiring
separate databases for each domain. This is important because domains are being
used to categorize tokens within a single project, rather than merely keeping
separate projects distinct from each other. When creating a database
from an ELF, a domain may be specified as the default domain instead of the
empty domain. A list of domains, or a path to a file listing domains, may also
be specified separately to select which domains are included in the database;
all domains are included by default.
When accessing a token database, both a domain and token value may be specified
to access specific values. If a domain is not specified, the default domain
will be assumed, retaining the same behavior as before.
Detokenization
==============
Detokenization is relatively straightforward. When the detokenizer is called,
it will first detokenize and format the top-level token and binary argument
data. The detokenizer will then find and replace nested tokens in the resulting
formatted string, then rescan the result for more nested tokens up to a fixed
number of rescans.
For each token type or format, ``pw_tokenizer`` defines a regular expression to
match the expected formatted output token and a helper function to convert a
token from a particular format to its mapped value. The regular expressions for
each token type are combined into a single regex that matches any one of the
formats. At each recursive step, every match is attempted against each
detokenization format, stopping at the first token type that succeeds and then
recursively replacing all nested tokens in the result. Only full data-encoding
token types like Base64 additionally require argument decoding and string
formatting as part of the recursive step.
For non-Base64 tokens, a token's base encoding as specified by ``BASE#``
determines its set of permissible alphanumeric characters and the
maximum token width for regex matching.
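As a rough illustration, a combined pattern covering the
``$[{DOMAIN}][BASE#]TOKEN`` format from the Tokens section might resemble the
sketch below; it is simplified (it does not enforce per-base widths or
alphabets) and is not necessarily the expression ``pw_tokenizer`` will use.
.. code-block:: cpp

   #include <regex>
   #include <string>

   // Simplified combined pattern for $[{DOMAIN}][BASE#]TOKEN. A real
   // implementation would also check per-base token widths and alphabets.
   const std::regex kNestedTokenPattern(
       R"(\$)"                       // common nested-token prefix
       R"((?:\{([^}]*)\})?)"         // optional {DOMAIN}
       R"((?:(8|10|16|64)?#)?)"      // optional BASE#; bare '#' means base 16
       R"(([0-9A-Za-z+/=]+))");      // token digits or Base64 data

   bool ContainsNestedToken(const std::string& formatted_log) {
     return std::regex_search(formatted_log, kNestedTokenPattern);
   }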
If nested detokenization fails for any reason, the formatted token will be
printed as-is in the output logs. If ``show_errors`` is true for the
detokenizer, errors will appear in parentheses immediately following the
token. Supported errors include:
* ``(token collision)``
* ``(missing database)``
* ``(token not found)``
------------
Alternatives
------------
Protobuf-based Tokenization
===========================
Tokenization may be expanded to function on structured data via protobufs.
This can be used to make logging more flexible, as all manner of compile-time
metadata can be freely attached to log arguments at effectively no cost.
This will most likely involve a separate build process to generate and tokenize
partially-populated protos and will significantly change the user API. It will
also be a large departure from the existing implementation, as the current
system relies only on C preprocessor and C++ ``constexpr`` tricks to function.
In this model, the token domain would likely be the fully-qualified namespace
of the proto definition, or a path to it.
Implementing this approach also requires a method of passing ordered arguments
to a partially-filled detokenized protobuf in a manner similar to printf-style
string formatting, so that argument data can be efficiently encoded and
transmitted alongside the protobuf's token, and the arguments to a particular
proto can be disambiguated from arguments to the rest of a log statement.
This approach will also most likely preclude plain string logging as is
currently supported by ``pw_log``, as the implementations diverge dramatically.
However, if pursued, this would likely be made the default logging schema
across all platforms, including host devices.
Custom Detokenization
=====================
Theoretically, individual projects could implement their own regex replacement
schemes on top of Pigweed's detokenizer, allowing them to more flexibly define
complex relationships between logged tokens via custom log format string
syntax. However, Pigweed should provide utilities for nested tokenization in
common cases such as logging enums.
The changes proposed do not preclude additional custom detokenization schemas
if absolutely necessary, and such practices do not appear to have been popular
thus far in any case.
--------------
Open questions
--------------
Missing API definitions:
* Updated APIs for creating and accessing token databases with multiple domains
* Python nested tokenization
* C++ nested detokenization