| .. _seed-0105: |
| |
| =============================================== |
| 0105: Nested Tokens and Tokenized Log Arguments |
| =============================================== |
| |
| .. seed:: |
| :number: 105 |
| :name: Nested Tokens and Tokenized Log Arguments |
| :status: Accepted |
| :proposal_date: 2023-07-10 |
| :cl: 154190 |
| :authors: Gwyneth Chen |
| :facilitator: Wyatt Hepler |
| |
| ------- |
| Summary |
| ------- |
| This SEED describes a number of extensions to the `pw_tokenizer <https://pigweed.dev/pw_tokenizer/>`_ |
| and `pw_log_tokenized <https://pigweed.dev/pw_log_tokenized>`_ modules to |
| improve support for nesting tokens and add facilities for tokenizing arguments |
to logs such as strings and enums. This SEED primarily addresses C/C++
| tokenization and Python/C++ detokenization. |
| |
| ---------- |
| Motivation |
| ---------- |
| Currently, ``pw_tokenizer`` and ``pw_log_tokenized`` enable devices with limited |
| memory to store long log format strings as hashed 32-bit tokens. When logs are |
| moved off-device, host tooling can recover the full logs using token databases |
| that were created when building the device image. However, logs may still have |
| runtime string arguments that are stored and transferred 1:1 without additional |
| encoding. This SEED aims to extend tokenization to these arguments to further |
| reduce the weight of logging for embedded applications. |
| |
| The proposed changes affect both the tokenization module itself and the logging |
| facilities built on top of tokenization. |
| |
| -------- |
| Proposal |
| -------- |
| Logging enums such as ``pw::Status`` is one common special case where |
| tokenization is particularly appropriate: enum values are conceptually |
| already tokens mapping to their names, assuming no duplicate values. Logging |
| enums frequently entails creating functions and string names that occupy space |
| exclusively for logging purposes, which this proposal seeks to mitigate. |
| Here, ``pw::Status::NotFound()`` is presented as an illustrative example of |
| the several transformations that strings undergo during tokenization and |
| detokenization, further complicated in the proposed design by nested tokens. |
| |
| .. list-table:: Enum Tokenization/Detokenization Phases |
| :widths: 20 45 |
| |
| * - (1) Source code |
| - ``PW_LOG("Status: " PW_LOG_ENUM_FMT(pw::Status), status.code())`` |
| * - (2) Token database entries (token, string, domain) |
| - | ``16170adf, "Status: ${pw::Status}#%08x", ""`` |
| | ``5 , "PW_STATUS_NOT_FOUND" , "pw::Status"`` |
| * - (3) Wire format |
| - ``df 0a 17 16 0a`` (5 bytes) |
| * - (4) Top-level detokenized and formatted |
| - ``"Status: ${pw::Status}#00000005"`` |
| * - (5) Fully detokenized |
| - ``"Status: PW_STATUS_NOT_FOUND"`` |
| |
| Compared to log tokenization without nesting, string literals in token |
| database entries may not be identical to what is typed in source code due |
| to the use of macros and preprocessor string concatenation. The |
| detokenizer also takes an additional step to recursively detokenize any |
| nested tokens. In exchange for this added complexity, nested enum tokenization |
| allows us to gain the readability of logging value names with zero additional |
| runtime space or performance cost compared to logging the integral values |
| directly with ``pw_log_tokenized``. |
| |
| .. note:: |
| Without nested enum token support, users can select either readability or |
| reduced binary and transmission size, but not easily both: |
| |
| .. list-table:: |
| :widths: 15 20 20 |
| :header-rows: 1 |
| |
| * - |
| - Raw integers |
| - String names |
| * - (1) Source code |
| - ``PW_LOG("Status: %x" , status.code())`` |
| - ``PW_LOG("Status: %s" , pw_StatusString(status))`` |
| * - (2) Token database entries (token, string, domain) |
| - ``03a83461, "Status: %x", ""`` |
| - ``069c3ef0, "Status: %s", ""`` |
| * - (3) Wire format |
| - ``61 34 a8 03 0a`` (5 bytes) |
| - ``f0 3e 9c 06 09 4e 4f 54 5f 46 4f 55 4e 44`` (14 bytes) |
| * - (4) Top-level detokenized and formatted |
| - ``"Status: 5"`` |
| - ``"Status: PW_STATUS_NOT_FOUND"`` |
| * - (5) Fully detokenized |
| - ``"Status: 5"`` |
| - ``"Status: PW_STATUS_NOT_FOUND"`` |
| |
| Tokenization (C/C++) |
| ==================== |
| The ``pw_log_tokenized`` module exposes a set of macros for creating and |
| formatting nested tokens. Within format strings in the source code, tokens |
| are specified using function-like PRI-style macros. These can be used to |
| encode static information like the token domain or a numeric base encoding |
| and are macro-expanded to string literals that are concatenated with the |
| rest of the format string during preprocessing. Since ``pw_log`` generally |
| uses printf syntax, only bases 8, 10, and 16 are supported for integer token |
| arguments via ``%[odiuxX]``. |
| |
The provided macros enforce the token specifier syntax and keep the argument
types in sync when switching to other ``pw_log`` backends like
``pw_log_basic``. The macros for basic usage are as follows:
| |
| * ``PW_LOG_TOKEN`` and ``PW_LOG_TOKEN_EXPR`` are used to tokenize string args. |
| * ``PW_LOG_TOKEN_FMT`` is used inside the format string to specify a token arg. |
| * ``PW_LOG_TOKEN_TYPE`` is used if the type of a tokenized arg needs to be |
| referenced, e.g. as a ``ToString`` function return type. |
| |
| .. code-block:: cpp |
| |
| #include "pw_log/log.h" |
| #include "pw_log/tokenized_args.h" |
| |
// token with default options: base-16 and the empty domain
// token database literal: "The sun will come out $#%08x!"
PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!",
       PW_LOG_TOKEN_EXPR("tomorrow"));
// after detokenization: "The sun will come out tomorrow!"
| |
| Additional macros are also provided specifically for enum handling. The |
| ``TOKENIZE_ENUM`` macro creates ELF token database entries for each enum |
| value with the specified token domain to prevent token collision between |
| multiple tokenized enums. This macro is kept separate from the enum |
| definition to allow things like tokenizing a preexisting enum defined in an |
| external dependency. |
| |
| .. code-block:: cpp |
| |
| // enums |
| namespace foo { |
| |
| enum class Color { kRed, kGreen, kBlue }; |
| |
| // syntax TBD |
| TOKENIZE_ENUM( |
| foo::Color, |
| kRed, |
| kGreen, |
| kBlue |
| ) |
| |
| } // namespace foo |
| |
| void LogColor(foo::Color color) { |
| // token database literal: |
| // "Color: [${foo::Color}10#%010d]" |
| PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]", color) |
| // after detokenization: |
| // e.g. "Color: kRed" |
| } |
| |
| .. admonition:: Nested Base64 tokens |
| |
``PW_LOG_TOKEN_FMT`` can accept 64 as the base encoding, in which case the
argument should be a pre-encoded Base64 string (e.g. ``QAzF39==``). However,
this should be avoided where possible to maximize space savings.
Fully formatted Base64 tokens, including the ``$`` prefix, may also continue
to be logged with ``%s`` as before.
| |
| Detokenization (Python) |
| ======================= |
| ``Detokenizer.detokenize`` in Python (``Detokenizer::Detokenize`` in C++) |
| will automatically recursively detokenize tokens of all known formats rather |
| than requiring a separate call to ``detokenize_base64`` or similar. |
| |
| To support detokenizing domain-specific tokens, token databases support multiple |
| domains, and ``database.py create`` will build a database with tokens from all |
| domains by default. Specifying a domain during database creation will cause |
| that domain to be treated as the default. |
| |
| When detokenization fails, tokens appear as-is in logs. If the detokenizer has |
| the ``show_errors`` option set to ``True``, error messages may be printed |
| inline following the raw token. |
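
For example, under this proposal a single Python call would fully resolve a
log containing nested tokens. The following sketch assumes a token database
at the illustrative path ``tokens.csv`` and reuses the wire-format bytes from
the ``pw::Status`` example above:

.. code-block:: python

   from pw_tokenizer import detokenize

   # Load a token database; with this proposal, entries from all domains
   # are available for nested detokenization.
   detokenizer = detokenize.Detokenizer('tokens.csv')

   # One call detokenizes the top-level token, formats its arguments, and
   # recursively resolves nested tokens; no separate detokenize_base64
   # pass is required.
   result = detokenizer.detokenize(b'\xdf\x0a\x17\x16\x0a')
   print(result)  # "Status: PW_STATUS_NOT_FOUND"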
| |
| Tokens |
| ====== |
Many details described here are handled by the ``PW_LOG_TOKEN_FMT`` macro, so
users should not typically need to format tokens manually. However, if
detokenization fails for any reason, tokens appear in the final logs in the
following format and should be easily recognizable.
| |
| Nested tokens have the following structure in partially detokenized logs |
| (transformation stage 4): |
| |
| .. code-block:: |
| |
| $[{DOMAIN}][BASE#]TOKEN |
| |
| The ``$`` is a common prefix required for all nested tokens. It is possible to |
| configure a different common prefix if necessary, but using the default ``$`` |
| character is strongly recommended. |
| |
| .. list-table:: Options |
| :widths: 10 30 |
| |
| * - ``{DOMAIN}`` |
| - Specifies the token domain. If this option is omitted, the default |
| (empty) domain is assumed. |
| * - ``BASE#`` |
| - Defines the numeric base encoding of the token. Accepted values are 8, |
| 10, 16, and 64. If the hash symbol ``#`` is used without specifying a |
| number, the base is assumed to be 16. If the base option is omitted |
entirely, the base defaults to 64 for backward compatibility. All
encodings except Base64 are case-insensitive.
| |
| This option may be expanded to support other bases in the future. |
| * - ``TOKEN`` (required) |
| - The numeric representation of the token in the given base encoding. All |
| encodings except Base64 are left-padded with zeroes to the maximum width |
| of a 32-bit integer in the given base. Base64 data may additionally encode |
| string arguments for the detokenized token, and therefore does not have a |
| maximum width. This is automatically handled by ``PW_LOG_TOKEN_FMT`` for |
| supported bases. |
| |
| When used in conjunction with ``pw_log_tokenized``, the token prefix (including |
| any domain and base specifications) is tokenized as part of the log format |
| string and therefore incurs zero additional memory or transmission cost over |
| that of the original format string. Over the wire, tokens in bases 8, 10, and |
| 16 are transmitted as varint-encoded integers up to 5 bytes in size. Base64 |
| tokens continue to be encoded as strings. |
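
To illustrate the 5-byte bound, the sketch below implements the little-endian,
zigzag base-128 varint encoding that ``pw_tokenizer`` uses for integer
arguments, including nested integer tokens (the function name is
illustrative):

.. code-block:: python

   def zigzag_varint(value: int) -> bytes:
       """Encodes a 32-bit integer as a zigzag little-endian base-128 varint."""
       # Zigzag-map signed values onto unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
       zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
       out = bytearray()
       while True:
           if zigzag < 0x80:
               out.append(zigzag)
               return bytes(out)
           out.append((zigzag & 0x7F) | 0x80)  # set the continuation bit
           zigzag >>= 7

   assert zigzag_varint(5) == b'\x0a'      # the status code in the table above
   assert len(zigzag_varint(-2**31)) == 5  # worst case: 5 bytes on the wire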
| |
| .. warning:: |
Tokens have no terminating character in general, which is why they must be
formatted with a fixed width. Otherwise, a token immediately followed by
alphanumeric characters that are valid in its base encoding would fail to
detokenize.
| |
| .. admonition:: Recognizing raw nested tokens in strings |
| |
When a string is fully detokenized, there should no longer be any indication
of tokenization in the final result, i.e. detokenized logs should read the
same as plain string logs. However, if nested tokens cannot be detokenized
for any reason, they will appear in their raw form as below:
| |
| .. code-block:: |
| |
| // Base64 token with no arguments and empty domain |
| $QA19pfEQ |
| |
| // Base-10 token |
| $10#0086025943 |
| |
| // Base-16 token with specified domain |
| ${foo_namespace::MyEnum}#0000001A |
| |
| // Base64 token with specified domain |
| ${bar_namespace::MyEnum}QAQQQQ== |
| |
| |
| --------------------- |
| Problem investigation |
| --------------------- |
Complex embedded device projects are perpetually seeking more RAM. Even a
handful of longer descriptive string arguments can occupy hundreds of bytes,
frequently for logging purposes alone and without any impact on device
function.
| |
One of the most common potential use cases is logging enum values. Inspection
of one project revealed that enums accounted for some 90% of its string log
arguments. We have encountered instances where, to save space, developers have
avoided logging descriptive names in favor of raw enum values, forcing readers
of logs to look up or memorize the meaning of each number. As with log format
strings, the set of possible string values is known at compile time, so these
strings can be extracted into a token database when the device image is built.
| |
Another major challenge is maintaining a user interface that is easy to
understand and use. The current primary interface through ``pw_log`` provides
printf-style formatting, which is familiar and succinct for basic
applications.
| |
We also have to contend with the interchangeable backends of ``pw_log``. The
``pw_log`` facade is intended as an opaque interface layer; adding syntax
specifically for tokenized logging would break this abstraction barrier.
Either the additional syntax would be ignored by other backends, or it might
simply be incompatible (e.g. logging raw integer tokens instead of strings).
| |
| Pigweed already supports one form of nested tokens via Base64 encoding. Base64 |
| tokens begin with ``'$'``, followed by Base64-encoded data, and may be padded |
| with one or two trailing ``'='`` symbols. The Python |
| ``Detokenizer.detokenize_base64`` method recursively detokenizes Base64 by |
| running a regex replacement on the formatted results of each iteration. Base64 |
| is not merely a token format, however; it can encode any binary data in a text |
| format at the cost of reduced efficiency. Therefore, Base64 tokens may include |
| not only a database token that may detokenize to a format string but also |
| binary-encoded arguments. Other token types are not expected to include this |
| additional argument data. |
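
For reference, the existing Base64 flow described above looks roughly like
this sketch (the database path and log text are illustrative):

.. code-block:: python

   from pw_tokenizer import detokenize

   detokenizer = detokenize.Detokenizer('tokens.csv')

   # Today, Base64 tokens embedded in text require this dedicated call,
   # which regex-replaces '$'-prefixed Base64 groups and recurses on the
   # formatted result.
   result = detokenizer.detokenize_base64(b'Temperature: $QA19pfEQ')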
| |
| --------------- |
| Detailed design |
| --------------- |
| |
| Tokenization |
| ============ |
| ``pw_tokenizer`` and ``pw_log_tokenized`` already provide much of the necessary |
| functionality to support tokenized arguments. The proposed API is fully |
| backward-compatible with non-nested tokenized logging. |
| |
| Token arguments are indicated in log format strings via PRI-style macros that |
| are exposed by a new ``pw_log/tokenized_args.h`` header. ``PW_LOG_TOKEN_FMT`` |
| supplies the ``$`` token prefix, brackets around the domain, the base specifier, |
| and the printf-style specifier including padding and width, i.e. ``%011o`` for |
| base-8, ``%010u`` for base-10, and ``%08X`` for base-16. |
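
Based on the examples in this SEED, these macros expand to string literals
along the following lines (exact literals are illustrative):

.. code-block::

   PW_LOG_TOKEN_FMT()               ->  "$#%08x"
   PW_LOG_ENUM_FMT(pw::Status)      ->  "${pw::Status}#%08x"
   PW_LOG_ENUM_FMT(foo::Color, 10)  ->  "${foo::Color}10#%010d"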
| |
| For free-standing string arguments such as those where the literals are defined |
| in the log statements themselves, tokenization is performed with macros from |
| ``pw_log/tokenized_args.h``. With the tokenized logging backend, these macros |
| simply alias the corresponding ``PW_TOKENIZE`` macros, but they also revert to |
| basic string formatting for other backends. This is achieved by placing an |
| empty header file in the local ``public_overrides`` directory of |
| ``pw_log_tokenized`` and checking for it in ``pw_log/tokenized_args.h`` using |
| the ``__has_include`` directive. |
| |
| For variable string arguments, the API is split across locations. The string |
| literals are tokenized wherever they are defined, and the string format macros |
| appear in the log format strings corresponding to those string arguments. |
| |
| When tokens use non-default domains, additional work may be required to create |
| the domain name and store associated tokens in the ELF. |
| |
| Enum Tokenization |
| ----------------- |
| We use existing ``pw_tokenizer`` utilities to record the raw enum values as |
| tokens corresponding to their string names in the ELF. There is no change |
| required for the backend implementation; we simply skip the token calculation |
| step, since we already have a value to use, and specifying a token domain is |
| generally required to isolate multiple enums from token collision. |
| |
| For ease of use, we can also provide a macro that wraps the enum value list |
| and encapsulates the recording of each token value-string pair in the ELF. |
| |
When actually logging the values, users pass the enum type name as the domain
to the format specifier macro ``PW_LOG_ENUM_FMT()``, and the enum values can
be passed as-is to ``PW_LOG`` (cast to integers as necessary for scoped
enums).
| Since integers are varint-encoded over the wire, this will only require a |
| single byte for most enums. |
| |
| .. admonition:: Logging pw::status |
| |
| Note that while this immediately reduces transmission size, the code |
| space occupied by the string names in ``pw::Status::str()`` cannot be |
| recovered unless an entire project is converted to log ``pw::Status`` |
| as tokens. |
| |
| .. code-block:: cpp |
| |
| #include "pw_log/log.h" |
| #include "pw_log/tokenized_args.h" |
| #include "pw_status/status.h" |
| |
pw::Status status = pw::Status::NotFound();

// token database literal: "pw::Status: ${pw::Status}#%08x"
PW_LOG("pw::Status: " PW_LOG_ENUM_FMT(pw::Status), status.code());
// after detokenization: "pw::Status: PW_STATUS_NOT_FOUND"
| |
| Since the token mapping entries in the ELF are optimized out of the final |
| binary, the enum domains are tokenized away as part of the log format strings, |
| and we don't need to store separate tokens for each enum value, this addition |
to the API would provide enum value names in logs with zero additional
| RAM cost. Compared to logging strings with ``ToString``-style functions, we |
| save space on the string names as well as the functions themselves. |
| |
| Token Database |
| ============== |
| Token databases will be expanded to include a column for domains, so that |
| multiple domains can be encompassed in a single database rather than requiring |
| separate databases for each domain. This is important because domains are being |
| used to categorize tokens within a single project, rather than merely keeping |
| separate projects distinct from each other. When creating a database |
| from an ELF, a domain may be specified as the default domain instead of the |
empty domain. A list of domains, or a path to a file listing domains, may
also be specified separately to define which domains are included in the
database; by default, all domains are included.
| |
| When accessing a token database, both a domain and token value may be specified |
| to access specific values. If a domain is not specified, the default domain |
| will be assumed, retaining the same behavior as before. |
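
The exact APIs remain an open question (see below), but domain-aware lookup
might work roughly as in this hypothetical sketch; ``DatabaseFile`` and
``token_to_entries`` exist in the Python API today, while the
domain-qualified lookup is illustrative only:

.. code-block:: python

   from pw_tokenizer import tokens

   db = tokens.DatabaseFile('tokens.csv')

   # Today: entries are looked up by token value alone.
   entries = db.token_to_entries[0x00000005]

   # Hypothetical: qualify the lookup with a domain so identically-valued
   # tokens in different domains (e.g. multiple enums) stay distinct.
   # entries = db.token_to_entries['pw::Status', 0x00000005]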
| |
| Detokenization |
| ============== |
Detokenization is relatively straightforward. When the detokenizer is called,
it first detokenizes and formats the top-level token and its binary argument
data. It then finds and replaces nested tokens in the resulting formatted
string, rescanning the result for further nested tokens up to a fixed number
of passes.
| |
| For each token type or format, ``pw_tokenizer`` defines a regular expression to |
| match the expected formatted output token and a helper function to convert a |
| token from a particular format to its mapped value. The regular expressions for |
| each token type are combined into a single regex that matches any one of the |
formats. At each recursive step, every match is attempted against each
detokenization format, stopping at the first token type that succeeds and
then recursively replacing all nested tokens in the result. Only full
data-encoding token types like Base64 additionally require string/argument
formatting as part of the recursive step.
| |
| For non-Base64 tokens, a token's base encoding as specified by ``BASE#`` |
| determines its set of permissible alphanumeric characters and the |
| maximum token width for regex matching. |
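
A minimal sketch of this recursive replacement strategy follows, with
simplified regexes covering just the base-16 and base-10 formats; ``lookup``
is an assumed helper that maps a (domain, value) pair to its detokenized
string or ``None``:

.. code-block:: python

   import re

   # Simplified patterns; the real combined regex covers all supported
   # bases, including Base64 with its encoded argument data.
   _BASE16 = r'\$(?:\{(?P<domain16>[^}]*)\})?#(?P<hex>[0-9A-Fa-f]{8})'
   _BASE10 = r'\$(?:\{(?P<domain10>[^}]*)\})?10#(?P<dec>[0-9]{10})'
   _NESTED = re.compile(f'{_BASE16}|{_BASE10}')

   MAX_RESCANS = 9  # bound the number of recursive passes


   def detokenize_nested(text: str, lookup) -> str:
       """Replaces nested tokens until none remain or the rescan limit hits."""

       def replace(match: re.Match) -> str:
           if match['hex'] is not None:
               domain, value = match['domain16'] or '', int(match['hex'], 16)
           else:
               domain, value = match['domain10'] or '', int(match['dec'], 10)
           result = lookup(domain, value)
           # On failure, leave the raw token in place.
           return result if result is not None else match.group(0)

       for _ in range(MAX_RESCANS):
           replaced = _NESTED.sub(replace, text)
           if replaced == text:  # nothing left to replace
               return replaced
           text = replaced
       return text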
| |
| If nested detokenization fails for any reason, the formatted token will be |
| printed as-is in the output logs. If ``show_errors`` is true for the |
| detokenizer, errors will appear in parentheses immediately following the |
| token. Supported errors include: |
| |
| * ``(token collision)`` |
| * ``(missing database)`` |
| * ``(token not found)`` |
| |
| ------------ |
| Alternatives |
| ------------ |
| |
| Protobuf-based Tokenization |
| =========================== |
| Tokenization may be expanded to function on structured data via protobufs. |
| This can be used to make logging more flexible, as all manner of compile-time |
| metadata can be freely attached to log arguments at effectively no cost. |
This would most likely involve a separate build step to generate and tokenize
partially-populated protos, and it would significantly change the user API.
It would also be a significant departure from the existing implementation,
which relies only on C preprocessor and C++ ``constexpr`` tricks.
| |
In this model, the token domain would likely be the fully-qualified namespace
of, or the path to, the proto definition.
| |
| Implementing this approach also requires a method of passing ordered arguments |
| to a partially-filled detokenized protobuf in a manner similar to printf-style |
| string formatting, so that argument data can be efficiently encoded and |
| transmitted alongside the protobuf's token, and the arguments to a particular |
| proto can be disambiguated from arguments to the rest of a log statement. |
| |
This approach would also most likely preclude plain string logging as
currently supported by ``pw_log``, as the implementations diverge
dramatically. However, if pursued, it would likely become the default logging
schema across all platforms, including host devices.
| |
| Custom Detokenization |
| ===================== |
| Theoretically, individual projects could implement their own regex replacement |
| schemes on top of Pigweed's detokenizer, allowing them to more flexibly define |
| complex relationships between logged tokens via custom log format string |
| syntax. However, Pigweed should provide utilities for nested tokenization in |
| common cases such as logging enums. |
| |
| The changes proposed do not preclude additional custom detokenization schemas |
| if absolutely necessary, and such practices do not appear to have been popular |
| thus far in any case. |
| |
| -------------- |
| Open questions |
| -------------- |
| Missing API definitions: |
| |
| * Updated APIs for creating and accessing token databases with multiple domains |
| * Python nested tokenization |
| * C++ nested detokenization |
| |