| .. _module-pw_tokenizer-api: |
| |
| ============= |
| API reference |
| ============= |
| .. pigweed-module-subpage:: |
| :name: pw_tokenizer |
| :tagline: Cut your log sizes in half |
| :nav: |
| getting started: module-pw_tokenizer-get-started |
| design: module-pw_tokenizer-design |
| api: module-pw_tokenizer-api |
| cli: module-pw_tokenizer-cli |
| |
| .. _module-pw_tokenizer-api-tokenization: |
| |
| ------------ |
| Tokenization |
| ------------ |
| Tokenization converts a string literal to a token. If it's a printf-style |
| string, its arguments are encoded along with it. The results of tokenization can |
| be sent off device or stored in place of a full string. |
| |
| .. doxygentypedef:: pw_tokenizer_Token |
| |
| Tokenization macros |
| =================== |
| Adding tokenization to a project is simple. To tokenize a string, include |
| ``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros. |
| |
| Tokenize a string literal |
| ------------------------- |
| ``pw_tokenizer`` provides macros for tokenizing string literals with no |
| arguments. |
| |
| .. doxygendefine:: PW_TOKENIZE_STRING |
| .. doxygendefine:: PW_TOKENIZE_STRING_DOMAIN |
| .. doxygendefine:: PW_TOKENIZE_STRING_MASK |
| |
| The tokenization macros above cannot be used inside other expressions. |
| |
| .. admonition:: **Yes**: Assign :c:macro:`PW_TOKENIZE_STRING` to a ``constexpr`` variable. |
| :class: checkmark |
| |
| .. code:: cpp |
| |
| constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!"); |
| |
| void Function() { |
| constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?"); |
| } |
| |
| .. admonition:: **No**: Use :c:macro:`PW_TOKENIZE_STRING` in another expression. |
| :class: error |
| |
| .. code:: cpp |
| |
| void BadExample() { |
| ProcessToken(PW_TOKENIZE_STRING("This won't compile!")); |
| } |
| |
| Use :c:macro:`PW_TOKENIZE_STRING_EXPR` instead. |
| |
| An alternate set of macros is provided for use inside expressions. These make |
| use of lambda functions, so while they can be used inside expressions, they |
| require C++ and cannot be assigned to ``constexpr`` variables or be used with |
| special function variables like ``__func__``. |
| |
| .. doxygendefine:: PW_TOKENIZE_STRING_EXPR |
| .. doxygendefine:: PW_TOKENIZE_STRING_DOMAIN_EXPR |
| .. doxygendefine:: PW_TOKENIZE_STRING_MASK_EXPR |
| |
| .. admonition:: When to use these macros |
| |
| Use :c:macro:`PW_TOKENIZE_STRING` and related macros to tokenize string |
| literals that do not need %-style arguments encoded. |
| |
| .. admonition:: **Yes**: Use :c:macro:`PW_TOKENIZE_STRING_EXPR` within other expressions. |
| :class: checkmark |
| |
| .. code:: cpp |
| |
| void GoodExample() { |
| ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!")); |
| } |
| |
| .. admonition:: **No**: Assign :c:macro:`PW_TOKENIZE_STRING_EXPR` to a ``constexpr`` variable. |
| :class: error |
| |
| .. code:: cpp |
| |
| constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!"); |
| |
| Instead, use :c:macro:`PW_TOKENIZE_STRING` to assign to a ``constexpr`` variable. |
| |
| .. admonition:: **No**: Tokenize ``__func__`` in :c:macro:`PW_TOKENIZE_STRING_EXPR`. |
| :class: error |
| |
| .. code:: cpp |
| |
| void BadExample() { |
| // This compiles, but __func__ will not be the outer function's name, and |
| // there may be compiler warnings. |
| const uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__); |
| } |
| |
| Instead, use :c:macro:`PW_TOKENIZE_STRING` to tokenize ``__func__`` or similar special variables. |
| |
| Tokenize a message with arguments to a buffer |
| --------------------------------------------- |
| .. doxygendefine:: PW_TOKENIZE_TO_BUFFER |
| .. doxygendefine:: PW_TOKENIZE_TO_BUFFER_DOMAIN |
| .. doxygendefine:: PW_TOKENIZE_TO_BUFFER_MASK |
| |
| .. admonition:: Why use this macro |
| |
| - Encode a tokenized message for consumption within a function. |
| - Encode a tokenized message into an existing buffer. |
| |
| Avoid using ``PW_TOKENIZE_TO_BUFFER`` in widely expanded macros, such as a |
| logging macro, because it will result in larger code size than passing the |
| tokenized data to a function. |
| |
| .. _module-pw_tokenizer-custom-macro: |
| |
| Tokenize a message with arguments in a custom macro |
| --------------------------------------------------- |
| Projects can leverage the tokenization machinery in whichever way best suits |
| their needs. The most efficient way to use ``pw_tokenizer`` is to pass tokenized |
| data to a global handler function. A project's custom tokenization macro can |
| handle tokenized data in a function of its choosing. |
| |
| ``pw_tokenizer`` provides two low-level macros for projects to use |
| to create custom tokenization macros. |
| |
| .. doxygendefine:: PW_TOKENIZE_FORMAT_STRING |
| .. doxygendefine:: PW_TOKENIZER_ARG_TYPES |
| |
| The outputs of these macros are typically passed to an encoding function. That |
| function encodes the token, argument types, and argument data to a buffer using |
| helpers provided by ``pw_tokenizer/encode_args.h``. |
| |
| .. doxygenfunction:: pw::tokenizer::EncodeArgs |
| .. doxygenclass:: pw::tokenizer::EncodedMessage |
| :members: |
| .. doxygenfunction:: pw_tokenizer_EncodeArgs |
| |
| Tokenizing function names |
| ========================= |
| The string literal tokenization macros support tokenizing string literals or |
| constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the |
| special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared |
| as ``static constexpr char[]`` in C++ instead of the standard ``static const |
| char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be |
| tokenized while compiling C++ with GCC or Clang. |
| |
| .. code-block:: cpp |
| |
| // Tokenize the special function name variables. |
| constexpr uint32_t function = PW_TOKENIZE_STRING(__func__); |
| constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__); |
| |
| Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals. |
| They are defined as static character arrays, so they cannot be implicitly |
| concatenated with string literals. For example, ``printf(__func__ ": %d", |
| 123);`` will not compile. |
| |
| Encoding |
| ======== |
| The token is a 32-bit hash calculated during compilation. A tokenized message |
| is encoded as the token in little-endian byte order, followed by the arguments, |
| if any. For example, the 31-byte string ``You can go about your business.`` |
| hashes to ``0xdac9a244``, which is encoded as 4 bytes: ``44 a2 c9 da``. |
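
As a minimal sketch of the byte order, the token value quoted above can be packed with Python's standard ``struct`` module. The variable names are illustrative only and are not part of ``pw_tokenizer``.

```python
import struct

# Token value quoted in the text above for "You can go about your business."
token = 0xDAC9A244

# "<I" packs an unsigned 32-bit integer in little-endian byte order.
encoded = struct.pack("<I", token)

print(encoded.hex(" "))  # 44 a2 c9 da
```

A message with no arguments consists of these 4 bytes alone.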
| |
| Arguments are encoded as follows: |
| |
| * **Integers** (1--10 bytes) -- |
| `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_, |
| similarly to Protocol Buffers. Smaller values take fewer bytes. |
| * **Floating point numbers** (4 bytes) -- Single precision floating point. |
| * **Strings** (1--128 bytes) -- Length byte followed by the string contents. |
| The top bit of the length byte indicates whether the string was truncated. The |
| remaining 7 bits encode the string length, with a maximum of 127 bytes. |
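
The argument encodings listed above can be sketched in plain Python. This is an illustration written from the description, not the ``pw_tokenizer`` implementation: the helper names are hypothetical, the float is assumed to be stored little-endian, and string truncation here only applies the 127-byte cap, whereas a real encoder may also truncate to fit the remaining buffer space.

```python
import struct


def zigzag(value: int) -> int:
    """Maps a signed 64-bit integer to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((value << 1) ^ (value >> 63)) & 0xFFFFFFFFFFFFFFFF


def encode_varint(value: int) -> bytes:
    """Encodes an unsigned integer 7 bits at a time, lowest bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # High bit set: more bytes follow.
        else:
            out.append(byte)
            return bytes(out)


def encode_int(value: int) -> bytes:
    """ZigZag-encodes, then varint-encodes, a signed integer."""
    return encode_varint(zigzag(value))


def encode_float(value: float) -> bytes:
    """Packs a single-precision float (little-endian is an assumption)."""
    return struct.pack("<f", value)


def encode_string(value: bytes, max_length: int = 127) -> bytes:
    """Prefixes the string with a length byte; the top bit flags truncation."""
    truncated = len(value) > max_length
    data = value[:max_length]
    length_byte = len(data) | (0x80 if truncated else 0)
    return bytes([length_byte]) + data


print(encode_int(-1).hex())          # 01
print(encode_int(150).hex())         # ac02
print(encode_string(b"hi").hex())    # 026869
```

Small integers occupy a single byte, while the most negative 64-bit value needs all 10 bytes, which matches the 1--10 byte range given in the list.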
| |
| .. TODO(hepler): insert diagram here! |
| |
| .. tip:: |
| ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` |
| arguments short or avoid encoding them as strings (e.g. encode an enum as an |
| integer instead of a string). See also |
| :ref:`module-pw_tokenizer-tokenized-strings-as-args`. |
| |
| Buffer sizing helper |
| -------------------- |
| .. doxygenfunction:: pw::tokenizer::MinEncodingBufferSizeBytes |
| |
| Token generation: fixed length hashing at compile time |
| ====================================================== |
| String tokens are generated using a modified version of the x65599 hash used by |
| the SDBM project. All hashing is done at compile time. |
| |
| In C code, strings are hashed with a preprocessor macro. For compatibility with |
| macros, the hash must be limited to a fixed maximum number of characters. This |
| value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing |
| ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to |
| the complexity of the hashing macros. |
| |
| C++ macros use a constexpr function instead of a macro. This function works with |
| any length of string and has lower compilation time impact than the C macros. |
| For consistency, C++ tokenization uses the same hash algorithm, but the |
| calculated values will differ between C and C++ for strings longer than |
| ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters. |
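
The effect of a fixed hash length can be illustrated with a rough Python sketch of an x65599-style hash. The exact form used here (seeding the hash with the string length, multiplying each byte by successive powers of 65599, and wrapping at 32 bits) is an assumption made for illustration, as is the function name; use ``pw_tokenizer.tokens.pw_tokenizer_65599_hash`` to compute real tokens. The limit of 8 is a stand-in for ``PW_TOKENIZER_CFG_C_HASH_LENGTH``, not its actual value.

```python
from typing import Optional

HASH_CONSTANT = 65599


def sketch_65599_hash(string: str, hash_length: Optional[int] = None) -> int:
    """Hashes at most hash_length bytes of the string (all bytes if None)."""
    data = string.encode("utf-8")
    hash_value = len(data)  # Assumed: the hash is seeded with the full length.
    coefficient = HASH_CONSTANT
    for byte in data[:hash_length]:
        hash_value = (hash_value + coefficient * byte) % 2**32
        coefficient = (coefficient * HASH_CONSTANT) % 2**32
    return hash_value


# Two strings of equal length that differ only after the 8th character.
first = "voltage low: A"
second = "voltage low: B"

# Unlimited hashing (as in C++) tells them apart.
print(sketch_65599_hash(first) != sketch_65599_hash(second))  # True

# A fixed limit of 8 characters (as in C) makes them collide.
print(sketch_65599_hash(first, 8) == sketch_65599_hash(second, 8))  # True
```

In this sketch, strings no longer than the limit hash identically with or without it, so only longer strings produce different values under the two modes.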
| |
| Tokenization in Python |
| ====================== |
| The Python ``pw_tokenizer.encode`` module has limited support for encoding |
| tokenized messages with the ``encode_token_and_args`` function. |
| |
| .. autofunction:: pw_tokenizer.encode.encode_token_and_args |
| |
| This function requires that the string's token has already been calculated. |
| Typically these tokens are provided by a database, but they can be created |
| manually using the tokenizer hash. |
| |
| .. autofunction:: pw_tokenizer.tokens.pw_tokenizer_65599_hash |
| |
| This is particularly useful for offline token database generation in cases where |
| tokenized strings in a binary cannot be embedded as parsable pw_tokenizer |
| entries. |
| |
| .. note:: |
| In C, the hash length of a string has a fixed limit controlled by |
| ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as opposed |
| to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a matching |
| hash length limit. When creating an offline database, it's a good idea to |
| generate tokens for both C and C++ and merge the resulting databases. |
| |
| .. _module-pw_tokenizer-protobuf-tokenization-python: |
| |
| Protobuf tokenization library |
| ----------------------------- |
| The ``pw_tokenizer.proto`` Python module defines functions that may be used to |
| detokenize protobuf objects in Python. The function |
| :func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields annotated as |
| tokenized, replacing them with their detokenized version. For example: |
| |
| .. code-block:: python |
| |
| my_detokenizer = pw_tokenizer.Detokenizer(some_database) |
| |
| my_message = SomeMessage(tokenized_field=b'$YS1EMQ==') |
| pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message) |
| |
| assert my_message.tokenized_field == b'The detokenized string! Cool!' |
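
To see what the ``tokenized_field`` value in the example contains, the sketch below unpacks it with the standard library only. It assumes the field holds a ``$`` prefix followed by a Base64-encoded tokenized message whose first 4 bytes are the little-endian token; that layout is not described in this section, so treat it as an assumption. The variable names are illustrative.

```python
import base64

# The field value used in the example above.
field = b"$YS1EMQ=="

# Strip the assumed "$" prefix, then decode the Base64 payload.
assert field.startswith(b"$")
raw = base64.b64decode(field[1:])

# The first 4 bytes are read as the little-endian token; the rest are arguments.
token = int.from_bytes(raw[:4], "little")
arguments = raw[4:]

print(f"token=0x{token:08x}, argument bytes={arguments!r}")
# token=0x31442d61, argument bytes=b''
```

The detokenizer would look this token up in its database to recover the original string.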
| |
| pw_tokenizer.proto |
| ^^^^^^^^^^^^^^^^^^ |
| .. automodule:: pw_tokenizer.proto |
| :members: |