pw_tokenizer: Support masking tokens
Add *_MASK versions of the tokenization macros that apply a mask to the
token. The masked token is stored masked in the database, so it
detokenizes like any other token.
Change-Id: I20114bf2c1bb3cf4873256023899abb175114a4c
Reviewed-on: https://pigweed-review.googlesource.com/c/pigweed/pigweed/+/37323
Commit-Queue: Wyatt Hepler <hepler@google.com>
Reviewed-by: Keir Mierle <keir@google.com>
diff --git a/pw_tokenizer/docs.rst b/pw_tokenizer/docs.rst
index 3b1d0d2..a738414 100644
--- a/pw_tokenizer/docs.rst
+++ b/pw_tokenizer/docs.rst
@@ -397,6 +397,101 @@
See `Managing token databases`_ for information about the ``database.py``
command line tool.
+Smaller tokens with masking
+---------------------------
+``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
+fewer than 32 bits does not improve runtime or code size efficiency. However,
+when tokens are packed into data structures or stored in arrays, the size of the
+token directly affects memory usage. In those cases, every bit counts, and it
+may be desireable to use fewer bits for the token.
+
+``pw_tokenizer`` allows users to provide a mask to apply to the token. This
+masked token is used in both the token database and the code. The masked token
+is not a masked version of the full 32-bit token, the masked token is the token.
+This makes it trivial to decode tokens that use fewer than 32 bits.
+
+Masking functionality is provided through the ``*_MASK`` versions of the macros.
+For example, the following generates 16-bit tokens and packs them into an
+existing value.
+
+.. code-block:: cpp
+
+ constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
+ uint32_t packed_word = (other_bits << 16) | token;
+
+Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
+used for tokens, the more likely two strings are to hash to the same token. See
+`token collisions`_.
+
+Token collisions
+----------------
+Tokens are calculated with a hash function. It is possible for different
+strings to hash to the same token. When this happens, multiple strings will have
+the same token in the database, and it may not be possible to unambiguously
+decode a token.
+
+The detokenization tools attempt to resolve collisions automatically. Collisions
+are resolved based on two things:
+
+ - whether the tokenized data matches the strings arguments' (if any), and
+ - if / when the string was marked as having been removed from the database.
+
+Working with collisions
+^^^^^^^^^^^^^^^^^^^^^^^
+Collisions may occur occasionally. Run the command
+``python -m pw_tokenizer.database report <database>`` to see information about a
+token database, including any collisions.
+
+If there are collisions, take the following steps to resolve them.
+
+ - Change one of the colliding strings slightly to give it a new token.
+ - In C (not C++), artificial collisions may occur if strings longer than
+ ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
+ consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value.
+ See ``pw_tokenizer/public/pw_tokenizer/config.h``.
+ - Run the ``mark_removed`` command with the latest version of the build
+ artifacts to mark missing strings as removed. This deprioritizes them in
+ collision resolution.
+
+ .. code-block:: sh
+
+ python -m pw_tokenizer.database mark_removed --database <database> <ELF files>
+
+ The ``purge`` command may be used to delete these tokens from the database.
+
+Probability of collisions
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Hashes of any size have a collision risk. The probability of one at least
+one collision occurring for a given number of strings is unintuitively high
+(this is known as the `birthday problem
+<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
+used for tokens, the probability of collisions increases substantially.
+
+This table shows the approximate number of strings that can be hashed to have a
+1% or 50% probability of at least one collision (assuming a uniform, random
+hash).
+
++-------+---------------------------------------+
+| Token | Collision probability by string count |
+| bits +--------------------+------------------+
+| | 50% | 1% |
++=======+====================+==================+
+| 32 | 77000 | 9300 |
++-------+--------------------+------------------+
+| 31 | 54000 | 6600 |
++-------+--------------------+------------------+
+| 24 | 4800 | 580 |
++-------+--------------------+------------------+
+| 16 | 300 | 36 |
++-------+--------------------+------------------+
+| 8 | 19 | 3 |
++-------+--------------------+------------------+
+
+Keep this table in mind when masking tokens (see `Smaller tokens with
+masking`_). 16 bits might be acceptable when tokenizing a small set of strings,
+such as module names, but won't be suitable for large sets of strings, like log
+messages.
+
Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An ELF
diff --git a/pw_tokenizer/global_handlers_test.cc b/pw_tokenizer/global_handlers_test.cc
index 6cac806..fc11456 100644
--- a/pw_tokenizer/global_handlers_test.cc
+++ b/pw_tokenizer/global_handlers_test.cc
@@ -25,15 +25,17 @@
namespace {
// Constructs an array with the hashed string followed by the provided bytes.
-template <uint8_t... kData, size_t kSize>
-constexpr auto ExpectedData(const char (&format)[kSize]) {
- const uint32_t value = Hash(format);
- return std::array<uint8_t, sizeof(uint32_t) + sizeof...(kData)>{
+template <uint8_t... data, size_t size>
+constexpr auto ExpectedData(
+ const char (&format)[size],
+ uint32_t token_mask = std::numeric_limits<uint32_t>::max()) {
+ const uint32_t value = Hash(format) & token_mask;
+ return std::array<uint8_t, sizeof(uint32_t) + sizeof...(data)>{
static_cast<uint8_t>(value & 0xff),
static_cast<uint8_t>(value >> 8 & 0xff),
static_cast<uint8_t>(value >> 16 & 0xff),
static_cast<uint8_t>(value >> 24 & 0xff),
- kData...};
+ data...};
}
// Test fixture for both global handler functions. Both need a global message
@@ -91,6 +93,15 @@
EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
}
+TEST_F(TokenizeToGlobalHandler, Mask) {
+ PW_TOKENIZE_TO_GLOBAL_HANDLER_MASK(
+ "TEST_DOMAIN", 0x00FFF000, "The answer is: %s", "5432!");
+ constexpr std::array<uint8_t, 10> expected =
+ ExpectedData<5, '5', '4', '3', '2', '!'>("The answer is: %s", 0x00FFF000);
+ ASSERT_EQ(expected.size(), message_size_bytes_);
+ EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
+}
+
TEST_F(TokenizeToGlobalHandler, C_SequentialZigZag) {
pw_tokenizer_ToGlobalHandlerTest_SequentialZigZag();
@@ -181,6 +192,20 @@
EXPECT_EQ(payload_, 5432);
}
+TEST_F(TokenizeToGlobalHandlerWithPayload, Mask) {
+ PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_MASK(
+ "TEST_DOMAIN",
+ 0x12345678,
+ static_cast<pw_tokenizer_Payload>(5432),
+ "The answer is: %s",
+ "5432!");
+ constexpr std::array<uint8_t, 10> expected =
+ ExpectedData<5, '5', '4', '3', '2', '!'>("The answer is: %s", 0x12345678);
+ ASSERT_EQ(expected.size(), message_size_bytes_);
+ EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
+ EXPECT_EQ(payload_, 5432);
+}
+
struct Foo {
unsigned char a;
bool b;
diff --git a/pw_tokenizer/public/pw_tokenizer/tokenize.h b/pw_tokenizer/public/pw_tokenizer/tokenize.h
index 1492a28..7b2fa49 100644
--- a/pw_tokenizer/public/pw_tokenizer/tokenize.h
+++ b/pw_tokenizer/public/pw_tokenizer/tokenize.h
@@ -65,11 +65,21 @@
PW_TOKENIZE_STRING_DOMAIN(PW_TOKENIZER_DEFAULT_DOMAIN, string_literal)
// Same as PW_TOKENIZE_STRING, but tokenizes to the specified domain.
-#define PW_TOKENIZE_STRING_DOMAIN(domain, string_literal) \
- /* assign to a variable */ PW_TOKENIZER_STRING_TOKEN(string_literal); \
- \
- _PW_TOKENIZER_RECORD_ORIGINAL_STRING( \
- PW_TOKENIZER_STRING_TOKEN(string_literal), domain, string_literal)
+#define PW_TOKENIZE_STRING_DOMAIN(domain, string_literal) \
+ PW_TOKENIZE_STRING_MASK(domain, UINT32_MAX, string_literal)
+
+// Same as PW_TOKENIZE_STRING_DOMAIN, but applies a mask to the token.
+#define PW_TOKENIZE_STRING_MASK(domain, mask, string_literal) \
+ /* assign to a variable */ _PW_TOKENIZER_MASK_TOKEN(mask, string_literal); \
+ \
+ static_assert(0 < (mask) && (mask) <= UINT32_MAX, \
+ "Tokenizer masks must be non-zero uint32_t values."); \
+ \
+ _PW_TOKENIZER_RECORD_ORIGINAL_STRING( \
+ _PW_TOKENIZER_MASK_TOKEN(mask, string_literal), domain, string_literal)
+
+#define _PW_TOKENIZER_MASK_TOKEN(mask, string_literal) \
+ ((pw_tokenizer_Token)(mask)&PW_TOKENIZER_STRING_TOKEN(string_literal))
// Encodes a tokenized string and arguments to the provided buffer. The size of
// the buffer is passed via a pointer to a size_t. After encoding is complete,
@@ -99,15 +109,21 @@
__VA_ARGS__)
// Same as PW_TOKENIZE_TO_BUFFER, but tokenizes to the specified domain.
-#define PW_TOKENIZE_TO_BUFFER_DOMAIN( \
- domain, buffer, buffer_size_pointer, format, ...) \
- do { \
- _PW_TOKENIZE_FORMAT_STRING(domain, format, __VA_ARGS__); \
- _pw_tokenizer_ToBuffer(buffer, \
- buffer_size_pointer, \
- _pw_tokenizer_token, \
- PW_TOKENIZER_ARG_TYPES(__VA_ARGS__) \
- PW_COMMA_ARGS(__VA_ARGS__)); \
+#define PW_TOKENIZE_TO_BUFFER_DOMAIN( \
+ domain, buffer, buffer_size_pointer, format, ...) \
+ PW_TOKENIZE_TO_BUFFER_MASK( \
+ domain, UINT32_MAX, buffer, buffer_size_pointer, format, __VA_ARGS__)
+
+// Same as PW_TOKENIZE_TO_BUFFER_DOMAIN, but applies a mask to the token.
+#define PW_TOKENIZE_TO_BUFFER_MASK( \
+ domain, mask, buffer, buffer_size_pointer, format, ...) \
+ do { \
+ _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__); \
+ _pw_tokenizer_ToBuffer(buffer, \
+ buffer_size_pointer, \
+ _pw_tokenizer_token, \
+ PW_TOKENIZER_ARG_TYPES(__VA_ARGS__) \
+ PW_COMMA_ARGS(__VA_ARGS__)); \
} while (0)
// Encodes a tokenized string and arguments to a buffer on the stack. The
@@ -142,13 +158,19 @@
PW_TOKENIZE_TO_CALLBACK_DOMAIN( \
PW_TOKENIZER_DEFAULT_DOMAIN, callback, format, __VA_ARGS__)
+// Same as PW_TOKENIZE_TO_CALLBACK, but tokenizes to the specified domain.
#define PW_TOKENIZE_TO_CALLBACK_DOMAIN(domain, callback, format, ...) \
- do { \
- _PW_TOKENIZE_FORMAT_STRING(domain, format, __VA_ARGS__); \
- _pw_tokenizer_ToCallback(callback, \
- _pw_tokenizer_token, \
- PW_TOKENIZER_ARG_TYPES(__VA_ARGS__) \
- PW_COMMA_ARGS(__VA_ARGS__)); \
+ PW_TOKENIZE_TO_CALLBACK_MASK( \
+ domain, UINT32_MAX, callback, format, __VA_ARGS__)
+
+// Same as PW_TOKENIZE_TO_CALLBACK_DOMAIN, but applies a mask to the token.
+#define PW_TOKENIZE_TO_CALLBACK_MASK(domain, mask, callback, format, ...) \
+ do { \
+ _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__); \
+ _pw_tokenizer_ToCallback(callback, \
+ _pw_tokenizer_token, \
+ PW_TOKENIZER_ARG_TYPES(__VA_ARGS__) \
+ PW_COMMA_ARGS(__VA_ARGS__)); \
} while (0)
PW_EXTERN_C_START
@@ -184,7 +206,7 @@
// checks that the arguments are correct, stores the format string in a special
// section, and calculates the string's token at compile time.
// clang-format off
-#define _PW_TOKENIZE_FORMAT_STRING(domain, format, ...) \
+#define _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...) \
if (0) { /* Do not execute to prevent double evaluation of the arguments. */ \
pw_tokenizer_CheckFormatString(format PW_COMMA_ARGS(__VA_ARGS__)); \
} \
@@ -199,7 +221,7 @@
\
/* Tokenize the string to a pw_tokenizer_Token at compile time. */ \
static _PW_TOKENIZER_CONST pw_tokenizer_Token _pw_tokenizer_token = \
- PW_TOKENIZER_STRING_TOKEN(format); \
+ _PW_TOKENIZER_MASK_TOKEN(mask, format); \
\
_PW_TOKENIZER_RECORD_ORIGINAL_STRING(_pw_tokenizer_token, domain, format)
diff --git a/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler.h b/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler.h
index c315f91..72a5e7d 100644
--- a/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler.h
+++ b/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler.h
@@ -47,9 +47,14 @@
PW_TOKENIZER_DEFAULT_DOMAIN, format, __VA_ARGS__)
// Same as PW_TOKENIZE_TO_GLOBAL_HANDLER, but tokenizes to the specified domain.
-#define PW_TOKENIZE_TO_GLOBAL_HANDLER_DOMAIN(domain, format, ...) \
+#define PW_TOKENIZE_TO_GLOBAL_HANDLER_DOMAIN(domain, format, ...) \
+ PW_TOKENIZE_TO_GLOBAL_HANDLER_MASK(domain, UINT32_MAX, format, __VA_ARGS__)
+
+// Same as PW_TOKENIZE_TO_GLOBAL_HANDLER_DOMAIN, but applies a mask to the
+// token.
+#define PW_TOKENIZE_TO_GLOBAL_HANDLER_MASK(domain, mask, format, ...) \
do { \
- _PW_TOKENIZE_FORMAT_STRING(domain, format, __VA_ARGS__); \
+ _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__); \
_pw_tokenizer_ToGlobalHandler(_pw_tokenizer_token, \
PW_TOKENIZER_ARG_TYPES(__VA_ARGS__) \
PW_COMMA_ARGS(__VA_ARGS__)); \
diff --git a/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler_with_payload.h b/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler_with_payload.h
index 55914f7..2577e48 100644
--- a/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler_with_payload.h
+++ b/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler_with_payload.h
@@ -46,10 +46,17 @@
// Same as PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD, but tokenizes to the
// specified domain.
-#define PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_DOMAIN( \
- domain, payload, format, ...) \
+#define PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_DOMAIN( \
+ domain, payload, format, ...) \
+ PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_MASK( \
+ domain, UINT32_MAX, payload, format, __VA_ARGS__)
+
+// Same as PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_DOMAIN, but applies a mask
+// to the token.
+#define PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_MASK( \
+ domain, mask, payload, format, ...) \
do { \
- _PW_TOKENIZE_FORMAT_STRING(domain, format, __VA_ARGS__); \
+ _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__); \
_pw_tokenizer_ToGlobalHandlerWithPayload( \
payload, \
_pw_tokenizer_token, \
diff --git a/pw_tokenizer/tokenize_test.cc b/pw_tokenizer/tokenize_test.cc
index 01c6bc5..9849a54 100644
--- a/pw_tokenizer/tokenize_test.cc
+++ b/pw_tokenizer/tokenize_test.cc
@@ -18,6 +18,7 @@
#include <cstdint>
#include <cstring>
#include <iterator>
+#include <limits>
#include "gtest/gtest.h"
#include "pw_tokenizer/hash.h"
@@ -28,15 +29,17 @@
namespace {
// Constructs an array with the hashed string followed by the provided bytes.
-template <uint8_t... kData, size_t kSize>
-constexpr auto ExpectedData(const char (&format)[kSize]) {
- const uint32_t value = Hash(format);
- return std::array<uint8_t, sizeof(uint32_t) + sizeof...(kData)>{
+template <uint8_t... data, size_t size>
+constexpr auto ExpectedData(
+ const char (&format)[size],
+ uint32_t token_mask = std::numeric_limits<uint32_t>::max()) {
+ const uint32_t value = Hash(format) & token_mask;
+ return std::array<uint8_t, sizeof(uint32_t) + sizeof...(data)>{
static_cast<uint8_t>(value & 0xff),
static_cast<uint8_t>(value >> 8 & 0xff),
static_cast<uint8_t>(value >> 16 & 0xff),
static_cast<uint8_t>(value >> 24 & 0xff),
- kData...};
+ data...};
}
TEST(TokenizeString, EmptyString_IsZero) {
@@ -65,6 +68,22 @@
EXPECT_EQ(Hash("???"), TokenizedWithinClass().kThisToken);
}
+TEST(TokenizeString, Mask) {
+ [[maybe_unused]] constexpr uint32_t token = PW_TOKENIZE_STRING("(O_o)");
+ [[maybe_unused]] constexpr uint32_t masked_1 =
+ PW_TOKENIZE_STRING_MASK("domain", 0xAAAAAAAA, "(O_o)");
+ [[maybe_unused]] constexpr uint32_t masked_2 =
+ PW_TOKENIZE_STRING_MASK("domain", 0x55555555, "(O_o)");
+ [[maybe_unused]] constexpr uint32_t masked_3 =
+ PW_TOKENIZE_STRING_MASK("domain", 0xFFFF0000, "(O_o)");
+
+ static_assert(token != masked_1 && token != masked_2 && token != masked_3);
+ static_assert(masked_1 != masked_2 && masked_2 != masked_3);
+ static_assert((token & 0xAAAAAAAA) == masked_1);
+ static_assert((token & 0x55555555) == masked_2);
+ static_assert((token & 0xFFFF0000) == masked_3);
+}
+
// Use a function with a shorter name to test tokenizing __func__ and
// __PRETTY_FUNCTION__.
//
@@ -324,6 +343,23 @@
EXPECT_EQ(std::memcmp(expected.data(), buffer_, expected.size()), 0);
}
+TEST_F(TokenizeToBuffer, Mask) {
+ size_t message_size = sizeof(buffer_);
+
+ PW_TOKENIZE_TO_BUFFER_MASK("TEST_DOMAIN",
+ 0x0000FFFF,
+ buffer_,
+ &message_size,
+ "The answer was: %s",
+ "5432!");
+ constexpr std::array<uint8_t, 10> expected =
+ ExpectedData<5, '5', '4', '3', '2', '!'>("The answer was: %s",
+ 0x0000FFFF);
+
+ ASSERT_EQ(expected.size(), message_size);
+ EXPECT_EQ(std::memcmp(expected.data(), buffer_, expected.size()), 0);
+}
+
TEST_F(TokenizeToBuffer, TruncateArgs) {
// Args that can't fit are dropped completely
size_t message_size = 6;
@@ -472,6 +508,15 @@
EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
}
+TEST_F(TokenizeToCallback, Mask) {
+ PW_TOKENIZE_TO_CALLBACK_MASK(
+ "TEST_DOMAIN", 0x00000FFF, SetMessage, "The answer is: %s", "5432!");
+ constexpr std::array<uint8_t, 10> expected =
+ ExpectedData<5, '5', '4', '3', '2', '!'>("The answer is: %s", 0x00000FFF);
+ ASSERT_EQ(expected.size(), message_size_bytes_);
+ EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
+}
+
TEST_F(TokenizeToCallback, C_SequentialZigZag) {
pw_tokenizer_ToCallbackTest_SequentialZigZag(SetMessage);