pw_tokenizer: Support masking tokens Add *_MASK versions of the tokenization macros that apply a mask to the token. The masked token is stored masked in the database, so it detokenizes like any other token. Change-Id: I20114bf2c1bb3cf4873256023899abb175114a4c Reviewed-on: https://pigweed-review.googlesource.com/c/pigweed/pigweed/+/37323 Commit-Queue: Wyatt Hepler <hepler@google.com> Reviewed-by: Keir Mierle <keir@google.com>

commit: 4b62b896ad220fd316c95029c18bc79b54c95ec3 [log] [tgz]
author: Wyatt Hepler <hepler@google.com> Thu Mar 04 10:03:43 2021 -0800
committer: CQ Bot Account <pigweed-scoped@luci-project-accounts.iam.gserviceaccount.com> Fri Mar 12 20:02:44 2021 +0000
tree: db3f58bd073d7419fe58c07b863805fb13886530
parent: 983f73fe65d500c517dd484fe670d0527c00063a [diff]
diff --git a/pw_tokenizer/docs.rst b/pw_tokenizer/docs.rst
index 3b1d0d2..a738414 100644
--- a/pw_tokenizer/docs.rst
+++ b/pw_tokenizer/docs.rst

@@ -397,6 +397,101 @@
 See `Managing token databases`_ for information about the ``database.py``
 command line tool.
 
+Smaller tokens with masking
+---------------------------
+``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
+fewer than 32 bits does not improve runtime or code size efficiency. However,
+when tokens are packed into data structures or stored in arrays, the size of the
+token directly affects memory usage. In those cases, every bit counts, and it
+may be desireable to use fewer bits for the token.
+
+``pw_tokenizer`` allows users to provide a mask to apply to the token. This
+masked token is used in both the token database and the code. The masked token
+is not a masked version of the full 32-bit token, the masked token is the token.
+This makes it trivial to decode tokens that use fewer than 32 bits.
+
+Masking functionality is provided through the ``*_MASK`` versions of the macros.
+For example, the following generates 16-bit tokens and packs them into an
+existing value.
+
+.. code-block:: cpp
+
+  constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
+  uint32_t packed_word = (other_bits << 16) | token;
+
+Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
+used for tokens, the more likely two strings are to hash to the same token. See
+`token collisions`_.
+
+Token collisions
+----------------
+Tokens are calculated with a hash function. It is possible for different
+strings to hash to the same token. When this happens, multiple strings will have
+the same token in the database, and it may not be possible to unambiguously
+decode a token.
+
+The detokenization tools attempt to resolve collisions automatically. Collisions
+are resolved based on two things:
+
+  - whether the tokenized data matches the strings arguments' (if any), and
+  - if / when the string was marked as having been removed from the database.
+
+Working with collisions
+^^^^^^^^^^^^^^^^^^^^^^^
+Collisions may occur occasionally. Run the command
+``python -m pw_tokenizer.database report <database>`` to see information about a
+token database, including any collisions.
+
+If there are collisions, take the following steps to resolve them.
+
+  - Change one of the colliding strings slightly to give it a new token.
+  - In C (not C++), artificial collisions may occur if strings longer than
+    ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
+    consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value.
+    See ``pw_tokenizer/public/pw_tokenizer/config.h``.
+  - Run the ``mark_removed`` command with the latest version of the build
+    artifacts to mark missing strings as removed. This deprioritizes them in
+    collision resolution.
+
+    .. code-block:: sh
+
+      python -m pw_tokenizer.database mark_removed --database <database> <ELF files>
+
+    The ``purge`` command may be used to delete these tokens from the database.
+
+Probability of collisions
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Hashes of any size have a collision risk. The probability of one at least
+one collision occurring for a given number of strings is unintuitively high
+(this is known as the `birthday problem
+<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
+used for tokens, the probability of collisions increases substantially.
+
+This table shows the approximate number of strings that can be hashed to have a
+1% or 50% probability of at least one collision (assuming a uniform, random
+hash).
+
++-------+---------------------------------------+
+| Token | Collision probability by string count |
+| bits  +--------------------+------------------+
+|       |         50%        |          1%      |
++=======+====================+==================+
+|   32  |       77000        |        9300      |
++-------+--------------------+------------------+
+|   31  |       54000        |        6600      |
++-------+--------------------+------------------+
+|   24  |        4800        |         580      |
++-------+--------------------+------------------+
+|   16  |         300        |          36      |
++-------+--------------------+------------------+
+|    8  |          19        |           3      |
++-------+--------------------+------------------+
+
+Keep this table in mind when masking tokens (see `Smaller tokens with
+masking`_). 16 bits might be acceptable when tokenizing a small set of strings,
+such as module names, but won't be suitable for large sets of strings, like log
+messages.
+
 Token databases
 ===============
 Token databases store a mapping of tokens to the strings they represent. An ELF

diff --git a/pw_tokenizer/global_handlers_test.cc b/pw_tokenizer/global_handlers_test.cc
index 6cac806..fc11456 100644
--- a/pw_tokenizer/global_handlers_test.cc
+++ b/pw_tokenizer/global_handlers_test.cc

@@ -25,15 +25,17 @@
 namespace {
 
 // Constructs an array with the hashed string followed by the provided bytes.
-template <uint8_t... kData, size_t kSize>
-constexpr auto ExpectedData(const char (&format)[kSize]) {
-  const uint32_t value = Hash(format);
-  return std::array<uint8_t, sizeof(uint32_t) + sizeof...(kData)>{
+template <uint8_t... data, size_t size>
+constexpr auto ExpectedData(
+    const char (&format)[size],
+    uint32_t token_mask = std::numeric_limits<uint32_t>::max()) {
+  const uint32_t value = Hash(format) & token_mask;
+  return std::array<uint8_t, sizeof(uint32_t) + sizeof...(data)>{
       static_cast<uint8_t>(value & 0xff),
       static_cast<uint8_t>(value >> 8 & 0xff),
       static_cast<uint8_t>(value >> 16 & 0xff),
       static_cast<uint8_t>(value >> 24 & 0xff),
-      kData...};
+      data...};
 }
 
 // Test fixture for both global handler functions. Both need a global message
@@ -91,6 +93,15 @@
   EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
 }
 
+TEST_F(TokenizeToGlobalHandler, Mask) {
+  PW_TOKENIZE_TO_GLOBAL_HANDLER_MASK(
+      "TEST_DOMAIN", 0x00FFF000, "The answer is: %s", "5432!");
+  constexpr std::array<uint8_t, 10> expected =
+      ExpectedData<5, '5', '4', '3', '2', '!'>("The answer is: %s", 0x00FFF000);
+  ASSERT_EQ(expected.size(), message_size_bytes_);
+  EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
+}
+
 TEST_F(TokenizeToGlobalHandler, C_SequentialZigZag) {
   pw_tokenizer_ToGlobalHandlerTest_SequentialZigZag();
 
@@ -181,6 +192,20 @@
   EXPECT_EQ(payload_, 5432);
 }
 
+TEST_F(TokenizeToGlobalHandlerWithPayload, Mask) {
+  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_MASK(
+      "TEST_DOMAIN",
+      0x12345678,
+      static_cast<pw_tokenizer_Payload>(5432),
+      "The answer is: %s",
+      "5432!");
+  constexpr std::array<uint8_t, 10> expected =
+      ExpectedData<5, '5', '4', '3', '2', '!'>("The answer is: %s", 0x12345678);
+  ASSERT_EQ(expected.size(), message_size_bytes_);
+  EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
+  EXPECT_EQ(payload_, 5432);
+}
+
 struct Foo {
   unsigned char a;
   bool b;

diff --git a/pw_tokenizer/public/pw_tokenizer/tokenize.h b/pw_tokenizer/public/pw_tokenizer/tokenize.h
index 1492a28..7b2fa49 100644
--- a/pw_tokenizer/public/pw_tokenizer/tokenize.h
+++ b/pw_tokenizer/public/pw_tokenizer/tokenize.h

@@ -65,11 +65,21 @@
   PW_TOKENIZE_STRING_DOMAIN(PW_TOKENIZER_DEFAULT_DOMAIN, string_literal)
 
 // Same as PW_TOKENIZE_STRING, but tokenizes to the specified domain.
-#define PW_TOKENIZE_STRING_DOMAIN(domain, string_literal)               \
-  /* assign to a variable */ PW_TOKENIZER_STRING_TOKEN(string_literal); \
-                                                                        \
-  _PW_TOKENIZER_RECORD_ORIGINAL_STRING(                                 \
-      PW_TOKENIZER_STRING_TOKEN(string_literal), domain, string_literal)
+#define PW_TOKENIZE_STRING_DOMAIN(domain, string_literal) \
+  PW_TOKENIZE_STRING_MASK(domain, UINT32_MAX, string_literal)
+
+// Same as PW_TOKENIZE_STRING_DOMAIN, but applies a mask to the token.
+#define PW_TOKENIZE_STRING_MASK(domain, mask, string_literal)                \
+  /* assign to a variable */ _PW_TOKENIZER_MASK_TOKEN(mask, string_literal); \
+                                                                             \
+  static_assert(0 < (mask) && (mask) <= UINT32_MAX,                          \
+                "Tokenizer masks must be non-zero uint32_t values.");        \
+                                                                             \
+  _PW_TOKENIZER_RECORD_ORIGINAL_STRING(                                      \
+      _PW_TOKENIZER_MASK_TOKEN(mask, string_literal), domain, string_literal)
+
+#define _PW_TOKENIZER_MASK_TOKEN(mask, string_literal) \
+  ((pw_tokenizer_Token)(mask)&PW_TOKENIZER_STRING_TOKEN(string_literal))
 
 // Encodes a tokenized string and arguments to the provided buffer. The size of
 // the buffer is passed via a pointer to a size_t. After encoding is complete,
@@ -99,15 +109,21 @@
                                __VA_ARGS__)
 
 // Same as PW_TOKENIZE_TO_BUFFER, but tokenizes to the specified domain.
-#define PW_TOKENIZE_TO_BUFFER_DOMAIN(                          \
-    domain, buffer, buffer_size_pointer, format, ...)          \
-  do {                                                         \
-    _PW_TOKENIZE_FORMAT_STRING(domain, format, __VA_ARGS__);   \
-    _pw_tokenizer_ToBuffer(buffer,                             \
-                           buffer_size_pointer,                \
-                           _pw_tokenizer_token,                \
-                           PW_TOKENIZER_ARG_TYPES(__VA_ARGS__) \
-                               PW_COMMA_ARGS(__VA_ARGS__));    \
+#define PW_TOKENIZE_TO_BUFFER_DOMAIN(                 \
+    domain, buffer, buffer_size_pointer, format, ...) \
+  PW_TOKENIZE_TO_BUFFER_MASK(                         \
+      domain, UINT32_MAX, buffer, buffer_size_pointer, format, __VA_ARGS__)
+
+// Same as PW_TOKENIZE_TO_BUFFER_DOMAIN, but applies a mask to the token.
+#define PW_TOKENIZE_TO_BUFFER_MASK(                                \
+    domain, mask, buffer, buffer_size_pointer, format, ...)        \
+  do {                                                             \
+    _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__); \
+    _pw_tokenizer_ToBuffer(buffer,                                 \
+                           buffer_size_pointer,                    \
+                           _pw_tokenizer_token,                    \
+                           PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)     \
+                               PW_COMMA_ARGS(__VA_ARGS__));        \
   } while (0)
 
 // Encodes a tokenized string and arguments to a buffer on the stack. The
@@ -142,13 +158,19 @@
   PW_TOKENIZE_TO_CALLBACK_DOMAIN(                      \
       PW_TOKENIZER_DEFAULT_DOMAIN, callback, format, __VA_ARGS__)
 
+// Same as PW_TOKENIZE_TO_CALLBACK, but tokenizes to the specified domain.
 #define PW_TOKENIZE_TO_CALLBACK_DOMAIN(domain, callback, format, ...) \
-  do {                                                                \
-    _PW_TOKENIZE_FORMAT_STRING(domain, format, __VA_ARGS__);          \
-    _pw_tokenizer_ToCallback(callback,                                \
-                             _pw_tokenizer_token,                     \
-                             PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)      \
-                                 PW_COMMA_ARGS(__VA_ARGS__));         \
+  PW_TOKENIZE_TO_CALLBACK_MASK(                                       \
+      domain, UINT32_MAX, callback, format, __VA_ARGS__)
+
+// Same as PW_TOKENIZE_TO_CALLBACK_DOMAIN, but applies a mask to the token.
+#define PW_TOKENIZE_TO_CALLBACK_MASK(domain, mask, callback, format, ...) \
+  do {                                                                    \
+    _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__);        \
+    _pw_tokenizer_ToCallback(callback,                                    \
+                             _pw_tokenizer_token,                         \
+                             PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)          \
+                                 PW_COMMA_ARGS(__VA_ARGS__));             \
   } while (0)
 
 PW_EXTERN_C_START
@@ -184,7 +206,7 @@
 // checks that the arguments are correct, stores the format string in a special
 // section, and calculates the string's token at compile time.
 // clang-format off
-#define _PW_TOKENIZE_FORMAT_STRING(domain, format, ...)                        \
+#define _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)                  \
   if (0) { /* Do not execute to prevent double evaluation of the arguments. */ \
     pw_tokenizer_CheckFormatString(format PW_COMMA_ARGS(__VA_ARGS__));         \
   }                                                                            \
@@ -199,7 +221,7 @@
                                                                                \
   /* Tokenize the string to a pw_tokenizer_Token at compile time. */           \
   static _PW_TOKENIZER_CONST pw_tokenizer_Token _pw_tokenizer_token =          \
-      PW_TOKENIZER_STRING_TOKEN(format);                                       \
+      _PW_TOKENIZER_MASK_TOKEN(mask, format);                                  \
                                                                                \
   _PW_TOKENIZER_RECORD_ORIGINAL_STRING(_pw_tokenizer_token, domain, format)
 

diff --git a/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler.h b/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler.h
index c315f91..72a5e7d 100644
--- a/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler.h
+++ b/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler.h

@@ -47,9 +47,14 @@
       PW_TOKENIZER_DEFAULT_DOMAIN, format, __VA_ARGS__)
 
 // Same as PW_TOKENIZE_TO_GLOBAL_HANDLER, but tokenizes to the specified domain.
-#define PW_TOKENIZE_TO_GLOBAL_HANDLER_DOMAIN(domain, format, ...)     \
+#define PW_TOKENIZE_TO_GLOBAL_HANDLER_DOMAIN(domain, format, ...) \
+  PW_TOKENIZE_TO_GLOBAL_HANDLER_MASK(domain, UINT32_MAX, format, __VA_ARGS__)
+
+// Same as PW_TOKENIZE_TO_GLOBAL_HANDLER_DOMAIN, but applies a mask to the
+// token.
+#define PW_TOKENIZE_TO_GLOBAL_HANDLER_MASK(domain, mask, format, ...) \
   do {                                                                \
-    _PW_TOKENIZE_FORMAT_STRING(domain, format, __VA_ARGS__);          \
+    _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__);    \
     _pw_tokenizer_ToGlobalHandler(_pw_tokenizer_token,                \
                                   PW_TOKENIZER_ARG_TYPES(__VA_ARGS__) \
                                       PW_COMMA_ARGS(__VA_ARGS__));    \

diff --git a/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler_with_payload.h b/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler_with_payload.h
index 55914f7..2577e48 100644
--- a/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler_with_payload.h
+++ b/pw_tokenizer/public/pw_tokenizer/tokenize_to_global_handler_with_payload.h

@@ -46,10 +46,17 @@
 
 // Same as PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD, but tokenizes to the
 // specified domain.
-#define PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_DOMAIN(               \
-    domain, payload, format, ...)                                        \
+#define PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_DOMAIN( \
+    domain, payload, format, ...)                          \
+  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_MASK(         \
+      domain, UINT32_MAX, payload, format, __VA_ARGS__)
+
+// Same as PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_DOMAIN, but applies a mask
+// to the token.
+#define PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD_MASK(                 \
+    domain, mask, payload, format, ...)                                  \
   do {                                                                   \
-    _PW_TOKENIZE_FORMAT_STRING(domain, format, __VA_ARGS__);             \
+    _PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__);       \
     _pw_tokenizer_ToGlobalHandlerWithPayload(                            \
         payload,                                                         \
         _pw_tokenizer_token,                                             \

diff --git a/pw_tokenizer/tokenize_test.cc b/pw_tokenizer/tokenize_test.cc
index 01c6bc5..9849a54 100644
--- a/pw_tokenizer/tokenize_test.cc
+++ b/pw_tokenizer/tokenize_test.cc

@@ -18,6 +18,7 @@
 #include <cstdint>
 #include <cstring>
 #include <iterator>
+#include <limits>
 
 #include "gtest/gtest.h"
 #include "pw_tokenizer/hash.h"
@@ -28,15 +29,17 @@
 namespace {
 
 // Constructs an array with the hashed string followed by the provided bytes.
-template <uint8_t... kData, size_t kSize>
-constexpr auto ExpectedData(const char (&format)[kSize]) {
-  const uint32_t value = Hash(format);
-  return std::array<uint8_t, sizeof(uint32_t) + sizeof...(kData)>{
+template <uint8_t... data, size_t size>
+constexpr auto ExpectedData(
+    const char (&format)[size],
+    uint32_t token_mask = std::numeric_limits<uint32_t>::max()) {
+  const uint32_t value = Hash(format) & token_mask;
+  return std::array<uint8_t, sizeof(uint32_t) + sizeof...(data)>{
       static_cast<uint8_t>(value & 0xff),
       static_cast<uint8_t>(value >> 8 & 0xff),
       static_cast<uint8_t>(value >> 16 & 0xff),
       static_cast<uint8_t>(value >> 24 & 0xff),
-      kData...};
+      data...};
 }
 
 TEST(TokenizeString, EmptyString_IsZero) {
@@ -65,6 +68,22 @@
   EXPECT_EQ(Hash("???"), TokenizedWithinClass().kThisToken);
 }
 
+TEST(TokenizeString, Mask) {
+  [[maybe_unused]] constexpr uint32_t token = PW_TOKENIZE_STRING("(O_o)");
+  [[maybe_unused]] constexpr uint32_t masked_1 =
+      PW_TOKENIZE_STRING_MASK("domain", 0xAAAAAAAA, "(O_o)");
+  [[maybe_unused]] constexpr uint32_t masked_2 =
+      PW_TOKENIZE_STRING_MASK("domain", 0x55555555, "(O_o)");
+  [[maybe_unused]] constexpr uint32_t masked_3 =
+      PW_TOKENIZE_STRING_MASK("domain", 0xFFFF0000, "(O_o)");
+
+  static_assert(token != masked_1 && token != masked_2 && token != masked_3);
+  static_assert(masked_1 != masked_2 && masked_2 != masked_3);
+  static_assert((token & 0xAAAAAAAA) == masked_1);
+  static_assert((token & 0x55555555) == masked_2);
+  static_assert((token & 0xFFFF0000) == masked_3);
+}
+
 // Use a function with a shorter name to test tokenizing __func__ and
 // __PRETTY_FUNCTION__.
 //
@@ -324,6 +343,23 @@
   EXPECT_EQ(std::memcmp(expected.data(), buffer_, expected.size()), 0);
 }
 
+TEST_F(TokenizeToBuffer, Mask) {
+  size_t message_size = sizeof(buffer_);
+
+  PW_TOKENIZE_TO_BUFFER_MASK("TEST_DOMAIN",
+                             0x0000FFFF,
+                             buffer_,
+                             &message_size,
+                             "The answer was: %s",
+                             "5432!");
+  constexpr std::array<uint8_t, 10> expected =
+      ExpectedData<5, '5', '4', '3', '2', '!'>("The answer was: %s",
+                                               0x0000FFFF);
+
+  ASSERT_EQ(expected.size(), message_size);
+  EXPECT_EQ(std::memcmp(expected.data(), buffer_, expected.size()), 0);
+}
+
 TEST_F(TokenizeToBuffer, TruncateArgs) {
   // Args that can't fit are dropped completely
   size_t message_size = 6;
@@ -472,6 +508,15 @@
   EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
 }
 
+TEST_F(TokenizeToCallback, Mask) {
+  PW_TOKENIZE_TO_CALLBACK_MASK(
+      "TEST_DOMAIN", 0x00000FFF, SetMessage, "The answer is: %s", "5432!");
+  constexpr std::array<uint8_t, 10> expected =
+      ExpectedData<5, '5', '4', '3', '2', '!'>("The answer is: %s", 0x00000FFF);
+  ASSERT_EQ(expected.size(), message_size_bytes_);
+  EXPECT_EQ(std::memcmp(expected.data(), message_, expected.size()), 0);
+}
+
 TEST_F(TokenizeToCallback, C_SequentialZigZag) {
   pw_tokenizer_ToCallbackTest_SequentialZigZag(SetMessage);
commit	4b62b896ad220fd316c95029c18bc79b54c95ec3	[log] [tgz]
author	Wyatt Hepler <hepler@google.com>	Thu Mar 04 10:03:43 2021 -0800
committer	CQ Bot Account <pigweed-scoped@luci-project-accounts.iam.gserviceaccount.com>	Fri Mar 12 20:02:44 2021 +0000
tree	db3f58bd073d7419fe58c07b863805fb13886530
parent	983f73fe65d500c517dd484fe670d0527c00063a [diff]