
 .. _module-pw_tokenizer-proto:

 ------------------------------------
 Tokenized fields in protocol buffers
 ------------------------------------
 Text may be represented in a few different ways:

 - Plain ASCII or UTF-8 text (``This is plain text``)
 - Base64-encoded tokenized message (``$ibafcA==``)
 - Binary-encoded tokenized message (``89 b6 9f 70``)
 - Little-endian 32-bit integer token (``0x709fb689``)

 ``pw_tokenizer`` provides tools for working with protobuf fields that may
 contain tokenized text.
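
 The three tokenized forms listed above are different encodings of the same
 32-bit token. As a minimal sketch using only the Python standard library (it
 does not use ``pw_tokenizer`` itself, and the variable names are illustrative),
 the conversions between the example values can be checked as follows:

 .. code-block:: python

   import base64
   import struct

   binary_message = bytes.fromhex("89b69f70")  # Binary-encoded tokenized message

   # The token is the first four bytes, read as a little-endian 32-bit integer.
   (token,) = struct.unpack("<I", binary_message[:4])

   # The Base64 form is "$" followed by the Base64-encoded binary message.
   base64_message = "$" + base64.b64encode(binary_message).decode("ascii")

   print(hex(token))      # 0x709fb689
   print(base64_message)  # $ibafcA==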

 Tokenized field protobuf option
 ===============================
 ``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
 This option may be applied to a protobuf field to indicate that it may contain a
 tokenized string. A string that is optionally tokenized is represented with a
 single ``bytes`` field annotated with ``(pw.tokenizer.format) =
 TOKENIZATION_OPTIONAL``.

 For example, the following protobuf has one field that may contain a tokenized
 string.

 .. code-block:: protobuf

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }

 Decoding optionally tokenized strings
 =====================================
 The encoding used for an optionally tokenized field is not recorded in the
 protobuf. Despite this, the text can reliably be decoded. This is accomplished
 by attempting to decode the field as binary or Base64 tokenized data before
 treating it like plain text.

 The following diagram describes the decoding process for optionally tokenized
 fields in detail.

 .. mermaid::

   flowchart TD
      start([Received bytes]) --> binary

      binary[Decode as<br>binary tokenized] --> binary_ok
      binary_ok{Detokenizes<br>successfully?} -->|no| utf8
      binary_ok -->|yes| done_binary([Display decoded binary])

      utf8[Decode as UTF-8] --> utf8_ok
      utf8_ok{Valid UTF-8?} -->|no| base64_encode
      utf8_ok -->|yes| base64

      base64_encode[Encode as<br>tokenized Base64] --> display
      display([Display encoded Base64])

      base64[Decode as<br>Base64 tokenized] --> base64_ok

      base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
      base64_ok -->|yes| base64_results

      is_plain_text{Text is<br>printable?} -->|no| base64_encode
      is_plain_text-->|yes| plain_text

      base64_results([Display decoded Base64])
      plain_text([Display text])
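
 The flowchart can be approximated in Python as shown here. This is a simplified
 sketch, not the ``pw_tokenizer`` implementation: ``TOKEN_DATABASE`` is a
 hypothetical stand-in for a real token database, its single entry is made up
 for illustration, message arguments are ignored, and all function names are
 assumptions.

 .. code-block:: python

   import base64
   import re
   import struct
   from typing import Optional, Tuple

   # Hypothetical token database: 32-bit token -> string. The entry is invented
   # for illustration and is not a real hash of the string.
   TOKEN_DATABASE = {0x709FB689: "Example message"}

   # "$" followed by Base64 characters and optional padding.
   _BASE64_MESSAGE = re.compile(r"\$([A-Za-z0-9+/]+={0,2})")


   def detokenize_binary(data: bytes) -> Optional[str]:
       """Returns the string for an argument-free binary message, if known."""
       if len(data) != 4:  # This sketch only handles a bare 4-byte token.
           return None
       (token,) = struct.unpack("<I", data)
       return TOKEN_DATABASE.get(token)


   def detokenize_base64(text: str) -> Tuple[str, bool]:
       """Replaces known $-prefixed messages; reports whether any was replaced."""
       replaced = False

       def replace(match: "re.Match[str]") -> str:
           nonlocal replaced
           try:
               raw = base64.b64decode(match.group(1), validate=True)
           except ValueError:  # binascii.Error is a ValueError subclass.
               return match.group(0)
           decoded = detokenize_binary(raw)
           if decoded is None:
               return match.group(0)
           replaced = True
           return decoded

       return _BASE64_MESSAGE.sub(replace, text), replaced


   def encode_base64(data: bytes) -> str:
       return "$" + base64.b64encode(data).decode("ascii")


   def decode_optionally_tokenized(data: bytes) -> str:
       # 1. Try binary tokenized data first.
       decoded = detokenize_binary(data)
       if decoded is not None:
           return decoded

       # 2. Bytes that are not valid UTF-8 are displayed as tokenized Base64.
       try:
           text = data.decode("utf-8")
       except UnicodeDecodeError:
           return encode_base64(data)

       # 3. Try to detokenize Base64 messages embedded in the text.
       result, replaced = detokenize_base64(text)
       if replaced:
           return result

       # 4. Printable text is shown as is; anything else becomes Base64.
       #    Note that str.isprintable() treats newlines as non-printable.
       return text if text.isprintable() else encode_base64(data)

 With this sketch, ``decode_optionally_tokenized(b"$ibafcA==")`` and
 ``decode_optionally_tokenized(bytes.fromhex("89b69f70"))`` both return the
 database string, while ``b"plain log line"`` is returned unchanged as text.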

 Potential decoding problems
 ---------------------------
 The decoding process for optionally tokenized fields will yield correct results
 in almost every situation. In rare circumstances decoding can fail, but these
 failures can be avoided with a low-overhead mitigation if desired.

 There are two ways in which the decoding process may fail.

 Accidentally interpreting plain text as tokenized binary
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 If a plain-text string happens to decode as a binary tokenized message, the
 incorrect message could be displayed. This is very unlikely to occur. While many
 tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely
 that a device will happen to log one of these strings as plain text. The
 overwhelming majority of these strings will be nonsense.

 An implementation can guard against this improbable situation by appending 0xFF
 (or another byte that is never valid in UTF-8) to binary tokenized data that
 happens to be valid UTF-8 (or to all binary tokenized messages, if desired).
 When decoding, if there is an extra 0xFF byte, it is discarded.
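
 The 0xFF mitigation can be sketched as a pair of helpers. The function names and
 the ``always`` parameter are hypothetical and are not part of ``pw_tokenizer``;
 the sketch only illustrates the idea.

 .. code-block:: python

   from typing import Tuple

   MARKER = b"\xff"  # 0xFF never appears in valid UTF-8.


   def is_valid_utf8(data: bytes) -> bool:
       try:
           data.decode("utf-8")
       except UnicodeDecodeError:
           return False
       return True


   def mark_binary_tokenized(data: bytes, always: bool = False) -> bytes:
       """Appends 0xFF to binary tokenized data that could pass for text."""
       if always or is_valid_utf8(data):
           return data + MARKER
       return data


   def strip_marker(data: bytes) -> Tuple[bytes, bool]:
       """Removes a trailing 0xFF; the flag reports whether one was present."""
       if data.endswith(MARKER):
           return data[:-1], True
       return data, False

 Marked data can no longer be mistaken for plain text, because
 ``is_valid_utf8(mark_binary_tokenized(b"a-D1"))`` is false. This sketch does not
 address payloads that legitimately end in 0xFF when the marker is only added
 conditionally; marking every binary message (``always=True``) avoids that
 ambiguity.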

 Displaying undecoded binary as plain text instead of Base64
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 If a message fails to decode as binary tokenized and it is not valid UTF-8, it
 is displayed as tokenized Base64. This makes it easily recognizable as a
 tokenized message and makes it simple to decode later from the text output (for
 example, with an updated token database).

 A binary message for which the token is not known may coincidentally be valid
 UTF-8 or ASCII. 6.25% of 4-byte sequences are composed only of ASCII characters.
 When decoding with an out-of-date token database, it is possible that some
 binary tokenized messages will be displayed as plain text rather than tokenized
 Base64.
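
 The 6.25% figure follows from each byte having a 128 in 256 chance of falling in
 the ASCII range (0x00 to 0x7F), assuming the four bytes are uniformly
 distributed and independent:

 .. code-block:: python

   ASCII_BYTE_VALUES = 128  # 0x00 through 0x7F
   ALL_BYTE_VALUES = 256

   probability_all_ascii = (ASCII_BYTE_VALUES / ALL_BYTE_VALUES) ** 4
   print(f"{probability_all_ascii:.2%}")  # 6.25%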

 This situation is likely to occur, but should be infrequent. Even if it does
 happen, it is not a serious issue. A very small number of strings will be
 displayed incorrectly, but these strings cannot be decoded anyway. One nonsense
 string (e.g. ``a-D1``) would be displayed instead of another (``$YS1EMQ==``).
 Updating the token database would resolve the issue, though the non-Base64 logs
 would be difficult to decode later from a log file.
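
 The two renderings in this example can be reproduced with the Python standard
 library. This sketch is only illustrative; it assumes the undecodable message
 consists of the four token bytes ``a-D1`` with no arguments.

 .. code-block:: python

   import base64

   unknown_message = b"a-D1"  # Four token bytes that happen to be ASCII

   as_plain_text = unknown_message.decode("utf-8")
   as_base64 = "$" + base64.b64encode(unknown_message).decode("ascii")

   print(as_plain_text)  # a-D1
   print(as_base64)      # $YS1EMQ==

   # The Base64 form is identifiable by its "$" prefix and round-trips to the
   # original bytes, so it can be detokenized once the database is updated.
   assert base64.b64decode(as_base64[1:]) == unknown_message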

 This situation can be avoided with the same approach described in
 `Accidentally interpreting plain text as tokenized binary`_. Appending a byte
 that is never valid in UTF-8, such as 0xFF, prevents the undecoded binary
 message from being interpreted as plain text.

 Python library
 ==============
 The ``pw_tokenizer.proto`` module defines functions that may be used to
 detokenize protobuf objects in Python. The function
 :func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields annotated as
 tokenized, replacing them with their detokenized version. For example:

 .. code-block:: python

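   import pw_tokenizer
   import pw_tokenizer.proto

   # some_database and SomeMessage are placeholders for a token database and a
   # generated protobuf class with a field annotated as optionally tokenized.
   my_detokenizer = pw_tokenizer.Detokenizer(some_database)

   my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
   pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

   assert my_message.tokenized_field == b'The detokenized string! Cool!'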

 pw_tokenizer.proto
 ------------------
 .. automodule:: pw_tokenizer.proto
   :members: