.. _module-pw_tokenizer-token-databases:
===============
Token databases
===============
.. pigweed-module-subpage::
:name: pw_tokenizer
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.
Token databases contain the token, removal date (if any), and string for each
tokenized string.
----------------------
Token database formats
----------------------
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.
CSV database format
===================
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.
This example database contains six strings, three of which have removal dates.
.. code-block::
141c35d5, ,"The answer: ""%s"""
2e668cd6,2019-12-25,"Jello, world!"
7b940e2a, ,"Hello %s! %hd %e"
851beeb6, ,"%u %d"
881436a0,2020-01-01,"The answer is: %s"
e13b0f94,2020-04-01,"%llu"
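
A CSV database is easy to inspect and process with standard tooling. As a rough
illustration (this is not part of pw_tokenizer's Python API), the rows could be
read with Python's built-in ``csv`` module:

.. code-block:: python

   import csv

   def read_csv_database(path):
       """Reads (token, removal date, string) entries from a CSV token database."""
       entries = []
       with open(path, newline='') as fd:
           for token, removal_date, string in csv.reader(fd):
               # A blank date column means the token has not been removed.
               entries.append((int(token, 16), removal_date.strip() or None, string))
       return entries
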
Binary database format
======================
The binary database format consists of a 16-byte header followed by a series of
8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next, in the same
order as the entries, each terminated with a null character. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.
The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.
.. code-block:: text
[header]
0x00: 454b4f54 0000534e TOKENS..
0x08: 00000006 00000000 ........
[entries]
0x10: 141c35d5 ffffffff .5......
0x18: 2e668cd6 07e30c19 ..f.....
0x20: 7b940e2a ffffffff *..{....
0x28: 851beeb6 ffffffff ........
0x30: 881436a0 07e40101 .6......
0x38: e13b0f94 07e40401 ..;.....
[string table]
0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22 The answer: "%s"
0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48 .Jello, world!.H
0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00 ello %s! %hd %e.
0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72 %u %d.The answer
0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00 is: %s.%llu.
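
As a minimal sketch, the layout shown in the dump above (a ``TOKENS`` magic
string and a little-endian entry count in the header, 8-byte token/date entries,
then null-terminated strings) can be read with Python's ``struct`` module.
``token_database.h`` remains the authoritative definition of the format.

.. code-block:: python

   import struct

   def read_binary_database(path):
       """Illustrative reader for the binary token database layout shown above."""
       with open(path, 'rb') as fd:
           data = fd.read()

       # 16-byte header: "TOKENS" magic and a 32-bit entry count; the remaining
       # header bytes are not interpreted here.
       magic, entry_count = struct.unpack_from('<6s2xI4x', data, 0)
       assert magic == b'TOKENS'

       # 8-byte entries: 32-bit token, then 32-bit removal date (0xFFFFFFFF if none).
       entries = [struct.unpack_from('<II', data, 16 + 8 * i) for i in range(entry_count)]

       # Null-terminated string literals follow, in the same order as the entries.
       strings = data[16 + 8 * entry_count:].split(b'\x00')[:entry_count]

       return [(token, date, s.decode()) for (token, date), s in zip(entries, strings)]
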
.. _module-pw_tokenizer-directory-database-format:
Directory database format
=========================
pw_tokenizer can consume directories of CSV databases. A directory database
will be searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all
of which will be used for subsequent detokenization lookups.
An example directory database might look something like this:
.. code-block:: text
directory_token_database
├── database.pw_tokenizer.csv
├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv
└── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv
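
Collecting the CSVs for a directory database only takes a recursive glob; for
example (a sketch, not part of pw_tokenizer's API):

.. code-block:: python

   from pathlib import Path

   def find_database_csvs(root):
       # Recursively find every *.pw_tokenizer.csv file in the directory database.
       return sorted(Path(root).rglob('*.pw_tokenizer.csv'))
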
This format is optimized for storage in a Git repository alongside source code.
The token database commands randomly generate unique file names for the CSVs in
the database to prevent merge conflicts. Running the ``mark_removed`` or
``purge`` command in the database CLI consolidates the files into a single CSV.
The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens. It identifies the latest CSV not present in the
provided ``<upstream_commit>``; any tokens present in that CSV that are not among
the newly added tokens are discarded. This helps keep temporary tokens (e.g.
from debug logs) out of the database.
JSON support
============
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON-formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.
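
As a sketch, a suitable strings file could be produced with Python's ``json``
module; the file name and strings below are placeholders:

.. code-block:: python

   import json

   # Hypothetical strings collected outside of the compiled binaries.
   strings = ['Hello %s! %hd %e', 'The answer is: %s']

   with open('sideband_strings.json', 'w') as fd:
       json.dump(strings, fd)

The resulting file can then be passed to the database creation command like any
other input.
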
.. _module-pw_tokenizer-managing-token-databases:
------------------------
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.
An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.
.. _module-pw_tokenizer-database-creation:
Create a database
=================
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.
.. code-block:: sh
./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...
Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control or for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently only supports binary databases.
.. _module-pw_tokenizer-update-token-database:
Update a database
=================
As new tokenized strings are added, update the database with the ``add``
command.
.. code-block:: sh
./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...
This command adds new tokens from ELF files or other databases to the database.
Adding tokens already present in the database updates the date removed, if any,
to the latest.
A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.
GN integration
==============
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.
To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.
.. code-block::
import("//build_overrides/pigweed.gni")
import("$dir_pw_tokenizer/database.gni")
pw_tokenizer_database("my_database") {
database = "database_in_the_source_tree.csv"
targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
input_databases = [ "other_database.csv" ]
}
Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.
.. code-block::
pw_tokenizer_database("my_database") {
database = "database_in_the_source_tree.csv"
deps = [ ":apps" ]
optional_paths = [ "$root_build_dir/**/*.elf" ]
}
.. note::
The ``paths`` and ``optional_targets`` arguments do not add anything to
``deps``, so there is no guarantee that the referenced artifacts will exist
when the database is updated. Provide ``targets`` or ``deps`` or build other
GN targets first if this is a concern.
CMake integration
=================
Token databases may be updated or created as part of a CMake build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.cmake`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from a CMake target.
To create a new database, set the ``CREATE`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory.
.. code-block::
include("$dir_pw_tokenizer/database.cmake")
   pw_tokenizer_database(my_database
     CREATE binary
     TARGET my_target.ext
     DEPS ${deps_list}
   )
To update an existing database, provide the path to the database with
the ``DATABASE`` argument.
.. code-block::
   pw_tokenizer_database(my_database
     DATABASE database_in_the_source_tree.csv
     TARGET my_target.ext
     DEPS ${deps_list}
   )
.. _module-pw_tokenizer-collisions:
----------------
Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.
The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:
- whether the tokenized data matches the string's arguments (if any), and
- if and when the string was marked as having been removed from the database.
Resolving collisions
====================
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.
If there are collisions, take the following steps to resolve them.
- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider
setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
artifacts to mark missing strings as removed. This deprioritizes them in
collision resolution.
.. code-block:: sh
python -m pw_tokenizer.database mark_removed --database <database> <ELF files>
The ``purge`` command may be used to delete these tokens from the database.
Probability of collisions
=========================
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
(this is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.
This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).
+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits +--------------------+------------------+
| | 50% | 1% |
+=======+====================+==================+
| 32 | 77000 | 9300 |
+-------+--------------------+------------------+
| 31 | 54000 | 6600 |
+-------+--------------------+------------------+
| 24 | 4800 | 580 |
+-------+--------------------+------------------+
| 16 | 300 | 36 |
+-------+--------------------+------------------+
| 8 | 19 | 3 |
+-------+--------------------+------------------+
Keep this table in mind when masking tokens (see
:ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when
tokenizing a small set of strings, such as module names, but won't be suitable
for large sets of strings, like log messages.
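
The string counts in this table follow from the standard birthday-problem
approximation ``n ≈ sqrt(2 * 2**bits * ln(1 / (1 - p)))``, where ``p`` is the
collision probability. A few lines of Python reproduce them (the approximation
is rough for very small token sizes):

.. code-block:: python

   import math

   def strings_for_collision_probability(token_bits, probability):
       """Approximate string count at which at least one collision is this likely."""
       return math.sqrt(2 * 2**token_bits * math.log(1 / (1 - probability)))

   for bits in (32, 31, 24, 16, 8):
       print(bits,
             round(strings_for_collision_probability(bits, 0.50)),
             round(strings_for_collision_probability(bits, 0.01)))
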