| .. _module-pw_tokenizer-token-databases: |
| |
| =============== |
| Token databases |
| =============== |
| .. pigweed-module-subpage:: |
| :name: pw_tokenizer |
| |
| Token databases store a mapping of tokens to the strings they represent. An ELF |
| file can be used as a token database, but it only contains the strings for its |
| exact build. A token database file aggregates tokens from multiple ELF files, so |
| that a single database can decode tokenized strings from any known ELF. |
| |
| Token databases contain the token, removal date (if any), and string for each |
| tokenized string. |
| |
| ---------------------- |
| Token database formats |
| ---------------------- |
| Three token database formats are supported: CSV, binary, and directory. Tokens |
| may also be read from ELF files or ``.a`` archives, but cannot be written to |
| these formats. |
| |
| CSV database format |
| =================== |
| The CSV database format has four columns: the token in hexadecimal, the removal |
| date (if any) in year-month-day format, the token domain, and the string |
| literal. The domain and string are quoted, and quote characters within the |
| domain or string are represented as two quote characters. |
| |
| This example database contains six strings, three of which have removal dates. |
| |
| .. code-block:: |
| |
| 141c35d5, ,"","The answer: ""%s""" |
| 2e668cd6,2019-12-25,"","Jello, world!" |
| 7b940e2a, ,"","Hello %s! %hd %e" |
| 851beeb6, ,"","%u %d" |
| 881436a0,2020-01-01,"","The answer is: %s" |
| e13b0f94,2020-04-01,"metrics","%llu" |
| |
Legacy CSV databases did not include the domain, so they had only three
columns: the token, the removal date, and the string. These databases are still
supported, but all of their tokens are in the default domain (``""``).
| |
| Binary database format |
| ====================== |
The binary database format consists of a 16-byte header followed by a series of
8-byte entries. Each entry stores the token and the removal date, which is
| 0xFFFFFFFF if there is none. The string literals are stored next in the same |
| order as the entries. Strings are stored with null terminators. See |
| `token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_ |
| for full details. |
| |
| The binary form of the CSV database is shown below. It contains the same |
| information, but in a more compact and easily processed form. It takes 141 B |
| compared with the CSV database's 211 B. |
| |
| .. code-block:: text |
| |
| [header] |
| 0x00: 454b4f54 0000534e TOKENS.. |
| 0x08: 00000006 00000000 ........ |
| |
| [entries] |
| 0x10: 141c35d5 ffffffff .5...... |
| 0x18: 2e668cd6 07e30c19 ..f..... |
| 0x20: 7b940e2a ffffffff *..{.... |
| 0x28: 851beeb6 ffffffff ........ |
| 0x30: 881436a0 07e40101 .6...... |
| 0x38: e13b0f94 07e40401 ..;..... |
| |
| [string table] |
| 0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22 The answer: "%s" |
| 0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48 .Jello, world!.H |
| 0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00 ello %s! %hd %e. |
| 0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72 %u %d.The answer |
| 0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00 is: %s.%llu. |
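
Removal dates are packed into the second word of each entry: in the dump above,
``07e30c19`` encodes 2019-12-25 (``0x07e3`` is 2019, ``0x0c`` is December, and
``0x19`` is 25), and ``ffffffff`` marks entries with no removal date.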
| |
| .. _module-pw_tokenizer-directory-database-format: |
| |
| Directory database format |
| ========================= |
pw_tokenizer can consume directories of CSV databases. A directory database is
searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all of
which are used for subsequent detokenization lookups.
| |
| An example directory database might look something like this: |
| |
| .. code-block:: text |
| |
| directory_token_database |
| ├── database.pw_tokenizer.csv |
| ├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv |
| └── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv |
| |
| This format is optimized for storage in a Git repository alongside source code. |
| The token database commands randomly generate unique file names for the CSVs in |
the database to prevent merge conflicts. Running the ``mark_removed`` or
``purge`` commands in the database CLI consolidates the files into a single
CSV.
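
A directory database is used with the command-line tools like any other
database. For example, adding tokens from a hypothetical ``out/app.elf`` build
artifact to the directory shown above might look like this:

.. code-block:: console

   $ ./database.py add --database directory_token_database/ out/app.elf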
| |
The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens. It identifies the latest CSV not present in the
provided ``<upstream_commit>``, and discards any tokens in that CSV that are
not among the newly added tokens. This helps keep temporary tokens (e.g. from
debug logs) out of the database.
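
A sketch of this mode, assuming the directory database from above and a
hypothetical upstream branch named ``origin/main``:

.. code-block:: console

   $ ./database.py add --database directory_token_database/ \
         --discard-temporary origin/main out/app.elf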
| |
| ELF section database format |
| =========================== |
| During compilation, pw_tokenizer stores its entries in an ELF section. Entries |
| are stored as a header followed by the domain and string. The format for these |
| entries is described below. |
| |
| .. literalinclude:: public/pw_tokenizer/internal/tokenize_string.h |
| :start-after: [pw_tokenizer-elf-entry] |
| :end-before: [pw_tokenizer-elf-entry] |
| :language: C |
| |
| JSON support |
| ============ |
| While pw_tokenizer doesn't specify a JSON database format, a token database can |
| be created from a JSON formatted array of strings. This is useful for side-band |
| token database generation for strings that are not embedded as parsable tokens |
| in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for |
| instructions on generating a token database from a JSON file. |
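
For example, a JSON file (hypothetically named ``strings.json``) simply lists
the strings to tokenize:

.. code-block:: json

   ["The answer is: %s", "Jello, world!"]

A database can then be created from it with the ``create`` command described
below:

.. code-block:: console

   $ ./database.py create --database tokens.csv strings.json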
| |
| .. _module-pw_tokenizer-managing-token-databases: |
| |
| ------------------------ |
| Managing token databases |
| ------------------------ |
| Token databases are managed with the ``database.py`` script. This script can be |
| used to extract tokens from compilation artifacts and manage database files. |
| Invoke ``database.py`` with ``-h`` for full usage information. |
| |
| An example ELF file with tokenized logs is provided at |
| ``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that |
| file to experiment with the ``database.py`` commands. |
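
For example, the following (with a hypothetical output file name) builds a CSV
database from the example binary using the ``create`` command described below:

.. code-block:: console

   $ ./database.py create --database example_tokens.csv \
         pw_tokenizer/py/example_binary_with_tokenized_strings.elf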
| |
| .. _module-pw_tokenizer-database-creation: |
| |
| Create a database |
| ================= |
| The ``create`` command makes a new token database from ELF files (.elf, .o, .so, |
| etc.), archives (.a), existing token databases (CSV or binary), or a JSON file |
| containing an array of strings. |
| |
| .. code-block:: console |
| |
| $ ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE... |
| |
| Two database output formats are supported: CSV and binary. Provide |
| ``--type binary`` to ``create`` to generate a binary database instead of the |
default CSV. CSV databases are great for checking into source control and for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently supports only binary databases.
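
For example, the following (with hypothetical file names) creates a binary
database from a single ELF:

.. code-block:: console

   $ ./database.py create --type binary --database tokens.bin out/app.elf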
| |
| .. _module-pw_tokenizer-update-token-database: |
| |
| Update a database |
| ================= |
| As new tokenized strings are added, update the database with the ``add`` |
| command. |
| |
| .. code-block:: console |
| |
| $ ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE... |
| |
This command adds new tokens from ELF files or other databases to the database.
For tokens already present in the database, the removal date, if any, is
updated to the latest.
| |
| A CSV token database can be checked into a source repository and updated as code |
| changes are made. The build system can invoke ``database.py`` to update the |
| database after each build. |
| |
| GN integration |
| ============== |
| Token databases may be updated or created as part of a GN build. The |
| ``pw_tokenizer_database`` template provided by |
| ``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized |
| strings database or creates a new database with artifacts from one or more GN |
| targets or other database files. |
| |
| To create a new database, set the ``create`` variable to the desired database |
| type (``"csv"`` or ``"binary"``). The database will be created in the output |
| directory. To update an existing database, provide the path to the database with |
| the ``database`` variable. |
| |
| .. code-block:: |
| |
| import("//build_overrides/pigweed.gni") |
| |
| import("$dir_pw_tokenizer/database.gni") |
| |
| pw_tokenizer_database("my_database") { |
| database = "database_in_the_source_tree.csv" |
| targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ] |
| input_databases = [ "other_database.csv" ] |
| } |
| |
Instead of specifying GN targets, paths or globs to output files may be
provided with the ``paths`` or ``optional_paths`` options.
| |
| .. code-block:: |
| |
| pw_tokenizer_database("my_database") { |
| database = "database_in_the_source_tree.csv" |
| deps = [ ":apps" ] |
| optional_paths = [ "$root_build_dir/**/*.elf" ] |
| } |
| |
| .. note:: |
| |
| The ``paths`` and ``optional_targets`` arguments do not add anything to |
| ``deps``, so there is no guarantee that the referenced artifacts will exist |
| when the database is updated. Provide ``targets`` or ``deps`` or build other |
| GN targets first if this is a concern. |
| |
| CMake integration |
| ================= |
| Token databases may be updated or created as part of a CMake build. The |
| ``pw_tokenizer_database`` template provided by |
| ``$dir_pw_tokenizer/database.cmake`` automatically updates an in-source tokenized |
| strings database or creates a new database with artifacts from a CMake target. |
| |
To create a new database, set the ``CREATE`` argument to the desired database
type (``csv`` or ``binary``). The database will be created in the output
directory.
| |
.. code-block::

   include("$dir_pw_tokenizer/database.cmake")

   pw_tokenizer_database(my_database
     CREATE binary
     TARGET my_target.ext
     DEPS ${deps_list}
   )
| |
To update an existing database, provide the path to the database with the
``DATABASE`` argument.
| |
.. code-block::

   pw_tokenizer_database(my_database
     DATABASE database_in_the_source_tree.csv
     TARGET my_target.ext
     DEPS ${deps_list}
   )
| |
| .. _module-pw_tokenizer-collisions: |
| |
| ---------------- |
| Token collisions |
| ---------------- |
| Tokens are calculated with a hash function. It is possible for different |
| strings to hash to the same token. When this happens, multiple strings will have |
| the same token in the database, and it may not be possible to unambiguously |
| decode a token. |
| |
| The detokenization tools attempt to resolve collisions automatically. Collisions |
| are resolved based on two things: |
| |
- whether the tokenized data matches the string's arguments (if any), and
- whether and when the string was marked as removed from the database.
| |
| Resolving collisions |
| ==================== |
| Collisions may occur occasionally. Run the command |
| ``python -m pw_tokenizer.database report <database>`` to see information about a |
| token database, including any collisions. |
| |
| If there are collisions, take the following steps to resolve them. |
| |
| - Change one of the colliding strings slightly to give it a new token. |
| - In C (not C++), artificial collisions may occur if strings longer than |
| ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider |
| setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See |
| ``pw_tokenizer/public/pw_tokenizer/config.h``. |
| - Run the ``mark_removed`` command with the latest version of the build |
| artifacts to mark missing strings as removed. This deprioritizes them in |
| collision resolution. |
| |
| .. code-block:: console |
| |
| $ python -m pw_tokenizer.database mark_removed --database <database> <ELF files> |
| |
| The ``purge`` command may be used to delete these tokens from the database. |
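
A sketch of the invocation, following the same pattern as ``mark_removed``:

.. code-block:: console

   $ python -m pw_tokenizer.database purge --database <database>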
| |
| Probability of collisions |
| ========================= |
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
| (this is known as the `birthday problem |
| <https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are |
| used for tokens, the probability of collisions increases substantially. |
| |
| This table shows the approximate number of strings that can be hashed to have a |
| 1% or 50% probability of at least one collision (assuming a uniform, random |
| hash). |
| |
| +-------+---------------------------------------+ |
| | Token | Collision probability by string count | |
| | bits +--------------------+------------------+ |
| | | 50% | 1% | |
| +=======+====================+==================+ |
| | 32 | 77000 | 9300 | |
| +-------+--------------------+------------------+ |
| | 31 | 54000 | 6600 | |
| +-------+--------------------+------------------+ |
| | 24 | 4800 | 580 | |
| +-------+--------------------+------------------+ |
| | 16 | 300 | 36 | |
| +-------+--------------------+------------------+ |
| | 8 | 19 | 3 | |
| +-------+--------------------+------------------+ |
| |
| Keep this table in mind when masking tokens (see |
| :ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when |
| tokenizing a small set of strings, such as module names, but won't be suitable |
| for large sets of strings, like log messages. |