The PSA crypto subsystem includes a persistent key store. It is possible to create a persistent key and read it back later. This must work even if the underlying storage exhibits non-nominal behavior. In this document, resilience means correct behavior of the key store even under if the underlying platform behaves in a non-nominal, but still partially controlled way.
At this point, we are only concerned about one specific form of resilience: to a system crash or power loss. That is, we assume that the underlying platform behaves nominally, except that occasionally it may restart. In the field, this can happen due to a sudden loss of power.
This document explores the problem space, defines a library design and a test design.
Goal: PSA Crypto API functions are atomic and committing.
Atomic means that when an application calls an API function, as far as the application is concerned, at any given point in time, the system is either in a state where the function has not started yet, or in a state where the function has returned. The application never needs to worry about an intermediate state.
Committing means that when a function returns, the data has been written to the persistent storage. As a consequence, if the system restarts during a sequence of storage modifications $M_1, M_2, \ldots, M_n$, we know that when the system restarts, a prefix of the sequence has been performed. For example, there will never be a situation where $M_2$ has been performed but not $M_1$.
The committing property is important not only for sequences of operations, but also when reporting the result of an operation to an external system. For example, if a key creation function in the PSA Crypto API reports to the application that a key has been created, and the application reports to a server that the key has been created, it is guaranteed that the key exists even if the system restarts.
PSA relies on a PSA ITS (Internal Trusted Storage) interface, which exposes a simple API. There are two functions to modify files:
set()
writes a whole file (either creating it, or replacing the previous content).remove()
removes a file (returning a specific error code if the file does not exist).Assumption: the underlying ITS functions are atomic and committing.
Since the underlying functions are atomic, the content of a file is always a version that was previously passed to set()
. We do not try to handle the case where a file might be partially written.
For a transparent key, all key management operations (creation or destruction) on persistent keys rely on a single call to the underlying storage (set()
for a key creation, remove()
for a key destruction). This also holds for an opaque key stored in a secure element that does not have its own key store: in this case, the core stores a wrapped (i.e. encrypted) copy of the key material, but this does not impact how the core interacts with the storage. Other API functions do not modify the storage.
The following case requires extra work related to resilience:
As a consequence, apart from the listed cases, the API calls inherit directly from the resilience properties of the underyling storage. We do not need to take any special precautions in the library design, and we do not need to perform any testing of resilience for transparent keys.
(This section was last updated for Mbed TLS 3.4.0 implementing PSA Crypto API 1.1.)
In this section, we use “(stateful) secure element key” to mean a key stored in a stateful secure element, i.e. a secure element that stores keys. This excludes keys in a stateleess secure element for which the core stores a wrapped copy of the key. We study the problem of how key management in stateful secure elements interacts with storage and explore the design space.
Assumption: driver calls for key management in stateful secure elements are atomic and committing.
(For stateless secure elements, this assumption is vacuously true.)
For a secure element key, key management requires a commitment on both sites. For example, consider a successful key creation operation:
If the core loses power between steps 1 and 2, the key does not exist yet. This is fine from an application‘s perspective since the core has not committed to the key’s existence, but the core needs to take care not to leave resources in storage that are related to the non-existent key. If the core loses power between steps 2 and 3, the key exists in the secure element. From an application's perspective, the core may either report that the key exists or that it does not exist, but in the latter case, the core needs to free the key in the secure element, to avoid leaving behind inaccessible resources.
As a consequence, the content of the storage cannot remain the same between the end of step 1 and the end of step 3, since the core must behave differently depending on whether step 2 has taken place.
Accomplishing a transaction across system boundaries is a well-known problem in database management, with a well-known solution: two-phase commit.
With a key in a stateful secure element, a successful creation process goes as follows (see “Key management in a secure element with storage” in the driver interface specification):
"allocate_key"
entry point.If there is a loss of power:
Key destruction goes as follows:
"destroy_key"
entry point.If there is a loss of power:
In both cases, upon restart, the core needs to perform a transaction recovery. When a power loss happens, the core decides whether to commit or abort the transaction.
Note that the analysis in this section assumes that the driver does not update its persistent state during a key management operation (or at least not in a way that is influences the key management process — for example, it might renew an authorization token).
We assume that power failures are rare. Therefore we will primarily optimize for the normal case. Transaction recovery needs to be practical, but does not have to be fully optimized.
The main quantity we will optimize for is the number of storage updates in the nominal case. This is good for performance because storage writes are likely to dominate the runtime in some hardware configurations where storage writes are slow and communication with the secure element is fast, for key management operations that require a small amount of computation. In addition, minimizing the number of storage updates is good for the longevity of flash media.
The PSA ITS API does not support enumerating files in storage: an ITS call can only access one file identifier. Therefore transaction recovery cannot be done by traversing files whose name is or encodes the key identifier. It must start by traversing a small number of files whose names are independent of the key identifiers involved.
Per the assumptions on the underlying file storage, each atomic operation in the internal storage concerns a single file: either removing it, or setting its content. Furthermore there is no way to enumerate the files in storage.
A key creation function must transform the internal storage from a state where file id
does not exist, to a state where file id
exists and has its desired final content (containing the key attributes and the driver's key identifier). The situation is similar with key destruction, except that the initial and final states are exchanged. Neither the initial state nor the final state reference id
otherwise.
For a key that is not in a stateful element, the transaction consists of a single write operation. As discussed previously, this is not possible with a stateful secure element because the state of the internal storage needs to change both before and after the state change in the secure element. No other single-write algorithm works.
If there is a power failure around the time of changing the state of the secure element, there must be information in the internal storage that indicates that key id
has a transaction in progress. The file id
cannot be used for this purpose because there is no way to enumerate all keys (and even if there was, it would not be practical). Therefore the transaction will need to modify some other file t
with a fixed name (a name that doesn't depend on the key). Since the final system state should be identical to the initial state except for the file id
, the minimum number of storage operations for a transaction is 3:
t
referencing id
.id
.t
to its initial state.The strategies discussed in the overview above follow this pattern, with t
being the file containing the transaction list that the recovery consults. We have just proved that this pattern is optimal.
Note that this pattern requires the state of id
to be modified only once. In particular, if a key management involves writing an intermediate state for id
before modifying the secure element state and writing a different state after that, this will require a total of 4 updates to internal storage. Since we want to minimize the number of storage updates, we will not explore designs that involved updating id
twice or more.
When the core starts, it needs to know about transaction(s) that need to be resumed. This information will be stored in a persistent “transaction list”, with one entry per key. In this section, we explore recovery strategies, and we determine what the transaction list needs to contain as well as when it needs to be updated. Other sections will explore the format of the transaction list, as well as how many keys it needs to contain.
There are four cases for recovery when a transaction is in progress. In each case, the core can either decide to commit the transaction (which may require replaying the interrupted part) or abort it (which may require a rewind in the secure element). It may call the secure element driver's "get_key_attributes"
entry point to find out whether the key is present.
"allocate_key"
which will in general changing the key's driver identifier, which complicates state management in the core. Given the likely complexity, we exclude this strategy.From the analysis above, assuming that all keys are treated in the same way, there are 4 possible strategies.
"get_key_attributes"
entry point. Recovery means resuming the operation where it left off. For key creation, this means that the key metadata needs to be saved before calling the secure element's key creation entry point."get_key_attributes"
entry point. Compared to always following the state of the secure element, this has the advantage of maximizing the chance that a command to destroy key material is effective. Compared to always destroying the key, this has a performance advantage if a key creation is interrupted. These do not seem like decisive advantages, so we will not consider this strategy further.Requiring the driver to have a "get_key_attributes"
entry point is potentially problematic because some secure elements don‘t have room to store key attributes: a key slot always exists, and it’s up to the user to remember what, if anything, they put in it. The driver has to remember anyway, so that it can find a free slot when creating a key. But with a recovery strategy that doesn‘t involve a "get_key_attributes"
entry point, the driver design is easier: the driver doesn’t need to protect the information about slots in use against a power failure, the core takes care of that.
Each entry in the transaction list contains the API key identifier, the key lifetime (or at least the location), the driver key identifier (not constant-size), and an indication of whether the key is being created or destroyed.
For key creation, we have all the information to store in the key file once the "allocate_key"
call returns. We must store all the information that will go in the key file before calling the driver's key creation entry point. Therefore the normal sequence of operations is:
"allocate_key"
entry point.During recovery, for each key in the transaction list that was being created:
For key destruction, we need to preserve the key file until after the key has been destroyed. Therefore the normal sequence of operations is:
"destroy_key"
entry point.During recovery, for each key in the transaction list that was being created:
"destroy_key"
entry point, then remove the key file, and finally remote the key from the transaction lits.Each entry in the transaction list contains the API key identifier, the key lifetime (or at least the location), and the driver key identifier (not constant-size).
For key creation, we do not need to store the key's metadata until it has been created in the secure element. Therefore the normal sequence of operations is:
"allocate_key"
entry point.For key destruction, we can remove the key file before contacting the secure element. Therefore the normal sequence of operations is:
"destroy_key"
entry point.Recovery means removing all traces of all keys on the transaction list. This means following the destruction process, starting after the point where the key has been added to the transaction list, and ignoring any failure of a removal action if the item to remove does not exist:
DOES_NOT_EXIST
as a success."destroy_key"
entry point, treating DOES_NOT_EXIST
as a success.We can modify the always-destroy strategy to make the transaction file simpler: if we ensure that the key file always exists if the key exists in the secure element, then the transaction list does not need to include the driver key identifier: it can be read from the key file.
For key creation, we need to store the key's metadata before creating in the secure element. Therefore the normal sequence of operations is:
"allocate_key"
entry point.For key destruction, we need to contact the secure element before removing the key file. Therefore the normal sequence of operations is:
"destroy_key"
entry point.Recovery means removing all traces of all keys on the transaction list. This means following the destruction process, starting after the point where the key has been added to the transaction list, and ignoring any failure of a removal action if the item to remove does not exist:
"destroy_key"
entry point, treating DOES_NOT_EXIST
as a success.DOES_NOT_EXIST
as a success.Compared with the basic always-destroy strategy:
"allocate_key"
has no analog in the stateless secure element or transparent cases.The actions of the secure element driver may themselves be non-atomic. So the driver must be given a chance to perform recovery.
To simplify the design of the driver, the core should guarantee that the driver will know if a transaction was in progress and the core cannot be sure about the state of the secure element. Merely calling a read-only entry point such as "get_key_attributes"
does not provide enough information to the driver for it to know that it should actively perform recovery related to that key.
This gives an advantage to the “always destroy” strategy. Under this strategy, if the key might be in a transitional state, the core will request a key destruction from the driver. This means that, if the driver has per-key auxiliary data to clean up, it can bundle that as part of the key's destruction.
In this section, we discuss how to test non-atomic processes that must implement an atomic and committing interface. As discussed in “Overview of API functions”, this concerns key management in stateful secure elements.
Non-atomic processes consist of a series of atomic, committing steps.
Our general strategy to test them is as follows: every time there is a modification of persistent state, either in storage or in the (simulated) secure element, try both the nominal case and simulating a power loss. If a power loss occurs, restart the system (i.e. clean up and call psa_crypto_init()
), and check that the system ends up in a consistent state.
Note that this creates a binary tree of possibilities: after each state modification, there may or may not be a restart, and after that different state modifications may occur, each of which may or may not be followed by a restart.
For example, consider testing of one key creation operation (see “Overview of two-phase commit with stateful secure elements”, under the simplifying assumption that each storage update step, as well as the recovery after a restart, each make a single (atomic) storage modification and no secure element access. The nominal case consists of three state modifications: storage modification (start transaction), creation on the secure element, storage modification (commit transaction). We need to test the following sequences:
If, for example, recovery consists of two atomic steps, the tree of possibilities expands and may be infinite:
In order to limit the possibilities, we need to make some assumptions about the recovery step. For example, if we have confidence that recovery step 1 is idempotent (i.e. doing it twice is the same as doing it once), we don't need to test what happens in execution sequences that take recovery step 1 more than twice in a row.
We introduce an abstraction level in transaction recovery:
This separation of concerns greatly facilitates testing, since it is now split into two parts:
Another benefit of this approach is that it is easier to specify and test what happens if the library is updated on a device with leftovers from an interrupted transaction. We will require and test that the new version of the library supports recovery of the old library's states, without worrying how those states were reached.
As discussed in the section “Recovery strategies”, the information about active transactions is stored in a transaction list file. The name of the transaction list file does not depend on the identifiers of the keys in the list, but there may be more than one transaction list, for example one per secure element. If so, each transaction list can be considered independently.
When no transaction is in progress, the transaction list does not exist, or is empty. The empty case must be supported because this is the initial state of the filesystem. When no transaction is in progress, the state of the secure element must be consistent with references to keys in that secure element contained in key files. More generally, if a key is not in the transaction list, then the key must be present in the secure element if and only if the key file is in the internal storage.
For the purposes of the state invariant, it matters whether the transaction list file contains the driver key identifier, or if the driver key identifier is only stored in the key file. This is because the core needs to know the driver key id in order to access the secure element. If the transaction list does not contain the driver key identifier, and the key file does not exist, the key must not be present in the secure element.
We thus have two scenarios, each with their own invariant: one where the transaction list contains only key identifiers, and one where it also contains the secure element's key identifier (as well as the location of the secure element if this is not encoded in the name of the transaction list file).
Invariants:
id
does not exist, then no resources corresponding to that key are in a secure element. This holds whether id
is in the transaction list or not.id
is not in the transaction list and the file id
exists and references a key in a stateful secure element, then the key is present in the secure element.If id
is in the transaction list and the file id
exists, the key may or may not be present in the secure element.
The invariant imposes constraints on the order of operations for the two-phase commit: key creation must create id
before calling the secure element‘s key creation entry point, and key destruction must remove id
after calling the secure element’s key destruction entry point.
For recovery:
id
does not exist, then nothing needs to be done for recovery, other than removing id
from the transaction list.id
exists:DOES_NOT_EXIST
error as a success), then remove id
.id
. If not, remove id
from the internal storage.Invariants:
id
is not in the transaction list and the file id
does not exist, then no resources corresponding to that key are in a secure element.id
is not in the transaction list and the file id
exists, then the key is present in the secure element.If id
is in the transaction list, neither the state of id
in the internal storage nor the state of the key in the secure element is known.
For recovery:
id
does not exist, then destroy the key in the secure element (treating a DOES_NOT_EXIST
error as a success).id
exists:DOES_NOT_EXIST
error as a success), then remove id
.id
. If not, remove id
from the internal storage.For a given key, we have to consider three a priori independent boolean states:
There is full coverage for one key if we have tests of recovery for the states among these $2^3 = 8$ possibilities that satisfy the storage invariant.
In addition, testing should adequately cover the case of multiple keys in the transaction list. How much coverage is adequate depends on the layout of the list as well as white-box considerations of how the list is manipulated.
Based on “Optimization considerations for transactions”, we choose a transaction algorithm that consists in the following operations:
In addition, before or after step 2, create or remove the key file in the internal storage.
In order to conveniently support multiple transactions at the same time, we pick the simplest possible layout for the transaction list: a simple array of key identifiers. Since the transaction list does not contain the driver key identifier:
This choice of algorithm does not require the secure element driver to have a "get_key_attributes"
entry point.
The storage invariant is as follows:
id
does not exist, then no resources corresponding to that key are in a secure element. This holds whether id
is in the transaction list or not.id
is not in the transaction list and the file id
exists and references a key in a stateful secure element, then the key is present in the secure element.id
is in the transaction list and a key exists by that identifier, the key's location is a stateful secure element.To assist secure element drivers with recovery, we pick the always-destroy recovery strategy with a simple transaction file. The the recovery process is as follows:
id
does not exist, then nothing needs to be done for recovery, other than removing id
from the transaction list.id
exists, call the secure element's key destruction entry point (treating a DOES_NOT_EXIST
error as a success), then remove id
.This section only concerns stateful secure elements as discussed in “Designing key management for secure element keys”, i.e. secure elements with an "allocate_key"
entry point. The design follows the general principle described in “Overview of two-phase commit with stateful secure elements” and the specific choices justified in “Choice of a transaction design”.
The transaction list is a simple array of key identifiers.
To add a key identifier to the list:
To remove a key identifier from the list:
Let A be the application key identifier.
"allocate_key"
entry point, obtaining the driver key identifier D chosen by the driver.If any step fails:
Note that this process is identical to key destruction, except that the key is already in the transaction list.
Let A be the application key identifier.
We assume that the key is loaded in a key slot in memory: the core needs to know the key's location in order to determine whether the key is in a stateful secure element, and if so to know the driver key identifier. A possible optimization would be to load only that information in local variables, without occupying a key store; this has the advantage that key destruction works even if the key store is full.
"destroy_key"
entry point.If any step fails, remember the error but continue the process, to destroy the resources associated with the key as much as is practical.
For each key A in the transaction list file, if the file A exists in the internal storage:
"destroy_key"
entry point.The transaction list file can be processed in any order.
It is correct to update the transaction list after recovering each key, or to only delete the transaction list file once the recovery is over.
The transaction list file contains a fixed header followed by a list of fixed-size elements.
The file uid is PSA_CRYPTO_ITS_TRANSACTION_LIST_UID
= 0xffffff53.
sizeof(psa_storage_uid_t)
. Storing this size avoids reading bad data if Mbed TLS is upgraded to a different integration that names keys differently.In practice, there will rarely be more than one active transaction at a time, so the size of an element is not critical for efficiency. Therefore, in addition to the key identifier which is required, we add some potentially useful information in case it becomes useful later. We do not put the driver key identifier because its size is not a constant.
sizeof(psa_storage_uid_t)
bytes.sizeof(psa_key_lifetime_t)
). Currently unused during recovery.Note that the code base already references a “transaction file” (PSA_CRYPTO_ITS_TRANSACTION_UID
= 0xffffff54), used by dynamic secure elements (feature enabled with MBEDTLS_PSA_CRYPTO_SE_C
). This is a deprecated feature that has not been fully implemented: when this feature is enabled, the transaction file gets written during transactions, but if it exists when PSA crypto starts, psa_crypto_init()
fails because recovery has never been implemented.
For the new kind of secure element driver, we pick a different file name to avoid any mixup.
When MBEDTLS_TEST_HOOKS
is enabled, each call to psa_its_set()
or psa_its_remove()
also calls a test hook, passing the file UID as an argument to the hook.
When a stateful secure element driver is present in the build, we use this hook to verify that the storage respects the storage invariant. In addition, if there is some information about key ongoing operation (set explicitly by the test function as a global variable in the test framework), the hook tests that the content of the storage is compatible with the ongoing operation.
The storage invariant check cannot check all keys in storage, and does not need to (for example, it would be pointless to check anything about transparent keys). It checks the following keys:
Some tests set global variables to indicate which persistent keys they manipulate. We instrument at least some of these tests to also indicate what operation is in progress on the key. See the GitHub issues or the source code for details.
When no secure element driver is present in the build, the presence of a transaction list file during initialization is an error.
When the stateful test secure element driver is present in the build, we run test cases on a representative selection of states of the internal storage and the test secure element. Each test case for transaction recovery has the following form:
psa_crypto_init()
. Expect success if the initial state satisfies the storage invariant and failure otherwise.For a given key located in a secure element, the following combination of states are possible:
We test all $2 \times 2 \times 3 = 12$ possibilities, each in its own test case. In each case, call the test function that checks the storage invariant and check that its result is as expected. Then, if the storage invariant is met, follow the recovery testing process.
In addition, have at least one positive test case for each creation method other than import, to ensure that we don't reject a valid value.
Note: testing of a damaged filesystem (including a filesystem that doesn't meet the invariant) is out of scope of the present document.