Properly handle crashes/reboots during FabricTable commit (#20010)
* Properly handle crashes/reboots during FabricTable commit
- Since #19819, commits are very small and safer. There is less surface
to fail during commit. The previous large-scale fail-safe behavior
stored too much data, for too long and could cause larger reverts
even if nothing was committed yet. FabricTable data no longer is
ever persisted without commit.
- The existing code deleted fabrics unwittingly when not required,
such as when powering off a light during a fail-safe for an
update when there was nothing committed yet, assuming we still
committed immediately.
This change:
- Detects failed commits
- Only deletes data on failed commits
- Fixes Thread driver to detect stale data where a backup was done
(since we cannot prevent internal commits from OpenThread)
Testing done:
- Added unit test to FabricTable to generate the condition
- Did manual testing of all-clusters-app/chip-tool Linux that
aborted on second commissioning, during commit. Found that
cleanup occurred as expected on restart
- Integration/cert testing (including Cirque that validates
fail-safe) still pass
* Restyled by clang-format
* Restyled by gn
* Clear commit token after deferred cleanup
* Applied code review comments
* Restyled by clang-format
* Fix CI
* Fix CI
* Remove revert at restart of openthread after other changes
* Apply review comments
* Try to fix nRF CI
* Another attempt to fix nrfconnect CI
* Fix CI some more
* Fix CI again
Co-authored-by: Restyled.io <commits@restyled.io>
diff --git a/src/app/server/Server.cpp b/src/app/server/Server.cpp
index 0ea320d..8ea1db7 100644
--- a/src/app/server/Server.cpp
+++ b/src/app/server/Server.cpp
@@ -169,9 +169,6 @@
// This initializes clusters, so should come after lower level initialization.
InitDataModelHandler(&mExchangeMgr);
- // Clean-up previously standing fail-safes
- InitFailSafe();
-
// Init transport before operations with secure session mgr.
err = mTransports.Init(UdpListenParameters(DeviceLayer::UDPEndPointManager())
.SetAddressType(IPAddressType::kIPv6)
@@ -257,7 +254,6 @@
else
{
#if CHIP_DEVICE_CONFIG_ENABLE_PAIRING_AUTOSTART
- GetFabricTable().DeleteAllFabrics();
SuccessOrExit(err = mCommissioningWindowManager.OpenBasicCommissioningWindow());
#endif
}
@@ -300,6 +296,41 @@
RejoinExistingMulticastGroups();
#endif // !CHIP_SYSTEM_CONFIG_USE_OPEN_THREAD_ENDPOINT
+ // Handle deferred clean-up of a previously armed fail-safe that occurred during FabricTable commit.
+ // This is done at the very end since at the earlier time above when FabricTable::Init() is called,
+ // the delegates could not have been registered, and the other systems were not initialized. By now,
+ // everything is initialized, so we can do a deferred clean-up.
+ {
+ FabricIndex fabricIndexDeletedOnInit = GetFabricTable().GetDeletedFabricFromCommitMarker();
+ if (fabricIndexDeletedOnInit != kUndefinedFabricIndex)
+ {
+ ChipLogError(AppServer, "FabricIndex 0x%x deleted due to restart while fail-safed. Processing a clean-up!",
+ static_cast<unsigned>(fabricIndexDeletedOnInit));
+
+ // Always pretend it was an add, since being in the middle of an update currently breaks
+ // the validity of the fabric table. This is expected to be extremely infrequent, so
+ // this "harsher" than usual clean-up is more likely to get us in a valid state for whatever
+ // remains.
+ const bool addNocCalled = true;
+ const bool updateNocCalled = false;
+ GetFailSafeContext().ScheduleFailSafeCleanup(fabricIndexDeletedOnInit, addNocCalled, updateNocCalled);
+
+ // Schedule clearing of the commit marker to only occur after we have processed all fail-safe clean-up.
+ // Because Matter runs a single event loop for all scheduled work, it will occur after the above has
+ // taken place. If a reset occurs before we have cleaned everything up, the next boot will still
+ // see the commit marker.
+ PlatformMgr().ScheduleWork(
+ [](intptr_t arg) {
+ Server * server = reinterpret_cast<Server *>(arg);
+ VerifyOrReturn(server != nullptr);
+
+ server->GetFabricTable().ClearCommitMarker();
+ ChipLogProgress(AppServer, "Cleared FabricTable pending commit marker");
+ },
+ reinterpret_cast<intptr_t>(this));
+ }
+ }
+
PlatformMgr().HandleServerStarted();
exit:
@@ -315,39 +346,6 @@
return err;
}
-void Server::InitFailSafe()
-{
- bool failSafeArmed = false;
-
- CHIP_ERROR err = CHIP_NO_ERROR;
-
- // If the fail-safe was armed when the device last shutdown, initiate cleanup based on the pending Fail Safe Context with
- // which the fail-safe timer was armed.
- if (DeviceLayer::ConfigurationMgr().GetFailSafeArmed(failSafeArmed) == CHIP_NO_ERROR && failSafeArmed)
- {
- FabricIndex fabricIndex;
- bool addNocCommandInvoked;
- bool updateNocCommandInvoked;
-
- ChipLogProgress(AppServer, "Detected fail-safe armed on reboot");
-
- err = DeviceLayer::FailSafeContext::LoadFromStorage(fabricIndex, addNocCommandInvoked, updateNocCommandInvoked);
- if (err == CHIP_NO_ERROR)
- {
- DeviceLayer::DeviceControlServer::DeviceControlSvr().GetFailSafeContext().ScheduleFailSafeCleanup(
- fabricIndex, addNocCommandInvoked, updateNocCommandInvoked);
- }
- else
- {
- // This should not happen, but we should not fail system init based on it!
- ChipLogError(DeviceLayer, "Failed to load fail-safe context from storage (err= %" CHIP_ERROR_FORMAT "), cleaning-up!",
- err.Format());
- (void) DeviceLayer::ConfigurationMgr().SetFailSafeArmed(false);
- err = CHIP_NO_ERROR;
- }
- }
-}
-
void Server::RejoinExistingMulticastGroups()
{
ChipLogProgress(AppServer, "Joining Multicast groups");