kernel: sched: plug assertion race in z_get_next_switch_handle()

Commit d4d51dc062bf ("kernel:  Replace redundant switch_handle assignment
with assertion") introduced an assertion check that may be triggered
as follows by tests/kernel/smp_abort:

CPU0              CPU1              CPU2
----              ----              ____
* [thread A]      * [thread B]      * [thread C]
* irq_offload()   * irq_offload()   * irq_offload()
* k_thread_abort(thread B)
                  * k_thread_abort(thread C)
                                    * k_thread_abort(thread A)
* thread_halt_spin()
* z_is_thread_halting(_current) is false
* while (z_is_thread_halting(thread B));
                  * thread_halt_spin()
                  * z_is_thread_halting(_current) is true
                  * halt_thread(_current...);
                  * z_dummy_thread_init()
                    - dummy_thread->switch_handle = NULL;
                    - _current = dummy_thread;
                  * while (z_is_thread_halting(thread C));
* z_get_next_switch_handle()
* z_arm64_context_switch()
* [thread A is dead]
                                    * thread_halt_spin()
                                    * z_is_thread_halting(_current) is true
                                    * halt_thread(_current...);
                                    * z_dummy_thread_init()
                                      - dummy_thread->switch_handle = NULL;
                                      - _current = dummy_thread;
                                    * while(z_is_thread_halting(thread A));
                  * z_get_next_switch_handle()
                    - old_thread == dummy_thread
                    - __ASSERT(old_thread->switch_handle == NULL) OK
                  * z_arm64_context_switch()
                    - str x1, [x1, #___thread_t_switch_handle_OFFSET]
                  * [thread B is dead]
                  * %%% dummy_thread->switch_handle no longer NULL %%%
                                    * z_get_next_switch_handle()
                                      - old_thread == dummy_thread
                                      - __ASSERT(old_thread->
                                             switch_handle == NULL) FAIL

This needs at least 3 CPUs and the perfect timing for the race to work as
sometimes CPUs 1 and 2 may be close enough in their execution paths for
the assertion to pass. For example, QEMU is OK while FVP is not.
Also adding sufficient debug traces can make the issue go away.

This happens because the dummy thread is shared among concurrent CPUs.
It could be argued that a per-CPU dummy thread structure would be the
proper solution to this problem. However the purpose of a dummy thread
structure is to provide a dumping ground for the scheduler code to work
while the original thread structure might already be reused and
therefore can't be clobbered as demonstrated above. But the dummy
structure _can_ be clobbered to some extent and it is not worth the
additional memory footprint implied by per-CPU instances. We just have
to ignore some validity tests when the dummy thread is concerned.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
1 file changed