docs/advanced/deadlock.md - third_party/github/pybind/pybind11 - Git at Google

 # Double locking, deadlocking, GIL

 [TOC]

 ## Introduction

 ### Overview

 In concurrent programming with locks, *deadlocks* can arise when more than one
 mutex is locked at the same time, and careful attention has to be paid to lock
 ordering to avoid this. Here we will look at a common situation that occurs in
 native extensions for CPython written in C++.

 ### Deadlocks

 A deadlock can occur when more than one thread attempts to lock more than one
 mutex, and two of the threads lock two of the mutexes in different orders. For
 example, consider mutexes `mu1` and `mu2`, and threads T1 and T2, executing:

 |    | T1                  | T2                 |
 |--- | ------------------- | -------------------|
 |1   | `mu1.lock()`{.good} | `mu2.lock()`{.good}|
 |2   | `mu2.lock()`{.bad}  | `mu1.lock()`{.bad} |
 |3   | `/* work */`        | `/* work */`       |
 |4   | `mu2.unlock()`      | `mu1.unlock()`     |
 |5   | `mu1.unlock()`      | `mu2.unlock()`     |

 Now if T1 manages to lock `mu1` and T2 manages to lock `mu2` (as indicated in
 green), then both threads will block while trying to lock the respective other
 mutex (as indicated in red), but they are also unable to release the mutex that
 they have locked (step 5).

 **The problem** is that it is possible for one thread to attempt to lock `mu1`
 and then `mu2`, and for another thread to attempt to lock `mu2` and then `mu1`.
 Note that it does not matter if either mutex is unlocked at any intermediate
 point; what matters is only the order of any attempt to *lock* the mutexes. For
 example, the following, more complex series of operations is just as prone to
 deadlock:

 |    | T1                  | T2                 |
 |--- | ------------------- | -------------------|
 |1   | `mu1.lock()`{.good} | `mu1.lock()`{.good}|
 |2   | waiting for T2      | `mu2.lock()`{.good}|
 |3   | waiting for T2      | `/* work */`       |
 |3   | waiting for T2      | `mu1.unlock()`     |
 |3   | `mu2.lock()`{.bad}  | `/* work */`       |
 |3   | `/* work */`        | `mu1.lock()`{.bad} |
 |3   | `/* work */`        | `/* work */`       |
 |4   | `mu2.unlock()`      | `mu1.unlock()`     |
 |5   | `mu1.unlock()`      | `mu2.unlock()`     |

 When the mutexes involved in a locking sequence are known at compile-time, then
 avoiding deadlocks is &ldquo;merely&rdquo; a matter of arranging the lock
 operations carefully so as to only occur in one single, fixed order. However, it
 is also possible for mutexes to only be determined at runtime. A typical example
 of this is a database where each row has its own mutex. An operation that
 modifies two rows in a single transaction (e.g. &ldquo;transferring an amount
 from one account to another&rdquo;) must lock two row mutexes, but the locking
 order cannot be established at compile time. In this case, a dynamic
 &ldquo;deadlock avoidance algorithm&rdquo; is needed. (In C++, `std::lock`
 provides such an algorithm. An algorithm might use a non-blocking `try_lock`
 operation on a mutex, which can either succeed or fail to lock the mutex, but
 returns without blocking.)

 Conceptually, one could also consider it a deadlock if _the same_ thread
 attempts to lock a mutex that it has already locked (e.g. when some locked
 operation accidentally recurses into itself): `mu.lock();`{.good}
 `mu.lock();`{.bad} However, this is a slightly separate issue: Typical mutexes
 are either of _recursive_ or _non-recursive_ kind. A recursive mutex allows
 repeated locking and requires balanced unlocking. A non-recursive mutex can be
 implemented more efficiently, and/but for efficiency reasons does not actually
 guarantee a deadlock on second lock. Instead, the API simply forbids such use,
 making it a precondition that the thread not already hold the mutex, with
 undefined behaviour on violation.

 ### &ldquo;Once&rdquo; initialization

 A common programming problem is to have an operation happen precisely once, even
 if requested concurrently. While it is clear that we need to track in some
 shared state somewhere whether the operation has already happened, it is worth
 noting that this state only ever transitions, once, from `false` to `true`. This
 is considerably simpler than a general shared state that can change values
 arbitrarily. Next, we also need a mechanism for all but one thread to block
 until the initialization has completed, which we can provide with a mutex. The
 simplest solution just always locks the mutex:

 ```c++
 // The "once" mechanism:
 constinit absl::Mutex mu(absl::kConstInit);
 constinit bool init_done = false;

 // The operation of interest:
 void f();

 void InitOnceNaive() {
   absl::MutexLock lock(&mu);
   if (!init_done) {
     f();
     init_done = true;
   }
 }
 ```

 This works, but the efficiency-minded reader will observe that once the
 operation has completed, all future lock contention on the mutex is
 unnecessary. This leads to the (in)famous &ldquo;double-locking&rdquo;
 algorithm, which was historically hard to write correctly. The idea is to check
 the boolean *before* locking the mutex, and avoid locking if the operation has
 already completed. However, accessing shared state concurrently when at least
 one access is a write is prone to causing a data race and needs to be done
 according to an appropriate concurrent programming model. In C++ we use atomic
 variables:

 ```c++
 // The "once" mechanism:
 constinit absl::Mutex mu(absl::kConstInit);
 constinit std::atomic<bool> init_done = false;

 // The operation of interest:
 void f();

 void InitOnceWithFastPath() {
   if (!init_done.load(std::memory_order_acquire)) {
     absl::MutexLock lock(&mu);
     if (!init_done.load(std::memory_order_relaxed)) {
       f();
       init_done.store(true, std::memory_order_release);
     }
   }
 }
 ```

 Checking the flag now happens without holding the mutex lock, and if the
 operation has already completed, we return immediately. After locking the mutex,
 we need to check the flag again, since multiple threads can reach this point.

 *Atomic details.* Since the atomic flag variable is accessed concurrently, we
 have to think about the memory order of the accesses. There are two separate
 cases: The first, outer check outside the mutex lock, and the second, inner
 check under the lock. The outer check and the flag update form an
 acquire/release pair: *if* the load sees the value `true` (which must have been
 written by the store operation), then it also sees everything that happened
 before the store, namely the operation `f()`. By contrast, the inner check can
 use relaxed memory ordering, since in that case the mutex operations provide the
 necessary ordering: if the inner load sees the value `true`, it happened after
 the `lock()`, which happened after the `unlock()`, which happened after the
 store.

 The C++ standard library, and Abseil, provide a ready-made solution of this
 algorithm called `std::call_once`/`absl::call_once`. (The interface is the same,
 but the Abseil implementation is possibly better.)

 ```c++
 // The "once" mechanism:
 constinit absl::once_flag init_flag;

 // The operation of interest:
 void f();

 void InitOnceWithCallOnce() {
   absl::call_once(once_flag, f);
 }
 ```

 Even though conceptually this is performing the same algorithm, this
 implementation has some considerable advantages: The `once_flag` type is a small
 and trivial, integer-like type and is trivially destructible. Not only does it
 take up less space than a mutex, it also generates less code since it does not
 have to run a destructor, which would need to be added to the program's global
 destructor list.

 The final clou comes with the C++ semantics of a `static` variable declared at
 block scope: According to [[stmt.dcl]](https://eel.is/c++draft/stmt.dcl#3):

 > Dynamic initialization of a block variable with static storage duration or
 > thread storage duration is performed the first time control passes through its
 > declaration; such a variable is considered initialized upon the completion of
 > its initialization. [...] If control enters the declaration concurrently while
 > the variable is being initialized, the concurrent execution shall wait for
 > completion of the initialization.

 This is saying that the initialization of a local, `static` variable precisely
 has the &ldquo;once&rdquo; semantics that we have been discussing. We can
 therefore write the above example as follows:

 ```c++
 // The operation of interest:
 void f();

 void InitOnceWithStatic() {
   static int unused = (f(), 0);
 }
 ```

 This approach is by far the simplest and easiest, but the big difference is that
 the mutex (or mutex-like object) in this implementation is no longer visible or
 in the user&rsquo;s control. This is perfectly fine if the initializer is
 simple, but if the initializer itself attempts to lock any other mutex
 (including by initializing another static variable!), then we have no control
 over the lock ordering!

 Finally, you may have noticed the `constinit`s around the earlier code. Both
 `constinit` and `constexpr` specifiers on a declaration mean that the variable
 is *constant-initialized*, which means that no initialization is performed at
 runtime (the initial value is already known at compile time). This in turn means
 that a static variable guard mutex may not be needed, and static initialization
 never blocks. The difference between the two is that a `constexpr`-specified
 variable is also `const`, and a variable cannot be `constexpr` if it has a
 non-trivial destructor. Such a destructor also means that the guard mutex is
 needed after all, since the destructor must be registered to run at exit,
 conditionally on initialization having happened.

 ## Python, CPython, GIL

 With CPython, a Python program can call into native code. To this end, the
 native code registers callback functions with the Python runtime via the CPython
 API. In order to ensure that the internal state of the Python runtime remains
 consistent, there is a single, shared mutex called the &ldquo;global interpreter
 lock&rdquo;, or GIL for short. Upon entry of one of the user-provided callback
 functions, the GIL is locked (or &ldquo;held&rdquo;), so that no other mutations
 of the Python runtime state can occur until the native callback returns.

 Many native extensions do not interact with the Python runtime for at least some
 part of them, and so it is common for native extensions to _release_ the GIL, do
 some work, and then reacquire the GIL before returning. Similarly, when code is
 generally not holding the GIL but needs to interact with the runtime briefly, it
 will first reacquire the GIL. The GIL is reentrant, and constructions to acquire
 and subsequently release the GIL are common, and often don't worry about whether
 the GIL is already held.

 If the native code is written in C++ and contains local, `static` variables,
 then we are now dealing with at least _two_ mutexes: the static variable guard
 mutex, and the GIL from CPython.

 A common problem in such code is an operation with &ldquo;only once&rdquo;
 semantics that also ends up requiring the GIL to be held at some point. As per
 the above description of &ldquo;once&rdquo;-style techniques, one might find a
 static variable:

 ```c++
 // CPython callback, assumes that the GIL is held on entry.
 PyObject* InvokeWidget(PyObject* self) {
   static PyObject* impl = CreateWidget();
   return PyObject_CallOneArg(impl, self);
 }
 ```

 This seems reasonable, but bear in mind that there are two mutexes (the "guard
 mutex" and "the GIL"), and we must think about the lock order. Otherwise, if the
 callback is called from multiple threads, a deadlock may ensue.

 Let us consider what we can see here: On entry, the GIL is already locked, and
 we are locking the guard mutex. This is one lock order. Inside the initializer
 `CreateWidget`, with both mutexes already locked, the function can freely access
 the Python runtime.

 However, it is entirely possible that `CreateWidget` will want to release the
 GIL at one point and reacquire it later:

 ```c++
 // Assumes that the GIL is held on entry.
 // Ensures that the GIL is held on exit.
 PyObject* CreateWidget() {
   // ...
   Py_BEGIN_ALLOW_THREADS  // releases GIL
   // expensive work, not accessing the Python runtime
   Py_END_ALLOW_THREADS    // acquires GIL, #!
   // ...
   return result;
 }
 ```

 Now we have a second lock order: the guard mutex is locked, and then the GIL is
 locked (at `#!`). To see how this deadlocks, consider threads T1 and T2 both
 having the runtime attempt to call `InvokeWidget`. T1 locks the GIL and
 proceeds, locking the guard mutex and calling `CreateWidget`; T2 is blocked
 waiting for the GIL. Then T1 releases the GIL to do &ldquo;expensive
 work&rdquo;, and T2 awakes and locks the GIL. Now T2 is blocked trying to
 acquire the guard mutex, but T1 is blocked reacquiring the GIL (at `#!`).

 In other words: if we want to support &ldquo;once-called&rdquo; functions that
 can arbitrarily release and reacquire the GIL, as is very common, then the only
 lock order that we can ensure is: guard mutex first, GIL second.

 To implement this, we must rewrite our code. Naively, we could always release
 the GIL before a `static` variable with blocking initializer:

 ```c++
 // CPython callback, assumes that the GIL is held on entry.
 PyObject* InvokeWidget(PyObject* self) {
   Py_BEGIN_ALLOW_THREADS  // releases GIL
   static PyObject* impl = CreateWidget();
   Py_END_ALLOW_THREADS    // acquires GIL

   return PyObject_CallOneArg(impl, self);
 }
 ```

 But similar to the `InitOnceNaive` example above, this code cycles the GIL
 (possibly descheduling the thread) even when the static variable has already
 been initialized. If we want to avoid this, we need to abandon the use of a
 static variable, since we do not control the guard mutex well enough. Instead,
 we use an operation whose mutex locking is under our control, such as
 `call_once`. For example:

 ```c++
 // CPython callback, assumes that the GIL is held on entry.
 PyObject* InvokeWidget(PyObject* self) {
   static constinit PyObject* impl = nullptr;
   static constinit std::atomic<bool> init_done = false;
   static constinit absl::once_flag init_flag;

   if (!init_done.load(std::memory_order_acquire)) {
     Py_BEGIN_ALLOW_THREADS                       // releases GIL
     absl::call_once(init_flag, [&]() {
       PyGILState_STATE s = PyGILState_Ensure();  // acquires GIL
       impl = CreateWidget();
       PyGILState_Release(s);                     // releases GIL
       init_done.store(true, std::memory_order_release);
     });
     Py_END_ALLOW_THREADS                         // acquires GIL
   }

   return PyObject_CallOneArg(impl, self);
 }
 ```

 The lock order is now always guard mutex first, GIL second. Unfortunately we
 have to duplicate the &ldquo;double-checked done flag&rdquo;, effectively
 leading to triple checking, because the flag state inside the `absl::once_flag`
 is not accessible to the user. In other words, we cannot ask `init_flag` whether
 it has been used yet.

 However, we can perform one last, minor optimisation: since we assume that the
 GIL is held on entry, and again when the initializing operation returns, the GIL
 actually serializes access to our done flag variable, which therefore does not
 need to be atomic. (The difference to the previous, atomic code may be small,
 depending on the architecture. For example, on x86-64, acquire/release on a bool
 is nearly free ([demo](https://godbolt.org/z/P9vYWf4fE)).)

 ```c++
 // CPython callback, assumes that the GIL is held on entry, and indeed anywhere
 // directly in this function (i.e. the GIL can be released inside CreateWidget,
 // but must be reaqcuired when that call returns).
 PyObject* InvokeWidget(PyObject* self) {
   static constinit PyObject* impl = nullptr;
   static constinit bool init_done = false;       // guarded by GIL
   static constinit absl::once_flag init_flag;

   if (!init_done) {
     Py_BEGIN_ALLOW_THREADS                       // releases GIL
                                                  // (multiple threads may enter here)
     absl::call_once(init_flag, [&]() {
                                                  // (only one thread enters here)
       PyGILState_STATE s = PyGILState_Ensure();  // acquires GIL
       impl = CreateWidget();
       init_done = true;                          // (GIL is held)
       PyGILState_Release(s);                     // releases GIL
     });

     Py_END_ALLOW_THREADS                         // acquires GIL
   }

   return PyObject_CallOneArg(impl, self);
 }
 ```

 ## Debugging tips

 *   Build with symbols.
 *   <kbd>Ctrl</kbd>-<kbd>C</kbd> sends `SIGINT`, <kbd>Ctrl</kbd>-<kbd>\\</kbd>
     sends `SIGQUIT`. Both have their uses.
 *   Useful `gdb` commands:
     *   `py-bt` prints a Python backtrace if you are in a Python frame.
     *   `thread apply all bt 10` prints the top-10 frames for each thread. A
         full backtrace can be prohibitively expensive, and the top few frames
         are often good enough.
     *   `p PyGILState_Check()` shows whether a thread is holding the GIL. For
         all threads, run `thread apply all p PyGILState_Check()` to find out
         which thread is holding the GIL.
     *   The `static` variable guard mutex is accessed with functions like
         `cxa_guard_acquire` (though this depends on ABI details and can vary).
         The guard mutex itself contains information about which thread is
         currently holding it.

 ## Links

 *   Article on
     [double-checked locking](https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/)
 *   [The Deadlock Empire](https://deadlockempire.github.io/), hands-on exercises
     to construct deadlocks
	# Double locking, deadlocking, GIL

	[TOC]

	## Introduction

	### Overview

	In concurrent programming with locks, deadlocks can arise when more than one
	mutex is locked at the same time, and careful attention has to be paid to lock
	ordering to avoid this. Here we will look at a common situation that occurs in
	native extensions for CPython written in C++.

	### Deadlocks

	A deadlock can occur when more than one thread attempts to lock more than one
	mutex, and two of the threads lock two of the mutexes in different orders. For
	example, consider mutexes `mu1` and `mu2`, and threads T1 and T2, executing:

	\| \| T1 \| T2 \|
	\|--- \| ------------------- \| -------------------\|
	\|1 \| `mu1.lock()`{.good} \| `mu2.lock()`{.good}\|
	\|2 \| `mu2.lock()`{.bad} \| `mu1.lock()`{.bad} \|
	\|3 \| `/* work /` \| `/ work */` \|
	\|4 \| `mu2.unlock()` \| `mu1.unlock()` \|
	\|5 \| `mu1.unlock()` \| `mu2.unlock()` \|

	Now if T1 manages to lock `mu1` and T2 manages to lock `mu2` (as indicated in
	green), then both threads will block while trying to lock the respective other
	mutex (as indicated in red), but they are also unable to release the mutex that
	they have locked (step 5).

	The problem is that it is possible for one thread to attempt to lock `mu1`
	and then `mu2`, and for another thread to attempt to lock `mu2` and then `mu1`.
	Note that it does not matter if either mutex is unlocked at any intermediate
	point; what matters is only the order of any attempt to lock the mutexes. For
	example, the following, more complex series of operations is just as prone to
	deadlock:

	\| \| T1 \| T2 \|
	\|--- \| ------------------- \| -------------------\|
	\|1 \| `mu1.lock()`{.good} \| `mu1.lock()`{.good}\|
	\|2 \| waiting for T2 \| `mu2.lock()`{.good}\|
	\|3 \| waiting for T2 \| `/* work */` \|
	\|3 \| waiting for T2 \| `mu1.unlock()` \|
	\|3 \| `mu2.lock()`{.bad} \| `/* work */` \|
	\|3 \| `/* work */` \| `mu1.lock()`{.bad} \|
	\|3 \| `/* work /` \| `/ work */` \|
	\|4 \| `mu2.unlock()` \| `mu1.unlock()` \|
	\|5 \| `mu1.unlock()` \| `mu2.unlock()` \|

	When the mutexes involved in a locking sequence are known at compile-time, then
	avoiding deadlocks is “merely” a matter of arranging the lock
	operations carefully so as to only occur in one single, fixed order. However, it
	is also possible for mutexes to only be determined at runtime. A typical example
	of this is a database where each row has its own mutex. An operation that
	modifies two rows in a single transaction (e.g. “transferring an amount
	from one account to another”) must lock two row mutexes, but the locking
	order cannot be established at compile time. In this case, a dynamic
	“deadlock avoidance algorithm” is needed. (In C++, `std::lock`
	provides such an algorithm. An algorithm might use a non-blocking `try_lock`
	operation on a mutex, which can either succeed or fail to lock the mutex, but
	returns without blocking.)

	Conceptually, one could also consider it a deadlock if _the same_ thread
	attempts to lock a mutex that it has already locked (e.g. when some locked
	operation accidentally recurses into itself): `mu.lock();`{.good}
	`mu.lock();`{.bad} However, this is a slightly separate issue: Typical mutexes
	are either of _recursive_ or _non-recursive_ kind. A recursive mutex allows
	repeated locking and requires balanced unlocking. A non-recursive mutex can be
	implemented more efficiently, and/but for efficiency reasons does not actually
	guarantee a deadlock on second lock. Instead, the API simply forbids such use,
	making it a precondition that the thread not already hold the mutex, with
	undefined behaviour on violation.

	### “Once” initialization

	A common programming problem is to have an operation happen precisely once, even
	if requested concurrently. While it is clear that we need to track in some
	shared state somewhere whether the operation has already happened, it is worth
	noting that this state only ever transitions, once, from `false` to `true`. This
	is considerably simpler than a general shared state that can change values
	arbitrarily. Next, we also need a mechanism for all but one thread to block
	until the initialization has completed, which we can provide with a mutex. The
	simplest solution just always locks the mutex:

	```c++
	// The "once" mechanism:
	constinit absl::Mutex mu(absl::kConstInit);
	constinit bool init_done = false;

	// The operation of interest:
	void f();

	void InitOnceNaive() {
	absl::MutexLock lock(&mu);
	if (!init_done) {
	f();
	init_done = true;
	}
	}
	```

	This works, but the efficiency-minded reader will observe that once the
	operation has completed, all future lock contention on the mutex is
	unnecessary. This leads to the (in)famous “double-locking”
	algorithm, which was historically hard to write correctly. The idea is to check
	the boolean before locking the mutex, and avoid locking if the operation has
	already completed. However, accessing shared state concurrently when at least
	one access is a write is prone to causing a data race and needs to be done
	according to an appropriate concurrent programming model. In C++ we use atomic
	variables:

	```c++
	// The "once" mechanism:
	constinit absl::Mutex mu(absl::kConstInit);
	constinit std::atomic<bool> init_done = false;

	// The operation of interest:
	void f();

	void InitOnceWithFastPath() {
	if (!init_done.load(std::memory_order_acquire)) {
	absl::MutexLock lock(&mu);
	if (!init_done.load(std::memory_order_relaxed)) {
	f();
	init_done.store(true, std::memory_order_release);
	}
	}
	}
	```

	Checking the flag now happens without holding the mutex lock, and if the
	operation has already completed, we return immediately. After locking the mutex,
	we need to check the flag again, since multiple threads can reach this point.

	Atomic details. Since the atomic flag variable is accessed concurrently, we
	have to think about the memory order of the accesses. There are two separate
	cases: The first, outer check outside the mutex lock, and the second, inner
	check under the lock. The outer check and the flag update form an
	acquire/release pair: if the load sees the value `true` (which must have been
	written by the store operation), then it also sees everything that happened
	before the store, namely the operation `f()`. By contrast, the inner check can
	use relaxed memory ordering, since in that case the mutex operations provide the
	necessary ordering: if the inner load sees the value `true`, it happened after
	the `lock()`, which happened after the `unlock()`, which happened after the
	store.

	The C++ standard library, and Abseil, provide a ready-made solution of this
	algorithm called `std::call_once`/`absl::call_once`. (The interface is the same,
	but the Abseil implementation is possibly better.)

	```c++
	// The "once" mechanism:
	constinit absl::once_flag init_flag;

	// The operation of interest:
	void f();

	void InitOnceWithCallOnce() {
	absl::call_once(once_flag, f);
	}
	```

	Even though conceptually this is performing the same algorithm, this
	implementation has some considerable advantages: The `once_flag` type is a small
	and trivial, integer-like type and is trivially destructible. Not only does it
	take up less space than a mutex, it also generates less code since it does not
	have to run a destructor, which would need to be added to the program's global
	destructor list.

	The final clou comes with the C++ semantics of a `static` variable declared at
	block scope: According to [[stmt.dcl]](https://eel.is/c++draft/stmt.dcl#3):

	> Dynamic initialization of a block variable with static storage duration or
	> thread storage duration is performed the first time control passes through its
	> declaration; such a variable is considered initialized upon the completion of
	> its initialization. [...] If control enters the declaration concurrently while
	> the variable is being initialized, the concurrent execution shall wait for
	> completion of the initialization.

	This is saying that the initialization of a local, `static` variable precisely
	has the “once” semantics that we have been discussing. We can
	therefore write the above example as follows:

	```c++
	// The operation of interest:
	void f();

	void InitOnceWithStatic() {
	static int unused = (f(), 0);
	}
	```

	This approach is by far the simplest and easiest, but the big difference is that
	the mutex (or mutex-like object) in this implementation is no longer visible or
	in the user’s control. This is perfectly fine if the initializer is
	simple, but if the initializer itself attempts to lock any other mutex
	(including by initializing another static variable!), then we have no control
	over the lock ordering!

	Finally, you may have noticed the `constinit`s around the earlier code. Both
	`constinit` and `constexpr` specifiers on a declaration mean that the variable
	is constant-initialized, which means that no initialization is performed at
	runtime (the initial value is already known at compile time). This in turn means
	that a static variable guard mutex may not be needed, and static initialization
	never blocks. The difference between the two is that a `constexpr`-specified
	variable is also `const`, and a variable cannot be `constexpr` if it has a
	non-trivial destructor. Such a destructor also means that the guard mutex is
	needed after all, since the destructor must be registered to run at exit,
	conditionally on initialization having happened.

	## Python, CPython, GIL

	With CPython, a Python program can call into native code. To this end, the
	native code registers callback functions with the Python runtime via the CPython
	API. In order to ensure that the internal state of the Python runtime remains
	consistent, there is a single, shared mutex called the “global interpreter
	lock”, or GIL for short. Upon entry of one of the user-provided callback
	functions, the GIL is locked (or “held”), so that no other mutations
	of the Python runtime state can occur until the native callback returns.

	Many native extensions do not interact with the Python runtime for at least some
	part of them, and so it is common for native extensions to _release_ the GIL, do
	some work, and then reacquire the GIL before returning. Similarly, when code is
	generally not holding the GIL but needs to interact with the runtime briefly, it
	will first reacquire the GIL. The GIL is reentrant, and constructions to acquire
	and subsequently release the GIL are common, and often don't worry about whether
	the GIL is already held.

	If the native code is written in C++ and contains local, `static` variables,
	then we are now dealing with at least _two_ mutexes: the static variable guard
	mutex, and the GIL from CPython.

	A common problem in such code is an operation with “only once”
	semantics that also ends up requiring the GIL to be held at some point. As per
	the above description of “once”-style techniques, one might find a
	static variable:

	```c++
	// CPython callback, assumes that the GIL is held on entry.
	PyObject* InvokeWidget(PyObject* self) {
	static PyObject* impl = CreateWidget();
	return PyObject_CallOneArg(impl, self);
	}
	```

	This seems reasonable, but bear in mind that there are two mutexes (the "guard
	mutex" and "the GIL"), and we must think about the lock order. Otherwise, if the
	callback is called from multiple threads, a deadlock may ensue.

	Let us consider what we can see here: On entry, the GIL is already locked, and
	we are locking the guard mutex. This is one lock order. Inside the initializer
	`CreateWidget`, with both mutexes already locked, the function can freely access
	the Python runtime.

	However, it is entirely possible that `CreateWidget` will want to release the
	GIL at one point and reacquire it later:

	```c++
	// Assumes that the GIL is held on entry.
	// Ensures that the GIL is held on exit.
	PyObject* CreateWidget() {
	// ...
	Py_BEGIN_ALLOW_THREADS // releases GIL
	// expensive work, not accessing the Python runtime
	Py_END_ALLOW_THREADS // acquires GIL, #!
	// ...
	return result;
	}
	```

	Now we have a second lock order: the guard mutex is locked, and then the GIL is
	locked (at `#!`). To see how this deadlocks, consider threads T1 and T2 both
	having the runtime attempt to call `InvokeWidget`. T1 locks the GIL and
	proceeds, locking the guard mutex and calling `CreateWidget`; T2 is blocked
	waiting for the GIL. Then T1 releases the GIL to do “expensive
	work”, and T2 awakes and locks the GIL. Now T2 is blocked trying to
	acquire the guard mutex, but T1 is blocked reacquiring the GIL (at `#!`).

	In other words: if we want to support “once-called” functions that
	can arbitrarily release and reacquire the GIL, as is very common, then the only
	lock order that we can ensure is: guard mutex first, GIL second.

	To implement this, we must rewrite our code. Naively, we could always release
	the GIL before a `static` variable with blocking initializer:

	```c++
	// CPython callback, assumes that the GIL is held on entry.
	PyObject* InvokeWidget(PyObject* self) {
	Py_BEGIN_ALLOW_THREADS // releases GIL
	static PyObject* impl = CreateWidget();
	Py_END_ALLOW_THREADS // acquires GIL

	return PyObject_CallOneArg(impl, self);
	}
	```

	But similar to the `InitOnceNaive` example above, this code cycles the GIL
	(possibly descheduling the thread) even when the static variable has already
	been initialized. If we want to avoid this, we need to abandon the use of a
	static variable, since we do not control the guard mutex well enough. Instead,
	we use an operation whose mutex locking is under our control, such as
	`call_once`. For example:

	```c++
	// CPython callback, assumes that the GIL is held on entry.
	PyObject* InvokeWidget(PyObject* self) {
	static constinit PyObject* impl = nullptr;
	static constinit std::atomic<bool> init_done = false;
	static constinit absl::once_flag init_flag;

	if (!init_done.load(std::memory_order_acquire)) {
	Py_BEGIN_ALLOW_THREADS // releases GIL
	absl::call_once(init_flag, [&]() {
	PyGILState_STATE s = PyGILState_Ensure(); // acquires GIL
	impl = CreateWidget();
	PyGILState_Release(s); // releases GIL
	init_done.store(true, std::memory_order_release);
	});
	Py_END_ALLOW_THREADS // acquires GIL
	}

	return PyObject_CallOneArg(impl, self);
	}
	```

	The lock order is now always guard mutex first, GIL second. Unfortunately we
	have to duplicate the “double-checked done flag”, effectively
	leading to triple checking, because the flag state inside the `absl::once_flag`
	is not accessible to the user. In other words, we cannot ask `init_flag` whether
	it has been used yet.

	However, we can perform one last, minor optimisation: since we assume that the
	GIL is held on entry, and again when the initializing operation returns, the GIL
	actually serializes access to our done flag variable, which therefore does not
	need to be atomic. (The difference to the previous, atomic code may be small,
	depending on the architecture. For example, on x86-64, acquire/release on a bool
	is nearly free ([demo](https://godbolt.org/z/P9vYWf4fE)).)

	```c++
	// CPython callback, assumes that the GIL is held on entry, and indeed anywhere
	// directly in this function (i.e. the GIL can be released inside CreateWidget,
	// but must be reaqcuired when that call returns).
	PyObject* InvokeWidget(PyObject* self) {
	static constinit PyObject* impl = nullptr;
	static constinit bool init_done = false; // guarded by GIL
	static constinit absl::once_flag init_flag;

	if (!init_done) {
	Py_BEGIN_ALLOW_THREADS // releases GIL
	// (multiple threads may enter here)
	absl::call_once(init_flag, [&]() {
	// (only one thread enters here)
	PyGILState_STATE s = PyGILState_Ensure(); // acquires GIL
	impl = CreateWidget();
	init_done = true; // (GIL is held)
	PyGILState_Release(s); // releases GIL
	});

	Py_END_ALLOW_THREADS // acquires GIL
	}

	return PyObject_CallOneArg(impl, self);
	}
	```

	## Debugging tips

	* Build with symbols.
	* <kbd>Ctrl</kbd>-<kbd>C</kbd> sends `SIGINT`, <kbd>Ctrl</kbd>-<kbd>\\</kbd>
	sends `SIGQUIT`. Both have their uses.
	* Useful `gdb` commands:
	* `py-bt` prints a Python backtrace if you are in a Python frame.
	* `thread apply all bt 10` prints the top-10 frames for each thread. A
	full backtrace can be prohibitively expensive, and the top few frames
	are often good enough.
	* `p PyGILState_Check()` shows whether a thread is holding the GIL. For
	all threads, run `thread apply all p PyGILState_Check()` to find out
	which thread is holding the GIL.
	* The `static` variable guard mutex is accessed with functions like
	`cxa_guard_acquire` (though this depends on ABI details and can vary).
	The guard mutex itself contains information about which thread is
	currently holding it.

	## Links

	* Article on
	[double-checked locking](https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/)
	* [The Deadlock Empire](https://deadlockempire.github.io/), hands-on exercises
	to construct deadlocks