.. _seed-0109:
===========================
0109: Communication Buffers
===========================
.. seed::
   :number: 109
   :name: Communication Buffers
   :status: Accepted
   :proposal_date: 2023-08-28
   :cl: 168357
   :authors: Taylor Cramer
   :facilitator: Erik Gilling
-------
Summary
-------
This SEED proposes that Pigweed adopt a standard buffer type for network-style
communications. This buffer type will be used in the new sockets API
(see the recently-accepted `Communications SEED 107
<https://pigweed.dev/seed/0107-communications.html>`_ for more details), and
will be carried throughout the communications stack as appropriate.
------------------------
Top-Level Usage Examples
------------------------
Sending a Proto Into a Socket
=============================
.. code-block:: cpp

   Allocator& msg_allocator = socket.Allocator();
   size_t size = EncodedProtoSize(some_proto);
   std::optional<pw::MultiBuf> buffer = msg_allocator.Allocate(size);
   if (!buffer) { return; }
   EncodeProto(some_proto, *buffer);
   socket.Send(std::move(*buffer));
``Socket`` s provide an allocator which should be used to create a
``pw::MultiBuf``. The data to be sent is written into this buffer before being
passed to ``Send``. ``Socket`` s must accept ``pw::MultiBuf`` s which were not
created by their own ``Allocator``, but this may incur a performance penalty.
Zero-Copy Fragmentation
=======================
The example above has a hidden feature: zero-copy fragmentation! The socket
can provide a ``pw::MultiBuf`` which is divided up into separate MTU-sized
chunks, each of which has reserved space for headers and/or footers:
.. code-block:: cpp

   std::optional<pw::MultiBuf> Allocate(size_t size) {
     if (size == 0) { return pw::MultiBuf(); }
     size_t data_per_chunk = mtu_size_ - header_size_;
     size_t num_chunks = 1 + ((size - 1) / data_per_chunk);
     std::optional<pw::MultiBuf> buffer = pw::MultiBuf::WithCapacity(num_chunks);
     if (!buffer) { return std::nullopt; }
     for (size_t i = 0; i < num_chunks; i++) {
       // Note: we could allocate smaller than `mtu_size_` for the final
       // chunk. This is omitted here for brevity.
       std::optional<pw::MultiBuf::Chunk> chunk = internal_alloc_.Allocate(mtu_size_);
       if (!chunk) { return std::nullopt; }
       // Reserve the header size by discarding the front of each chunk.
       chunk->DiscardFront(header_size_);
       buffer->Chunks()[i] = std::move(*chunk);
     }
     return buffer;
   }
This code reserves ``header_size_`` bytes at the beginning of each ``Chunk``.
When the caller writes into this memory and then passes it back to the socket,
these bytes can be claimed for the header using the ``ClaimPrefix`` function.
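A minimal sketch of what this might look like on the socket's transmit path,
where ``WriteHeader`` and ``link_`` are hypothetical helpers rather than part of
this proposal:

.. code-block:: cpp

   void Socket::Send(pw::MultiBuf buffer) {
     for (auto& chunk : buffer.Chunks()) {
       // Re-expand each chunk into the space reserved via `DiscardFront` above.
       // This fails if there are no unreferenced bytes in front of the chunk,
       // e.g. because the buffer did not come from this socket's allocator.
       if (!chunk.ClaimPrefix(header_size_)) {
         return;  // Fall back to a copying path (omitted for brevity).
       }
       WriteHeader(chunk.span().first(header_size_));
     }
     link_.Transmit(std::move(buffer));
   }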
One use case that seems to demand the ability to fragment like this is breaking
up ``SOCK_SEQPACKET`` messages, which (at least on Unix / Linux) may be much larger
than the MTU size: up to ``SO_SNDBUF`` bytes (see `this man page
<https://man7.org/linux/man-pages/man7/socket.7.html>`_).
Multiplexing Incoming Data
==========================
.. code-block:: cpp

   [[nodiscard]] bool SplitAndSend(pw::MultiBuf buffer, size_t split_index) {
     std::optional<std::array<pw::MultiBuf, 2>> buffers =
         std::move(buffer).Split(split_index);
     if (!buffers) { return false; }
     socket_1_.Send(std::move(buffers->at(0)));
     socket_2_.Send(std::move(buffers->at(1)));
     return true;
   }
Incoming buffers can be split without copying, and the resulting pieces can be
forwarded to multiple different ``Socket`` s, RPC services, or clients.
----------
Motivation
----------
Today, a Pigweed communications stack typically involves a number of different
buffers.
``pw_rpc`` users, for example, frequently use direct memory access (DMA) or
other system primitives to read data into a buffer, apply a link-layer protocol
such as HDLC that copies the data into a second buffer, and then pass the result
into ``pw_rpc``, which parses it into a buffer of its own. Multiple sets of
buffers are present on the output side as well. Between DMA in and DMA out, it's
easy for data to pass through six or more different buffers.
These independent buffer systems introduce both time and space overhead. Aside
from the additional CPU time required to move the data around, users need to
ensure that all of the different buffer pools involved along the way have enough
space reserved to contain the entire message. Where caching is present, moving
the data between locations may introduce additional delay by thrashing between
memory regions.
--------
Proposal
--------
``pw::buffers::MultiBuf`` is a handle to a buffer optimized for use within
Pigweed's communications stack. It provides efficient, low-overhead buffer
management, and serves as a standard type for passing data between drivers,
TCP/IP implementations, RPC 2.0, and transfer 2.0.
A single ``MultiBuf`` is a list of ``Chunk`` s of data. Each ``Chunk``
points to an exclusively-owned portion of a reference-counted allocation.
``MultiBuf`` s can be easily split, joined, prefixed, postfixed, or infixed
without needing to copy the underlying data.
The memory pointed to by ``Chunk`` s is typically allocated from a pool
provided by a ``Socket``. This allows the ``Socket`` to provide backpressure to
callers, and to ensure that memory is placed in DMA or shared memory regions
as necessary.
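For example, a driver consuming such a buffer could hand each ``Chunk`` directly
to a scatter-gather DMA engine. In this sketch, ``DmaDriver`` and its methods
are hypothetical, and the chunks are assumed to already live in DMA-able memory
handed out by the socket's allocator:

.. code-block:: cpp

   void QueueForDma(pw::MultiBuf& buffer, DmaDriver& dma) {
     // Each chunk was placed in DMA-able memory by the socket's allocator, so
     // no staging copy is needed before handing it to the hardware.
     for (auto& chunk : buffer.Chunks()) {
       dma.QueueDescriptor(chunk.data(), chunk.size());  // Hypothetical API.
     }
     dma.Start();  // Hypothetical API.
   }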
In-Memory Layout
================
This diagram shows an example in-memory relationship between two buffers:
- ``Buffer1`` references one chunk from region A.
- ``Buffer2`` references two chunks, one each from regions A and B.
.. mermaid::

   graph TB;
     Buffer1 --> Chunk1A
     Chunk1A -- "[0..64]" --> MemoryRegionA(Memory Region A)
     Chunk1A --> ChunkRegionTrackerA
     Buffer2 --> Chunk2A & Chunk2B
     Chunk2A --> ChunkRegionTrackerA
     Chunk2A -- "[64..128]" --> MemoryRegionA(Memory Region A)
     Chunk2B -- "[0..128]" --> MemoryRegionB
     Chunk2B --> ChunkRegionTrackerB
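Using the API defined in the next section, this layout could be constructed
roughly as follows (namespace qualifiers omitted, and ``ChunkRegionTracker``
treated as directly constructible purely for illustration):

.. code-block:: cpp

   std::array<std::byte, 128> region_a_storage;
   std::array<std::byte, 128> region_b_storage;
   ChunkRegionTracker region_a(region_a_storage);
   ChunkRegionTracker region_b(region_b_storage);

   // Split region A's single chunk into [0..64) and [64..128).
   std::array<Chunk, 2> a_chunks = region_a.ChunkForRegion().Split(64);

   // Buffer1 references one chunk from region A.
   std::optional<MultiBuf> buffer_1 = MultiBuf::FromChunk(std::move(a_chunks[0]));

   // Buffer2 references the second chunk from region A plus all of region B.
   std::optional<MultiBuf> buffer_2 = MultiBuf::FromChunk(std::move(a_chunks[1]));
   std::optional<MultiBuf> region_b_buf = MultiBuf::FromChunk(region_b.ChunkForRegion());
   if (buffer_2 && region_b_buf) {
     bool appended = buffer_2->Append(std::move(*region_b_buf));
   }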
API
===
The primary API is as follows:
.. code-block:: cpp

   /// An object that manages a single allocated region which is referenced
   /// by one or more chunks.
   class ChunkRegionTracker {
    public:
     ChunkRegionTracker(ByteSpan);

     /// Creates the first ``Chunk`` referencing a whole region of memory.
     /// This must only be called once per ``ChunkRegionTracker``.
     Chunk ChunkForRegion();

    protected:
     pw::Mutex& lock();

     /// Destroys the `ChunkRegionTracker`.
     ///
     /// Typical implementations will call `std::destroy_at(this)` and then
     /// free the memory associated with the tracker.
     virtual void Destroy();

     /// Defines the entire span of the region being managed. ``Chunk`` s
     /// referencing this tracker may not expand beyond this region
     /// (or into one another).
     ///
     /// This region must not change for the lifetime of this
     /// ``ChunkRegionTracker``.
     virtual ByteSpan region();

    private:
     /// Used to manage the internal linked-list of ``Chunk`` s that allows
     /// chunks to know whether or not they can expand to fill neighboring
     /// regions of memory.
     pw::Mutex lock_;
   };

   /// A handle to a contiguous refcounted slice of data.
   ///
   /// Note: this Chunk may acquire a ``pw::sync::Mutex`` during destruction, and
   /// so must not be destroyed within ISR context.
   class Chunk {
    public:
     Chunk();
     Chunk(ChunkRegionTracker&);
     Chunk(Chunk&& other) noexcept;
     Chunk& operator=(Chunk&& other);
     ~Chunk();

     std::byte* data();
     const std::byte* data() const;
     ByteSpan span();
     ConstByteSpan span() const;
     size_t size() const;

     std::byte& operator[](size_t index);
     std::byte operator[](size_t index) const;

     /// Decrements the reference count on the underlying chunk of data and empties
     /// this handle so that `span()` now returns an empty (zero-sized) span.
     ///
     /// Does not modify the underlying data, but may cause it to be
     /// released if this was the only remaining ``Chunk`` referring to it.
     /// This is equivalent to ``{ Chunk _unused = std::move(chunk_ref); }``
     void Release();

     /// Attempts to add `prefix_bytes` bytes to the front of this buffer by
     /// advancing its range backwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes preceding it. For example, a `Chunk`
     /// which has been shrunk using `DiscardFront` can then be re-expanded using
     /// `ClaimPrefix`.
     ///
     /// Note that this will fail if a preceding `Chunk` has claimed this
     /// memory using `ClaimSuffix`.
     [[nodiscard]] bool ClaimPrefix(size_t prefix_bytes);

     /// Attempts to add `suffix_bytes` bytes to the back of this buffer by
     /// advancing its range forwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes following it. For example, a `Chunk`
     /// which has been shrunk using `Truncate` can then be re-expanded using
     /// `ClaimSuffix`.
     ///
     /// Note that this will fail if a following `Chunk` has claimed this
     /// memory using `ClaimPrefix`.
     [[nodiscard]] bool ClaimSuffix(size_t suffix_bytes);

     /// Shrinks this handle to refer to the data beginning at offset ``new_start``.
     ///
     /// Does not modify the underlying data.
     void DiscardFront(size_t new_start);

     /// Shrinks this handle to refer to data in the range ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     void Slice(size_t begin, size_t end);

     /// Shrinks this handle to refer to only the first ``len`` bytes.
     ///
     /// Does not modify the underlying data.
     void Truncate(size_t len);

     /// Splits this chunk in two, with the second chunk starting at `split_point`.
     std::array<Chunk, 2> Split(size_t split_point) &&;
   };

   /// A handle to a sequence of potentially non-contiguous refcounted slices of
   /// data.
   class MultiBuf {
    public:
     struct Index {
       size_t chunk_index;
       size_t byte_index;
     };

     MultiBuf();

     /// Creates a ``MultiBuf`` pointing to a single, contiguous chunk of data.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of memory.
     static std::optional<MultiBuf> FromChunk(Chunk chunk);

     /// Splits the ``MultiBuf`` into two separate buffers at ``split_point``.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of memory.
     std::optional<std::array<MultiBuf, 2>> Split(Index split_point) &&;
     std::optional<std::array<MultiBuf, 2>> Split(size_t split_point) &&;

     /// Appends the contents of ``suffix`` to this ``MultiBuf`` without copying data.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Append(MultiBuf suffix);

     /// Discards the first elements of ``MultiBuf`` up to (but not including)
     /// ``new_start``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardFront(Index new_start);
     [[nodiscard]] bool DiscardFront(size_t new_start);

     /// Shifts and truncates this handle to refer to data in the range
     /// ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Slice(size_t begin, size_t end);

     /// Discards the tail of this ``MultiBuf`` starting with ``first_index_to_drop``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(Index first_index_to_drop);

     /// Discards the tail of this ``MultiBuf`` so that only ``len`` elements remain.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(size_t len);

     /// Discards the elements beginning with ``cut_start`` and resuming at
     /// ``resume_point``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardSegment(Index cut_start, Index resume_point);

     /// Returns an iterable over the ``Chunk`` s of memory within this ``MultiBuf``.
     auto Chunks();
     auto Chunks() const;

     /// Returns a ``BidirectionalIterator`` over the bytes in this ``MultiBuf``.
     ///
     /// Note that use of this iterator type may be less efficient than
     /// performing chunk-wise operations due to the noncontiguous nature of
     /// the iterator elements.
     auto begin();
     auto end();

     /// Counts the total number of ``Chunk`` s.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateNumChunks() const;

     /// Counts the total size in bytes of all ``Chunk`` s combined.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateTotalSize() const;

     /// Returns an ``Index`` which can be used to provide constant-time access to
     /// the element at ``position``.
     ///
     /// Any mutation of this ``MultiBuf`` (e.g. ``Append``, ``DiscardFront``,
     /// ``Slice``, or ``Truncate``) may invalidate this ``Index``.
     Index IndexAt(size_t position) const;
   };

   class MultiBufAllocationFuture {
    public:
     Poll<std::optional<MultiBuf>> Poll(Context&);
   };

   class MultiBufChunkAllocationFuture {
    public:
     Poll<std::optional<MultiBuf::Chunk>> Poll(Context&);
   };

   class MultiBufAllocator {
    public:
     std::optional<MultiBuf> Allocate(size_t size);
     std::optional<MultiBuf> Allocate(size_t min_size, size_t desired_size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t min_size, size_t desired_size);
     MultiBufAllocationFuture AllocateAsync(size_t size);
     MultiBufAllocationFuture AllocateAsync(size_t min_size, size_t desired_size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t min_size, size_t desired_size);

    protected:
     virtual std::optional<MultiBuf> DoAllocate(size_t min_size, size_t desired_size);
     virtual std::optional<MultiBuf::Chunk> DoAllocateContiguous(size_t min_size, size_t desired_size);

     /// Invoked by the ``MultiBufAllocator`` to signal waiting futures that buffers of
     /// the provided sizes may be available for allocation.
     void AllocationAvailable(size_t size_available, size_t contiguous_size_available);
   };
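As a brief usage sketch of this API (assuming a ``MultiBufAllocator& allocator``;
sizes and variable names are illustrative only):

.. code-block:: cpp

   std::optional<MultiBuf> buffer =
       allocator.Allocate(/*min_size=*/128, /*desired_size=*/1024);
   if (!buffer) { return; }  // Nothing available; apply backpressure upstream.

   // `IndexAt` performs the O(n) walk of the chunk list once, so that the
   // subsequent `Split` can reuse the precomputed position.
   MultiBuf::Index split_point = buffer->IndexAt(64);
   std::optional<std::array<MultiBuf, 2>> halves = std::move(*buffer).Split(split_point);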
The ``Chunk`` s in the prototype are stored in fallible, dynamically allocated
arrays, but they could instead be stored in vectors of a fixed maximum size. The
``Chunk`` s cannot be stored as an intrusively-linked list because this would
require per-``Chunk`` overhead in the underlying buffer, and there may be any
number of ``Chunk`` s within the same buffer.
Multiple ``Chunk`` s may not refer to the same memory, but they may refer to
non-overlapping sections of memory within the same region. When one ``Chunk``
within a region is deallocated, a neighboring chunk may expand to use its space.
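For example, given a ``ChunkRegionTracker`` named ``region_tracker`` managing a
128-byte region, a sketch of this expansion might look like:

.. code-block:: cpp

   // Two neighboring chunks carved out of one region.
   std::array<Chunk, 2> chunks = region_tracker.ChunkForRegion().Split(64);

   // Release the second chunk; its bytes become unreferenced but remain part
   // of the region.
   chunks[1].Release();

   // The first chunk can now grow forwards into the freed space.
   bool grew = chunks[0].ClaimSuffix(64);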
--------------------
Vectorized Structure
--------------------
The most significant design choice made in this API is supporting vectorized
IO via a list of ``Chunk`` s. While this does carry an additional overhead, it
is strongly motivated by the desire to carry data throughout the communications
stack without needing a copy. By carrying a list of ``Chunk`` s, ``MultiBuf``
allows data to be prefixed, suffixed, infixed, or split without incurring the
overhead of a separate allocation and copy.
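For instance, a link layer could prepend a freshly written header ``Chunk`` to an
existing ``pw::MultiBuf`` payload without touching the payload bytes. In this
sketch, ``header_chunk`` and ``payload`` are assumed to already exist:

.. code-block:: cpp

   // Wrap the header in its own single-chunk MultiBuf.
   std::optional<MultiBuf> framed = MultiBuf::FromChunk(std::move(header_chunk));
   if (!framed) { return; }

   // Splice the existing payload after the header; the payload chunks are
   // moved into `framed` rather than copied.
   if (!framed->Append(std::move(payload))) { return; }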
--------------------------------------------------------------------------
Managing Allocations with ``MultiBufAllocator`` and ``ChunkRegionTracker``
--------------------------------------------------------------------------
``pw::MultiBuf`` is not associated with a concrete allocator implementation.
Instead, it delegates the creation of buffers to implementations of
the ``MultiBufAllocator`` base class. This allows different allocator
implementations to vend out ``pw::MultiBuf`` s that are optimized for use with a
particular communications stack.
For example, a communications stack which runs off of shared memory or specific
DMA'able regions might choose to allocate memory out of those regions to allow
for zero-copy writes.
Additionally, custom allocator implementations can reserve headers, footers, or
even split a ``pw::MultiBuf`` into multiple chunks in order to provide zero-copy
fragmentation.
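For instance, an allocator serving the DMA use case above might look roughly like
this sketch, where ``dma_pool_`` and its ``AllocateChunk`` method are hypothetical
stand-ins for a platform-specific pool:

.. code-block:: cpp

   class DmaRegionAllocator : public MultiBufAllocator {
    protected:
     std::optional<MultiBuf::Chunk> DoAllocateContiguous(
         size_t min_size, size_t desired_size) override {
       // Carve a chunk out of a dedicated DMA-able pool, preferring the
       // desired size but settling for anything >= min_size.
       return dma_pool_.AllocateChunk(min_size, desired_size);  // Hypothetical.
     }

     std::optional<MultiBuf> DoAllocate(size_t min_size, size_t desired_size) override {
       std::optional<MultiBuf::Chunk> chunk = DoAllocateContiguous(min_size, desired_size);
       if (!chunk) { return std::nullopt; }
       return MultiBuf::FromChunk(std::move(*chunk));
     }
   };

A real implementation would also call ``AllocationAvailable`` when memory is
returned to its pool so that pending ``AllocateAsync`` futures can be woken.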
Deallocation of these regions is managed through the ``ChunkRegionTracker``.
When no more ``Chunk`` s associated with a ``ChunkRegionTracker`` exist,
the ``ChunkRegionTracker`` receives a ``Destroy`` call to release both the
region and the ``ChunkRegionTracker``.
The ``MultiBufAllocator`` can place the ``ChunkRegionTracker`` wherever it
wishes, including as a header or footer for the underlying region allocation.
This is not required, however, as memory in regions like shared or DMA'able
memory might be limited, in which case the ``ChunkRegionTracker`` can be
stored elsewhere.
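A sketch of the header layout, assuming a hypothetical ``raw_allocator_`` that
hands out plain byte spans (alignment handling omitted):

.. code-block:: cpp

   // The tracker lives at the front of the raw allocation, and the chunk
   // region is everything after it.
   std::optional<Chunk> AllocateChunkWithInlineTracker(size_t size) {
     ByteSpan raw = raw_allocator_.Allocate(sizeof(ChunkRegionTracker) + size);  // Hypothetical.
     if (raw.empty()) { return std::nullopt; }
     auto* tracker = new (raw.data())
         ChunkRegionTracker(raw.subspan(sizeof(ChunkRegionTracker)));
     // `tracker->Destroy()` is later responsible for destroying itself and
     // returning the whole allocation to `raw_allocator_`.
     return tracker->ChunkForRegion();
   }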
-----------------------------------------------------
Compatibility With Existing Communications Interfaces
-----------------------------------------------------
lwIP
====
`Lightweight IP stack (lwIP)
<https://www.nongnu.org/lwip>`_
is a TCP/IP implementation designed for use on small embedded systems. Some
Pigweed platforms may choose to use lwIP as the backend for their ``Socket``
implementations, so it's important that ``pw::MultiBuf`` interoperate efficiently
with lwIP's native buffer type.
lwIP has its own buffer type, `pbuf
<https://www.nongnu.org/lwip/2_1_x/structpbuf.html>`_, which is optimized for use
in `zero-copy applications
<https://www.nongnu.org/lwip/2_1_x/zerocopyrx.html>`_.
``pbuf`` represents buffers as linked lists of reference-counted ``pbuf`` s
which each have a pointer to their payload, a length, and some metadata. While
this does not precisely match the representation of ``pw::MultiBuf``, it is
possible to construct a ``pbuf`` list which points at the various chunks of a
``pw::MultiBuf`` without needing to perform a copy of the data.
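A sketch of that construction (``ToPbufChain`` is a hypothetical helper; the
``pw::MultiBuf`` must outlive the resulting chain because ``PBUF_REF`` pbufs
reference, rather than own, their payload memory):

.. code-block:: cpp

   struct pbuf* ToPbufChain(pw::MultiBuf& buffer) {
     struct pbuf* head = nullptr;
     for (auto& chunk : buffer.Chunks()) {
       // Allocate only the pbuf header; its payload points into the chunk.
       struct pbuf* p =
           pbuf_alloc(PBUF_RAW, static_cast<u16_t>(chunk.size()), PBUF_REF);
       if (p == nullptr) { return nullptr; }  // Freeing `head` omitted for brevity.
       p->payload = chunk.data();
       if (head == nullptr) {
         head = p;
       } else {
         pbuf_cat(head, p);  // Append to the chain without copying payload bytes.
       }
     }
     return head;
   }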
Similarly, a ``pw::MultiBuf`` can refer to a ``pbuf`` by using each ``pbuf`` as
a "``Chunk`` region", holding a reference count on the region's ``pbuf`` so
long as any ``Chunk`` continues to reference the data referenced by that
buffer.
------------------
Existing Solutions
------------------
Linux's ``sk_buff``
===================
Linux has a similar buffer structure called `sk_buff
<https://docs.kernel.org/networking/skbuff.html#struct-sk-buff>`_.
It differs in a few significant ways:
- It provides separate ``head``, ``data``, ``tail``, and ``end`` pointers.
  Other scatter-gather fragments are supplied using the ``frags[]`` structure.
  Separately, it has a ``frags_list`` of IP fragments which is created via calls
  to ``ip_push_pending_frames``. Fragments are supplied by the ``frags[]``
  payload in addition to the ``skb_shared_info.frag_list`` pointing to the tail.
- ``sk_buff`` reference-counts not only the underlying chunks of memory, but
  also the ``sk_buff`` structure itself. This allows an ``sk_buff`` to be cloned
  without manipulating the reference counts of the individual chunks. We
  anticipate that cloning a whole ``pw::buffers::MultiBuf`` will be uncommon
  enough that it is better to keep these structures single-owner (and mutable)
  rather than sharing and reference-counting them.
FreeBSD's ``mbuf``
==================
FreeBSD uses a design called `mbuf
<https://man.freebsd.org/cgi/man.cgi?query=mbuf>`_
which, interestingly, allows data within the middle of a buffer to be given a
specified alignment so that data can be prepended within the same buffer.
This is similar to the structure of ``Chunk``, which may reference data in the
middle of an allocated buffer, allowing prepending without a copy.