| .. _seed-0109: |
| |
| =========================== |
| 0109: Communication Buffers |
| =========================== |
| .. seed:: |
| :number: 109 |
| :name: Communication Buffers |
| :status: Accepted |
| :proposal_date: 2023-08-28 |
| :cl: 168357 |
| :authors: Taylor Cramer |
| :facilitator: Erik Gilling |
| |
| ------- |
| Summary |
| ------- |
| This SEED proposes that Pigweed adopt a standard buffer type for network-style |
| communications. This buffer type will be used in the new sockets API |
| (see the recently-accepted `Communications SEED 107 |
| <https://pigweed.dev/seed/0107-communications.html>`_ for more details), and |
will be carried throughout the communications stack as appropriate.
| |
| ------------------------ |
| Top-Level Usage Examples |
| ------------------------ |
| |
| Sending a Proto Into a Socket |
| ============================= |
| .. code-block:: cpp |
| |
| Allocator& msg_allocator = socket.Allocator(); |
| size_t size = EncodedProtoSize(some_proto); |
| std::optional<pw::MultiBuf> buffer = msg_allocator.Allocate(size); |
| if (!buffer) { return; } |
| EncodeProto(some_proto, *buffer); |
| socket.Send(std::move(*buffer)); |
| |
``Socket`` s provide an allocator which should be used to create a
``pw::MultiBuf``. The data to be sent is then filled into this buffer before
being sent. ``Socket`` s must accept ``pw::MultiBuf`` s which were not created
by their own ``Allocator``, but this may incur a performance penalty.
| |
| Zero-Copy Fragmentation |
| ======================= |
| |
| The example above has a hidden feature: zero-copy fragmentation! The socket |
| can provide a ``pw::MultiBuf`` which is divided up into separate MTU-sized |
| chunks, each of which has reserved space for headers and/or footers: |
| |
| .. code-block:: cpp |
| |
| std::optional<pw::MultiBuf> Allocate(size_t size) { |
| if (size == 0) { return pw::MultiBuf(); } |
| size_t data_per_chunk = mtu_size_ - header_size_; |
| size_t num_chunks = 1 + ((size - 1) / data_per_chunk); |
| std::optional<pw::MultiBuf> buffer = pw::MultiBuf::WithCapacity(num_chunks); |
| if (!buffer) { return std::nullopt; } |
| for (size_t i = 0; i < num_chunks; i++) { |
| // Note: we could allocate smaller than `mtu_size_` for the final |
// chunk. This is omitted here for brevity.
| std::optional<pw::MultiBuf::Chunk> chunk = internal_alloc_.Allocate(mtu_size_); |
| if (!chunk) { return std::nullopt; } |
| // Reserve the header size by discarding the front of each buffer. |
| chunk->DiscardFront(header_size_); |
buffer->Chunks()[i] = std::move(*chunk);
| } |
return buffer;
| } |
| |
| This code reserves ``header_size_`` bytes at the beginning of each ``Chunk``. |
| When the caller writes into this memory and then passes it back to the socket, |
| these bytes can be claimed for the header using the ``ClaimPrefix`` function. |
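
As a hedged sketch of that flow (``SendWithHeaders``, ``WriteHeader``, and the
socket's ``header_size_`` member are hypothetical names for this example):

.. code-block:: cpp

   // Sketch: the socket re-expands each chunk into the reserved space and
   // writes its header there, without copying any payload bytes.
   void SendWithHeaders(pw::MultiBuf&& buffer) {
     for (pw::MultiBuf::Chunk& chunk : buffer.Chunks()) {
       // Succeeds because `DiscardFront` left these bytes unreferenced.
       PW_ASSERT(chunk.ClaimPrefix(header_size_));
       // Hypothetical helper which fills in the link-layer header.
       WriteHeader(chunk.span().first(header_size_));
     }
     // ... queue the framed chunks for transmission ...
   }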
| |
One use case that seems to demand the ability to fragment like this is breaking
up ``SOCK_SEQPACKET`` messages, which (at least on Unix / Linux) may be much larger
| than the MTU size: up to ``SO_SNDBUF`` (see `this man page |
| <https://man7.org/linux/man-pages/man7/socket.7.html>`_). |
| |
| Multiplexing Incoming Data |
| ========================== |
| |
| .. code-block:: cpp |
| |
[[nodiscard]] bool SplitAndSend(pw::MultiBuf buffer, size_t split_index) {
| std::optional<std::array<pw::MultiBuf, 2>> buffers = |
| std::move(buffer).Split(split_index); |
| if (!buffers) { return false; } |
| socket_1_.Send(std::move(buffers->at(0))); |
| socket_2_.Send(std::move(buffers->at(1))); |
| return true; |
| } |
| |
| Incoming buffers can be split without copying, and the results can be forwarded |
to multiple different ``Socket`` s, RPC services, or clients.
| |
| ---------- |
| Motivation |
| ---------- |
| Today, a Pigweed communications stack typically involves a number of different |
| buffers. |
| |
``pw_rpc`` users, for example, frequently use direct memory access (DMA) or
other system primitives to read data into a buffer, apply a link-layer
protocol such as HDLC (which copies the data into a second buffer), and then
pass the result into ``pw_rpc``, which parses it into its own buffer. Multiple
sets of buffers are present on the output side as well. Between DMA in and
DMA out, it's easy for data to pass through six or more different buffers.
| |
These independent buffer systems introduce both time and space overhead. Aside
from the additional CPU time required to move the data around, users need to
ensure that every buffer pool along the way has enough space reserved to hold
the entire message. Where caching is present, copying the data between
locations can introduce further delays by thrashing the cache.
| |
| -------- |
| Proposal |
| -------- |
| ``pw::buffers::MultiBuf`` is a handle to a buffer optimized for use within |
| Pigweed's communications stack. It provides efficient, low-overhead buffer |
| management, and serves as a standard type for passing data between drivers, |
| TCP/IP implementations, RPC 2.0, and transfer 2.0. |
| |
| A single ``MultiBuf`` is a list of ``Chunk`` s of data. Each ``Chunk`` |
| points to an exclusively-owned portion of a reference-counted allocation. |
``MultiBuf`` s can be easily split, joined, prefixed, suffixed, or infixed
without needing to copy the underlying data.
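
As a minimal sketch of the joining operation (``allocator``, ``kHeaderSize``,
and ``kPayloadSize`` are assumptions for illustration):

.. code-block:: cpp

   std::optional<pw::MultiBuf> header = allocator.Allocate(kHeaderSize);
   std::optional<pw::MultiBuf> payload = allocator.Allocate(kPayloadSize);
   if (!header || !payload) { return; }
   // `Append` splices the chunk lists together; no payload bytes move.
   if (!header->Append(std::move(*payload))) { return; }
   // `*header` now refers to both allocations in sequence.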
| |
| The memory pointed to by ``Chunk`` s is typically allocated from a pool |
| provided by a ``Socket``. This allows the ``Socket`` to provide backpressure to |
| callers, and to ensure that memory is placed in DMA or shared memory regions |
as necessary.
| |
| In-Memory Layout |
| ================ |
| |
This diagram shows an example in-memory relationship between two buffers:

- ``Buffer1`` references one chunk from region A.
- ``Buffer2`` references two chunks, one each from regions A and B.
| |
| .. mermaid:: |
| |
| graph TB; |
| Buffer1 --> Chunk1A |
| Chunk1A -- "[0..64]" --> MemoryRegionA(Memory Region A) |
| Chunk1A --> ChunkRegionTrackerA |
| Buffer2 --> Chunk2A & Chunk2B |
| Chunk2A --> ChunkRegionTrackerA |
| Chunk2A -- "[64..128]" --> MemoryRegionA(Memory Region A) |
| Chunk2B -- "[0..128]" --> MemoryRegionB |
| Chunk2B --> ChunkRegionTrackerB |
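
The following sketch shows how this layout could be constructed with the API
described below, assuming ``ChunkRegionTracker`` is directly constructible over
two caller-provided 128-byte regions:

.. code-block:: cpp

   std::array<std::byte, 128> region_a;
   std::array<std::byte, 128> region_b;
   ChunkRegionTracker tracker_a(region_a);
   ChunkRegionTracker tracker_b(region_b);

   // Split region A's chunk at byte 64. Both halves share `tracker_a`.
   std::array<Chunk, 2> chunks_a = tracker_a.ChunkForRegion().Split(64);

   // `Buffer1` references one chunk from region A ([0..64]).
   std::optional<MultiBuf> buffer1 = MultiBuf::FromChunk(std::move(chunks_a[0]));

   // `Buffer2` references the rest of region A ([64..128]) plus region B.
   std::optional<MultiBuf> buffer2 = MultiBuf::FromChunk(std::move(chunks_a[1]));
   std::optional<MultiBuf> tail = MultiBuf::FromChunk(tracker_b.ChunkForRegion());
   if (buffer1 && buffer2 && tail) {
     PW_ASSERT(buffer2->Append(std::move(*tail)));
   }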
| |
| API |
| === |
| |
| The primary API is as follows: |
| |
| .. code-block:: cpp |
| |
| /// An object that manages a single allocated region which is referenced |
| /// by one or more chunks. |
| class ChunkRegionTracker { |
| public: |
| ChunkRegionTracker(ByteSpan); |
| |
| /// Creates the first ``Chunk`` referencing a whole region of memory. |
| /// This must only be called once per ``ChunkRegionTracker``. |
| Chunk ChunkForRegion(); |
| |
| protected: |
pw::sync::Mutex& lock();
| |
| /// Destroys the `ChunkRegionTracker`. |
| /// |
| /// Typical implementations will call `std::destroy_at(this)` and then |
| /// free the memory associated with the tracker. |
| virtual void Destroy(); |
| |
| /// Defines the entire span of the region being managed. ``Chunk`` s |
| /// referencing this tracker may not expand beyond this region |
| /// (or into one another). |
| /// |
| /// This region must not change for the lifetime of this |
| /// ``ChunkRegionTracker``. |
| virtual ByteSpan region(); |
| |
| private: |
| /// Used to manage the internal linked-list of ``Chunk`` s that allows |
| /// chunks to know whether or not they can expand to fill neighboring |
| /// regions of memory. |
pw::sync::Mutex lock_;
| }; |
| |
| /// A handle to a contiguous refcounted slice of data. |
| /// |
| /// Note: this Chunk may acquire a ``pw::sync::Mutex`` during destruction, and |
| /// so must not be destroyed within ISR context. |
| class Chunk { |
| public: |
| Chunk(); |
| Chunk(ChunkRegionTracker&); |
| Chunk(Chunk&& other) noexcept; |
| Chunk& operator=(Chunk&& other); |
| ~Chunk(); |
| std::byte* data(); |
| const std::byte* data() const; |
| ByteSpan span(); |
| ConstByteSpan span() const; |
| size_t size() const; |
| |
| std::byte& operator[](size_t index); |
| std::byte operator[](size_t index) const; |
| |
| /// Decrements the reference count on the underlying chunk of data and empties |
| /// this handle so that `span()` now returns an empty (zero-sized) span. |
| /// |
| /// Does not modify the underlying data, but may cause it to be |
| /// released if this was the only remaining ``Chunk`` referring to it. |
| /// This is equivalent to ``{ Chunk _unused = std::move(chunk_ref); }`` |
| void Release(); |
| |
| /// Attempts to add `prefix_bytes` bytes to the front of this buffer by |
| /// advancing its range backwards in memory. Returns `true` if the |
| /// operation succeeded. |
| /// |
| /// This will only succeed if this `Chunk` points to a section of a chunk |
/// that has unreferenced bytes preceding it. For example, a `Chunk`
| /// which has been shrunk using `DiscardFront` can then be re-expanded using |
| /// `ClaimPrefix`. |
| /// |
| /// Note that this will fail if a preceding `Chunk` has claimed this |
| /// memory using `ClaimSuffix`. |
| [[nodiscard]] bool ClaimPrefix(size_t prefix_bytes); |
| |
| /// Attempts to add `suffix_bytes` bytes to the back of this buffer by |
| /// advancing its range forwards in memory. Returns `true` if the |
| /// operation succeeded. |
| /// |
| /// This will only succeed if this `Chunk` points to a section of a chunk |
| /// that has unreferenced bytes following it. For example, a `Chunk` |
| /// which has been shrunk using `Truncate` can then be re-expanded using |
| /// `ClaimSuffix`. |
| /// |
| /// Note that this will fail if a following `Chunk` has claimed this |
| /// memory using `ClaimPrefix`. |
| [[nodiscard]] bool ClaimSuffix(size_t suffix_bytes); |
| |
| /// Shrinks this handle to refer to the data beginning at offset ``new_start``. |
| /// |
| /// Does not modify the underlying data. |
| void DiscardFront(size_t new_start); |
| |
| /// Shrinks this handle to refer to data in the range ``begin..<end``. |
| /// |
| /// Does not modify the underlying data. |
| void Slice(size_t begin, size_t end); |
| |
| /// Shrinks this handle to refer to only the first ``len`` bytes. |
| /// |
| /// Does not modify the underlying data. |
| void Truncate(size_t len); |
| |
| /// Splits this chunk in two, with the second chunk starting at `split_point`. |
| std::array<Chunk, 2> Split(size_t split_point) &&; |
| }; |
| |
| /// A handle to a sequence of potentially non-contiguous refcounted slices of |
| /// data. |
| class MultiBuf { |
| public: |
| struct Index { |
| size_t chunk_index; |
| size_t byte_index; |
| }; |
| |
| MultiBuf(); |
| |
| /// Creates a ``MultiBuf`` pointing to a single, contiguous chunk of data. |
| /// |
| /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of memory. |
| static std::optional<MultiBuf> FromChunk(Chunk chunk); |
| |
| /// Splits the ``MultiBuf`` into two separate buffers at ``split_point``. |
| /// |
| /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of memory. |
| std::optional<std::array<MultiBuf, 2>> Split(Index split_point) &&; |
| std::optional<std::array<MultiBuf, 2>> Split(size_t split_point) &&; |
| |
| /// Appends the contents of ``suffix`` to this ``MultiBuf`` without copying data. |
| /// Returns ``false`` if the ``ChunkList`` allocator is out of memory. |
| [[nodiscard]] bool Append(MultiBuf suffix); |
| |
| /// Discards the first elements of ``MultiBuf`` up to (but not including) |
| /// ``new_start``. |
| /// |
| /// Returns ``false`` if the ``ChunkList`` allocator is out of memory. |
| [[nodiscard]] bool DiscardFront(Index new_start); |
| [[nodiscard]] bool DiscardFront(size_t new_start); |
| |
| /// Shifts and truncates this handle to refer to data in the range |
/// ``begin..<end``.
| /// |
| /// Does not modify the underlying data. |
| /// |
| /// Returns ``false`` if the ``ChunkList`` allocator is out of memory. |
| [[nodiscard]] bool Slice(size_t begin, size_t end); |
| |
| /// Discards the tail of this ``MultiBuf`` starting with ``first_index_to_drop``. |
| /// Returns ``false`` if the ``ChunkList`` allocator is out of memory. |
| [[nodiscard]] bool Truncate(Index first_index_to_drop); |
| /// Discards the tail of this ``MultiBuf`` so that only ``len`` elements remain. |
| /// Returns ``false`` if the ``ChunkList`` allocator is out of memory. |
| [[nodiscard]] bool Truncate(size_t len); |
| |
| /// Discards the elements beginning with ``cut_start`` and resuming at |
| /// ``resume_point``. |
| /// |
| /// Returns ``false`` if the ``ChunkList`` allocator is out of memory. |
| [[nodiscard]] bool DiscardSegment(Index cut_start, Index resume_point); |
| |
/// Returns an iterable over the ``Chunk`` s of memory within this ``MultiBuf``.
| auto Chunks(); |
| auto Chunks() const; |
| |
| /// Returns a ``BidirectionalIterator`` over the bytes in this ``MultiBuf``. |
| /// |
| /// Note that use of this iterator type may be less efficient than |
| /// performing chunk-wise operations due to the noncontiguous nature of |
| /// the iterator elements. |
| auto begin(); |
| auto end(); |
| |
| /// Counts the total number of ``Chunk`` s. |
| /// |
| /// This function is ``O(n)`` in the number of ``Chunk`` s. |
| size_t CalculateNumChunks() const; |
| |
| /// Counts the total size in bytes of all ``Chunk`` s combined. |
| /// |
| /// This function is ``O(n)`` in the number of ``Chunk`` s. |
| size_t CalculateTotalSize() const; |
| |
| /// Returns an ``Index`` which can be used to provide constant-time access to |
| /// the element at ``position``. |
| /// |
| /// Any mutation of this ``MultiBuf`` (e.g. ``Append``, ``DiscardFront``, |
| /// ``Slice``, or ``Truncate``) may invalidate this ``Index``. |
| Index IndexAt(size_t position) const; |
| }; |
| |
| |
class MultiBufAllocationFuture {
public:
Poll<std::optional<MultiBuf>> Poll(Context&);
};

class MultiBufChunkAllocationFuture {
public:
Poll<std::optional<MultiBuf::Chunk>> Poll(Context&);
};
| |
| class MultiBufAllocator { |
| public: |
| std::optional<MultiBuf> Allocate(size_t size); |
| std::optional<MultiBuf> Allocate(size_t min_size, size_t desired_size); |
| std::optional<MultiBuf::Chunk> AllocateContiguous(size_t size); |
| std::optional<MultiBuf::Chunk> AllocateContiguous(size_t min_size, size_t desired_size); |
| |
| MultiBufAllocationFuture AllocateAsync(size_t size); |
| MultiBufAllocationFuture AllocateAsync(size_t min_size, size_t desired_size); |
| MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t size); |
| MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t min_size, size_t desired_size); |
| |
| protected: |
| virtual std::optional<MultiBuf> DoAllocate(size_t min_size, size_t desired_size); |
| virtual std::optional<MultiBuf::Chunk> DoAllocateContiguous(size_t min_size, size_t desired_size); |
| |
| /// Invoked by the ``MultiBufAllocator`` to signal waiting futures that buffers of |
| /// the provided sizes may be available for allocation. |
| void AllocationAvailable(size_t size_available, size_t contiguous_size_available); |
| }; |
| |
| |
| The ``Chunk`` s in the prototype are stored in fallible dynamically-allocated |
| arrays, but they could be stored in vectors of a fixed maximum size. The ``Chunk`` s |
| cannot be stored as an intrusively-linked list because this would require per-``Chunk`` |
| overhead in the underlying buffer, and there may be any number of ``Chunk`` s within |
| the same buffer. |
| |
| Multiple ``Chunk`` s may not refer to the same memory, but they may refer to |
| non-overlapping sections of memory within the same region. When one ``Chunk`` |
| within a region is deallocated, a neighboring chunk may expand to use its space. |
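
A brief sketch of that expansion, assuming ``chunk`` spans a whole 128-byte
region:

.. code-block:: cpp

   std::array<Chunk, 2> pair = std::move(chunk).Split(64);
   // Release the second chunk, leaving bytes [64..128] unreferenced.
   pair[1].Release();
   // The first chunk may now grow forwards into the freed space.
   PW_ASSERT(pair[0].ClaimSuffix(64));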
| |
| -------------------- |
| Vectorized Structure |
| -------------------- |
The most significant design choice made in this API is supporting vectorized
IO via a list of ``Chunk`` s. While this does carry additional overhead, it
| is strongly motivated by the desire to carry data throughout the communications |
| stack without needing a copy. By carrying a list of ``Chunk`` s, ``MultiBuf`` |
| allows data to be prefixed, suffixed, infixed, or split without incurring the |
| overhead of a separate allocation and copy. |
| |
| -------------------------------------------------------------------------- |
| Managing Allocations with ``MultiBufAllocator`` and ``ChunkRegionTracker`` |
| -------------------------------------------------------------------------- |
| ``pw::MultiBuf`` is not associated with a concrete allocator implementation. |
| Instead, it delegates the creation of buffers to implementations of |
| the ``MultiBufAllocator`` base class. This allows different allocator |
| implementations to vend out ``pw::MultiBuf`` s that are optimized for use with a |
| particular communications stack. |
| |
| For example, a communications stack which runs off of shared memory or specific |
| DMA'able regions might choose to allocate memory out of those regions to allow |
| for zero-copy writes. |
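
A hedged sketch of such an allocator follows; ``DmaPool`` and its methods are
hypothetical, while the overridden interface comes from ``MultiBufAllocator``
above:

.. code-block:: cpp

   class DmaMultiBufAllocator : public MultiBufAllocator {
    public:
     explicit DmaMultiBufAllocator(DmaPool& pool) : dma_pool_(pool) {}

    protected:
     std::optional<MultiBuf> DoAllocate(size_t min_size,
                                        size_t desired_size) override {
       // Carve the buffer out of a DMA'able region so the driver can
       // transmit directly from it, with no bounce-buffer copy.
       ByteSpan memory = dma_pool_.Reserve(min_size, desired_size);
       if (memory.empty()) { return std::nullopt; }
       // Hypothetical: the pool also stores the region's tracker.
       ChunkRegionTracker& tracker = dma_pool_.NewTracker(memory);
       return MultiBuf::FromChunk(tracker.ChunkForRegion());
     }

    private:
     DmaPool& dma_pool_;  // Hypothetical pool of DMA-capable memory.
   };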
| |
| Additionally, custom allocator implementations can reserve headers, footers, or |
| even split a ``pw::MultiBuf`` into multiple chunks in order to provide zero-copy |
| fragmentation. |
| |
| Deallocation of these regions is managed through the ``ChunkRegionTracker``. |
| When no more ``Chunk`` s associated with a ``ChunkRegionTracker`` exist, |
| the ``ChunkRegionTracker`` receives a ``Destroy`` call to release both the |
| region and the ``ChunkRegionTracker``. |
| |
| The ``MultiBufAllocator`` can place the ``ChunkRegionTracker`` wherever it |
| wishes, including as a header or footer for the underlying region allocation. |
| This is not required, however, as memory in regions like shared or DMA'able |
| memory might be limited, in which case the ``ChunkRegionTracker`` can be |
| stored elsewhere. |
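
For illustration, a minimal sketch of a tracker stored as a header of its own
region, assuming a plain ``malloc``/``free`` backing allocator:

.. code-block:: cpp

   class HeaderChunkRegionTracker : public ChunkRegionTracker {
    public:
     // Allocates the tracker and its region in one underlying allocation.
     static HeaderChunkRegionTracker* AllocateRegion(size_t data_size) {
       auto* memory = static_cast<std::byte*>(
           malloc(sizeof(HeaderChunkRegionTracker) + data_size));
       if (memory == nullptr) { return nullptr; }
       ByteSpan data(memory + sizeof(HeaderChunkRegionTracker), data_size);
       return new (memory) HeaderChunkRegionTracker(data);
     }

    protected:
     void Destroy() override {
       void* memory = this;
       std::destroy_at(this);
       free(memory);  // Frees the tracker and its region together.
     }

    private:
     using ChunkRegionTracker::ChunkRegionTracker;
   };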
| |
| ----------------------------------------------------- |
| Compatibility With Existing Communications Interfaces |
| ----------------------------------------------------- |
| |
| lwIP |
| ==== |
| `Lightweight IP stack (lwIP) |
| <https://www.nongnu.org/lwip>`_ |
| is a TCP/IP implementation designed for use on small embedded systems. Some |
| Pigweed platforms may choose to use lwIP as the backend for their ``Socket`` |
implementations, so it's important that ``pw::MultiBuf`` interoperate
efficiently with lwIP's native buffer type.
| |
| lwIP has its own buffer type, `pbuf |
<https://www.nongnu.org/lwip/2_1_x/structpbuf.html>`_, optimized for use with
| `zero-copy applications |
| <https://www.nongnu.org/lwip/2_1_x/zerocopyrx.html>`_. |
| ``pbuf`` represents buffers as linked lists of reference-counted ``pbuf`` s |
| which each have a pointer to their payload, a length, and some metadata. While |
| this does not precisely match the representation of ``pw::MultiBuf``, it is |
| possible to construct a ``pbuf`` list which points at the various chunks of a |
| ``pw::MultiBuf`` without needing to perform a copy of the data. |
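
As a sketch of that construction, using lwIP's ``PBUF_REF`` type (which points
at caller-owned memory rather than copying it); keeping the ``pw::MultiBuf``
alive until lwIP releases the chain is elided here:

.. code-block:: cpp

   pbuf* MultiBufToPbufChain(pw::MultiBuf& buffer) {
     pbuf* head = nullptr;
     for (pw::MultiBuf::Chunk& chunk : buffer.Chunks()) {
       // PBUF_REF pbufs leave `payload` to be pointed at external memory.
       pbuf* p = pbuf_alloc(PBUF_RAW, static_cast<u16_t>(chunk.size()), PBUF_REF);
       if (p == nullptr) { return nullptr; }  // Cleanup elided for brevity.
       p->payload = chunk.data();
       if (head == nullptr) {
         head = p;
       } else {
         pbuf_cat(head, p);  // Chains the pbufs; payload bytes never move.
       }
     }
     return head;
   }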
| |
| Similarly, a ``pw::MultiBuf`` can refer to a ``pbuf`` by using each ``pbuf`` as |
| a "``Chunk`` region", holding a reference count on the region's ``pbuf`` so |
| long as any ``Chunk`` continues to reference the data referenced by that |
| buffer. |
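
A hedged sketch of that wrapper; where the tracker's own storage lives (and
how it is freed) is elided, as discussed in the section on tracker placement
above:

.. code-block:: cpp

   class PbufChunkRegionTracker : public ChunkRegionTracker {
    public:
     explicit PbufChunkRegionTracker(pbuf* p)
         : ChunkRegionTracker(
               ByteSpan(static_cast<std::byte*>(p->payload), p->len)),
           pbuf_(p) {
       pbuf_ref(p);  // Hold a reference while any Chunk uses the payload.
     }

    protected:
     void Destroy() override {
       // The last Chunk is gone: drop our reference on the pbuf. Freeing
       // the tracker's own storage is elided here.
       pbuf_free(pbuf_);
     }

    private:
     pbuf* pbuf_;
   };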
| |
| ------------------ |
| Existing Solutions |
| ------------------ |
| |
| Linux's ``sk_buff`` |
| =================== |
| Linux has a similar buffer structure called `sk_buff |
| <https://docs.kernel.org/networking/skbuff.html#struct-sk-buff>`_. |
| It differs in a few significant ways: |
| |
| It provides separate ``head``, ``data``, ``tail``, and ``end`` pointers. |
| Other scatter-gather fragments are supplied using the ``frags[]`` structure. |
| |
Separately, it has a ``frag_list`` of IP fragments, which is created via calls
to ``ip_push_pending_frames``. Fragments may therefore be supplied both by the
``frags[]`` array and by the ``skb_shared_info.frag_list`` chain at the tail.
| |
| ``sk_buff`` reference-counts not only the underlying chunks of memory, but also the |
| ``sk_buff`` structure itself. This allows for clones of ``sk_buff`` without |
| manipulating the reference counts of the individual chunks. We anticipate that |
| cloning a whole ``pw::buffers::MultiBuf`` will be uncommon enough that it is |
| better to keep these structures single-owner (and mutable) rather than sharing |
| and reference-counting them. |
| |
| FreeBSD's ``mbuf`` |
| ================== |
| FreeBSD uses a design called `mbuf |
| <https://man.freebsd.org/cgi/man.cgi?query=mbuf>`_ |
| which interestingly allows data within the middle of a buffer to be given a |
| specified alignment, allowing data to be prepended within the same buffer. |
| This is similar to the structure of ``Chunk``, which may reference data in the |
| middle of an allocated buffer, allowing prepending without a copy. |