Remove `Reader::ReadOrPullSome()` together with the corresponding virtual functions:
`Reader::ReadOrPullSomeSlow()` and
`PullableReader::ReadOrPullSomeBehindScratch()`.

Instead, override `Reader::ReadSomeSlow(char*)` and
`Reader::CopySomeSlow(Writer&)`, which are now virtual.

Advantages:

* The overriding API is much more intuitive.

  `ReadOrPullSome()` was weird in how it allowed the `Reader` to choose whether
  to expose its data or to write to a destination buffer, how it allowed the
  `Reader` to communicate a tighter `max_length` to the destination, and how it
  allowed the destination to effectively override the choice if it needs the
  data in a particular place.

  The choice influences whether the destination should be asked for a buffer
  at all, and how large, hence `ReadOrPullSome()` used a callback for this.

  Now, `ReadSome(char*)` always copies to the destination buffer, and for
  `CopySome(Writer&)` the `Reader` calls `Writer::Write(absl::string_view)`,
  `Writer::Write(ExternalRef)`, or `Writer::Push()` as appropriate.

* In `ReadSome(std::string&)`, string allocation is done together with filling,
  which will allow using `absl::StringResizeAndOverwrite()` to avoid prefilling
  with zeros.
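
The shape of the new overriding API can be illustrated with a minimal sketch. Everything in it is hypothetical: `ToyReader`, `ToyWriter`, and `ChunkedReader` are stand-ins invented for illustration, and the signatures are simplified (the toy functions return the number of bytes transferred, which is not necessarily how the real `Reader` reports it). The toy has no buffer, so the non-virtual entry points always delegate to the overrides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a Writer, reduced to the one call used below.
class ToyWriter {
 public:
  bool Write(std::string_view src) {
    dest_.append(src);
    return true;
  }
  const std::string& dest() const { return dest_; }

 private:
  std::string dest_;
};

// Hypothetical stand-in for Reader, reduced to the two override points.
class ToyReader {
 public:
  virtual ~ToyReader() = default;

  // Non-virtual entry points. This toy has no buffer, so they always delegate.
  std::size_t ReadSome(std::size_t max_length, char* dest) {
    return max_length == 0 ? 0 : ReadSomeSlow(max_length, dest);
  }
  std::size_t CopySome(std::size_t max_length, ToyWriter& dest) {
    return max_length == 0 ? 0 : CopySomeSlow(max_length, dest);
  }

 protected:
  // Always copies into the caller's buffer.
  virtual std::size_t ReadSomeSlow(std::size_t max_length, char* dest) = 0;
  // The reader decides how to hand data to the writer.
  virtual std::size_t CopySomeSlow(std::size_t max_length, ToyWriter& dest) = 0;
};

// A source that hands out at most `chunk_size` bytes per call.
class ChunkedReader : public ToyReader {
 public:
  ChunkedReader(std::string data, std::size_t chunk_size)
      : data_(std::move(data)), chunk_size_(chunk_size) {
    assert(chunk_size_ > 0);
  }

 protected:
  std::size_t ReadSomeSlow(std::size_t max_length, char* dest) override {
    const std::string_view chunk = NextChunk(max_length);
    std::copy_n(chunk.data(), chunk.size(), dest);
    return chunk.size();
  }

  std::size_t CopySomeSlow(std::size_t max_length, ToyWriter& dest) override {
    // The position is advanced inside NextChunk(), before the write.
    const std::string_view chunk = NextChunk(max_length);
    return dest.Write(chunk) ? chunk.size() : 0;
  }

 private:
  // Logic shared by both overrides: pick the next chunk and advance.
  std::string_view NextChunk(std::size_t max_length) {
    const std::size_t length =
        std::min({max_length, chunk_size_, data_.size() - pos_});
    const std::string_view chunk(data_.data() + pos_, length);
    pos_ += length;
    return chunk;
  }

  std::string data_;
  std::size_t chunk_size_;
  std::size_t pos_ = 0;
};
```

In this sketch the part common to both overrides lives in a private helper, which is one way to limit the duplication that comes with having two functions to override.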

Disadvantages:

* There are two functions to override instead of one, which leads to some
  duplication of logic.

* `ReadSome(std::string&)` preallocates the whole `max_length`, instead of
  allowing the `Reader` to communicate if the maximum needed is smaller.
  A `Writer` is also hinted for the whole `max_length`. That optimization is
  deemed not worth the API complication.
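
The trade-off in `ReadSome(std::string&)` can be sketched as follows. This is an illustration under stated assumptions, not the code from this change: `FillSome()` and `ReadSomeToString()` are hypothetical names, and C++23 `std::string::resize_and_overwrite()` is used as a stand-in for `absl::StringResizeAndOverwrite()`, with a zero-filling fallback when the standard library does not provide it. The string is sized for the whole `max_length` up front and trimmed to what was actually read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical source: copies up to `max_length` bytes of `src` into `dest`
// and returns how many bytes it produced (possibly fewer than requested).
inline std::size_t FillSome(std::string_view src, std::size_t max_length,
                            char* dest) {
  const std::size_t length = std::min(max_length, src.size());
  std::copy_n(src.data(), length, dest);
  return length;
}

// Replaces the contents of `dest` with up to `max_length` bytes from `src`.
// Space for the whole `max_length` is allocated before it is known how much
// data will arrive; the string is then trimmed to the length actually read.
inline std::size_t ReadSomeToString(std::string_view src,
                                    std::size_t max_length,
                                    std::string& dest) {
  assert(max_length <= dest.max_size());
  dest.clear();
#ifdef __cpp_lib_string_resize_and_overwrite
  // Allocation and filling happen together, so nothing is prefilled.
  dest.resize_and_overwrite(max_length, [&](char* data, std::size_t) {
    return FillSome(src, max_length, data);
  });
#else
  // Fallback: resize() zero-fills the new space before it is overwritten.
  dest.resize(max_length);
  dest.resize(FillSome(src, max_length, &dest[0]));
#endif
  return dest.size();
}
```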

Other changes:

* Strengthen the preconditions of `ReadSomeSlow()` and `CopySomeSlow()` from
  `available() < max_length` to `available() == 0`. Other cases are handled
  by non-virtual `ReadSome()` and `CopySome()`, by delegating to `Read()` or
  `Copy()` with `max_length` limited to `available()`.

  This makes overrides simpler, at the cost of potentially losing optimizations
  to write data from multiple flat buffers in one call.

* Clean up comments about which `Reader` functions are implemented in terms of
  which ones. This includes all non-virtual functions instead of some of them.
  This can be important to determine which functions can be called by overrides
  to avoid infinite recursion.

* Move length overflow checks from `ReadSlow()` to `ReadAndAppend()`, and
  remove private `ReadSlowWithSizeCheck()` functions. Relevant usages have
  already been moved from `reader.h` to `reader.cc`, so there is no need to put
  them in separate functions.

* Move `move_cursor()` calls above `dest.Write()` if this allows calling
  `dest.Write()` in tail position.

* Cosmetic change in `riegeli::tensorflow::File{Reader,Writer}Base::OpenFile()`:
  allow NRVO by always returning the same local variable.
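
The strengthened precondition of `ReadSomeSlow()` can be sketched with a toy buffered reader. The classes below are hypothetical and simplified: `ToyBufferedReader` and `TwoPartReader` are invented names, the buffer is a plain `std::string`, and the fast path copies inline where the real code is described as delegating to `Read()`. The point is only that the non-virtual function handles every case with buffered data, so the override may assume `available() == 0`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical, simplified stand-in for a buffered Reader.
class ToyBufferedReader {
 public:
  virtual ~ToyBufferedReader() = default;

  std::size_t available() const { return buffer_.size() - cursor_; }

  // Non-virtual: serves buffered data itself, limited to available().
  std::size_t ReadSome(std::size_t max_length, char* dest) {
    if (max_length == 0) return 0;
    if (available() > 0) {
      const std::size_t length = std::min(max_length, available());
      std::copy_n(buffer_.data() + cursor_, length, dest);
      cursor_ += length;
      return length;
    }
    return ReadSomeSlow(max_length, dest);
  }

 protected:
  void set_buffer(std::string_view data) {
    buffer_ = std::string(data);
    cursor_ = 0;
  }

  // Precondition: available() == 0 and max_length > 0.
  virtual std::size_t ReadSomeSlow(std::size_t max_length, char* dest) = 0;

 private:
  std::string buffer_;
  std::size_t cursor_ = 0;
};

// Starts with some buffered data; the rest comes from the slow path.
class TwoPartReader : public ToyBufferedReader {
 public:
  TwoPartReader(std::string_view buffered, std::string_view rest)
      : rest_(rest) {
    set_buffer(buffered);
  }
  int slow_calls() const { return slow_calls_; }

 protected:
  std::size_t ReadSomeSlow(std::size_t max_length, char* dest) override {
    assert(available() == 0);  // No partially consumed buffer to handle here.
    ++slow_calls_;
    const std::size_t length = std::min(max_length, rest_.size());
    std::copy_n(rest_.data(), length, dest);
    rest_.erase(0, length);
    return length;
  }

 private:
  std::string rest_;
  int slow_calls_ = 0;
};
```

In this toy, a call made while the buffer is non-empty returns only the buffered bytes, even if more were requested and more data follows. That mirrors the stated cost of no longer combining data from multiple buffers in one call.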

PiperOrigin-RevId: 858222599
25 files changed
tree: 3afeef4d74ef0b55c59b21f6408552f59eedf70b
README.md

Riegeli

Riegeli/records is a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding.

See documentation.

Status

The Riegeli file format will only change in a backward-compatible way (i.e. future readers will understand current files, but current readers might not understand files using future features).

The Riegeli C++ API might change in incompatible ways.