Design Sketch for String Support (#82)

Add a design sketch for string support in Emboss.
diff --git a/doc/design_docs/strings.md b/doc/design_docs/strings.md
new file mode 100644
index 0000000..6f4867a
--- /dev/null
+++ b/doc/design_docs/strings.md
@@ -0,0 +1,390 @@
+# String Support for Emboss
+
+GitHub Issue [#28](https://github.com/google/emboss/issues/28)
+
+## Background
+
+It is somewhat common to embed short strings into binary structures; examples
+include serial numbers and firmware revisions, although in some cases even
+things like IP addresses are encoded as ASCII text embedded in a larger binary
+message.
+
+Historically, we have modeled such fields in Emboss by using `UInt:8[]`; that
+is, arrays of 8-bit uints.  This is more-or-less functional, but can be awkward
+for things like text format output, and provides no way to add assertions to
+string fields.
+
+String support is complicated by the fact that there are several common ways of
+delimiting strings:
+
+1.  Length determined by another field -- that is, the size of the string is
+    explicit.
+2.  The string is *terminated* by a specific byte value, usually `'\0'`.  In
+    this case, there may be additional "garbage" bytes after the terminator,
+    which should not be considered to be part of the string.
+3.  The string is *padded* by a specific byte value, usually 32 (`' '`).  In
+    this case, the "padding" character can usually occur inside the string,
+    and only trailing padding characters should be trimmed off.
+
+For both terminated and padded strings, some formats allow the string to run to
+the very end of its field, with no terminator/padding, and some require the
+terminator/padding.  In general, it seems that terminated strings are more
+likely to require the terminator, while padded strings can usually fill the
+field with no padding.
+
+There are, no doubt, other ways of delimiting strings.  These seem to be rare
+and sui generis, and can often be handled by modeling them as length-determined
+strings, then applying the necessary logic in code.
+
+There are also multiple *encodings* for strings, such as ASCII, ISO/IEC 8859-1
+("Latin-1"), UTF-8, UTF-16, etc.  UTF-16 seems to be rare outside of
+Windows-based software and Java.  Hardware almost always appears to use ASCII
+(encoded as one character per byte, with the high bit always clear), although
+Java ME-based systems may use UTF-16.
+
+
+## Proposal
+
+### Bytestrings Only
+
+All strings in Emboss should be considered to be opaque blobs of bytes;
+interpretation as ASCII, Latin-1, UTF-8, etc. should be left to the application.
+
+UTF-16 strings are explicitly not handled by this proposal.  In principle, one
+could add a "byte width" parameter to the string types, or use a prefix like `W`
+to indicate "wide string" types, but it does not seem important for now.  This
+decision can be revisited later.
+
+
+### New Built-In Types
+
+Add three new types to the Prelude (names subject to change):
+
+1.  `FixString`, a string whose contents should be the entire field containing
+    the `FixString`.  When writing to a `FixString`, the value must be exactly
+    the same length as the field.
+
+    `CouldWriteValue()` should return `true` for all strings that are exactly
+    the correct length.
+
+    `FixString` is very close to a notional `Blob` type or the current
+    `UInt:8[]` type, except for differences in text format.
+
+2.  `ZString`, a terminated string.  A `ZString` with no arguments uses a null
+    byte (`'\0'`) as the terminator.  An optional argument can be used to
+    specify the terminator -- a `ZString(36)`, for example, would be terminated
+    by `$`.  When reading, the value returned is all bytes up to, but not
+    including, the first terminator byte.  When writing, for compatibility, the
+    entire field should be written, using the terminator value for padding if
+    there is extra space.  A second optional parameter can be used to specify
+    that the terminator is not required: `ZString(0, false)` can fill the
+    underlying field with no terminator.
+
+    `CouldWriteValue()` should return `true` if the value is no longer than the
+    field and the value does not *contain* any instances of the terminator
+    byte.
+
+3.  `PaddedString`, a padded string.  A `PaddedString` with no arguments uses
+    space (`' '`, 32) as the padding value.  An optional argument can be used to
+    specify the padding -- a `PaddedString(0)`, for example, would be padded
+    with null bytes.  When reading, the end of the string is discovered by
+    walking *backwards* from the end of the field until a non-padding byte is
+    found, then returning all bytes from the start of the field through that
+    byte.  When writing, any excess bytes will be filled with the padding
+    value.
+
+    Although, technically, "at least one byte of padding" could be enforced by
+    making the `PaddedString` one byte shorter and following it with a one-byte
+    field whose value *must* be the padding byte, for convenience `PaddedString`
+    should take a second optional parameter to specify that padding *is*
+    required: `PaddedString(32, true)` must have at least one space at the end.
+
+    `CouldWriteValue()` should return `true` if the value is no longer than the
+    field and the value does not *end with* the padding byte.
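+
+The read rules for terminated and padded strings can be sketched as follows.
+This is a minimal illustration, not the proposed implementation: the helper
+names `ZStringLength` and `PaddedStringLength` are hypothetical, and the field
+is modeled as a `std::string` holding the raw bytes rather than as an Emboss
+view.
+
+```c++
+#include <cstddef>
+#include <string>
+
+// Hypothetical helper: logical length of a terminated string stored in
+// `field`.  Returns the offset of the first terminator byte, or the full
+// field size if no terminator is present.
+inline std::size_t ZStringLength(const std::string &field,
+                                 char terminator = '\0') {
+  const std::size_t pos = field.find(terminator);
+  return pos == std::string::npos ? field.size() : pos;
+}
+
+// Hypothetical helper: logical length of a padded string stored in `field`.
+// Walks backwards from the end of the field past every trailing padding
+// byte; padding bytes inside the string are kept.
+inline std::size_t PaddedStringLength(const std::string &field,
+                                      char padding = ' ') {
+  std::size_t length = field.size();
+  while (length > 0 && field[length - 1] == padding) --length;
+  return length;
+}
+```
+
+Bytes after the first terminator are ignored by `ZStringLength`, while
+`PaddedStringLength` only trims from the end, so `"a b  "` has a logical
+length of 3.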
+
+
+### String Constants
+
+String constants (used in constructs such as `[requires: this == "abcd"]`) may
+take two forms:
+
+1.  `"A quoted string using C-style escapes like \n"`
+
+    The following standard C89 escapes (as interpreted by an ASCII Unix
+    compiler) should be allowed:
+
+    *   `\0` => 0
+    *   `\a` => 7
+    *   `\b` => 8
+    *   `\t` => 9
+    *   `\n` => 10
+    *   `\v` => 11
+    *   `\f` => 12
+    *   `\r` => 13
+    *   `\"` => 34
+    *   `\'` => 39
+    *   `\?` => 63 (part of the C standard, but rarely used)
+    *   `\\` => 92
+    *   <code>\x*hh*</code> => 0x*hh*
+
+    The following non-C-standard escapes should be allowed:
+
+    *   `\e` => 27 (not actually standard, but common)
+    *   <code>\d*nnn*</code> => *nnn*
+    *   <code>\x{*hh*}</code> => 0x*hh*
+    *   <code>\d{*nnn*}</code> => *nnn*
+
+    Note that the standard C escape <code>\\*nnn*</code> is explicitly not
+    supported.  C treats *nnn* as octal, which is often surprising, and modern
+    languages (the cutoff date appears to be about 1993 -- right between Python
+    2 and Java) have largely dropped support for the octal escapes.
+
+    Based on a brief survey, only `\n`, `\t`, `\"`, `\\`, and `\'` appear to be
+    (nearly) universal among popular programming languages.  <code>\x*hh*</code>
+    is very common, though not universal.  <code>\u*nnnn*</code>, where *nnnn*
+    is a Unicode hex value to be encoded as UTF-8 or UTF-16, also appears to be
+    common, but only for text strings.
+
+    To avoid ambiguity, the un-braced <code>\x*hh*</code> escape should be
+    required to have exactly 2 hex digits, and the <code>\d*nnn*</code> escape
+    should be required to have exactly 3 decimal digits.  The braced versions --
+    <code>\x{*hh*}</code> and <code>\d{*nnn*}</code> -- could have any number of
+    digits, but should be required to evaluate to a value in the range 0 to 255:
+    that is, `\d{000000100}` should be allowed, but `\d{256}` should not.
+
+    `\` characters should not be allowed outside of the escape sequences
+    specified here.
+
+    For now, only 7-bit ASCII printable characters (byte values 32 through 126)
+    should be allowed in `"quoted strings"`, even though `.emb` files generally
+    allow UTF-8.  This requirement may be relaxed in the future.
+
+2.  A list of bytes in `{}`, where each byte is either a single-quoted character
+    (`'a'`) or a numeric constant (e.g., `0x20` or `32`).
+
+    For ease of transition from existing `UInt:8[]` fields, explicit index
+    markers (`[8]:`) in the list should be allowed if the index exactly matches
+    the current cursor index; this matches output from the current Emboss text
+    format for `UInt:8[]`.
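+
+The escape rules for `"quoted strings"` could be decoded roughly as shown
+here.  This is a sketch under stated assumptions, not the Emboss front end
+(which is not written in C++): the names `DigitValue`, `ParseNumber`, and
+`DecodeQuoted` are hypothetical, the input is the string body without its
+surrounding quotes, and braced escapes are assumed to need at least one digit.
+
+```c++
+#include <cstddef>
+#include <string>
+
+// Returns the value of a hex digit, or -1 if `c` is not one.
+inline int DigitValue(char c) {
+  if (c >= '0' && c <= '9') return c - '0';
+  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
+  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
+  return -1;
+}
+
+// Parses the digits of a \x or \d escape starting at text[*pos].  Un-braced
+// forms take exactly `fixed_digits` digits; braced forms take any number of
+// digits (at least one is assumed here).  Returns the byte value, or -1 if
+// the escape is malformed or evaluates to more than 255.
+inline int ParseNumber(const std::string &text, std::size_t *pos, int base,
+                       int fixed_digits) {
+  const bool braced = *pos < text.size() && text[*pos] == '{';
+  if (braced) ++*pos;
+  int value = 0;
+  int digits = 0;
+  while (*pos < text.size() && (braced || digits < fixed_digits)) {
+    const int digit = DigitValue(text[*pos]);
+    if (digit < 0 || digit >= base) break;
+    value = value * base + digit;
+    if (value > 255) return -1;
+    ++digits;
+    ++*pos;
+  }
+  if (braced) {
+    if (digits == 0 || *pos >= text.size() || text[*pos] != '}') return -1;
+    ++*pos;
+    return value;
+  }
+  return digits == fixed_digits ? value : -1;
+}
+
+// Decodes a quoted-string body into raw bytes.  Returns false on any
+// non-printable input byte, unknown escape, or out-of-range value.
+inline bool DecodeQuoted(const std::string &body, std::string *out) {
+  static const std::string kNames = "0abtnvfr\"'?\\e";
+  static const char kValues[] = {0,  7,  8,  9,  10, 11, 12,
+                                 13, 34, 39, 63, 92, 27};
+  out->clear();
+  std::size_t pos = 0;
+  while (pos < body.size()) {
+    const char c = body[pos++];
+    if (c < 32 || c > 126) return false;  // Printable 7-bit ASCII only.
+    if (c != '\\') {
+      out->push_back(c);
+      continue;
+    }
+    if (pos >= body.size()) return false;
+    const char escape = body[pos++];
+    if (escape == 'x' || escape == 'd') {
+      const int value = escape == 'x' ? ParseNumber(body, &pos, 16, 2)
+                                      : ParseNumber(body, &pos, 10, 3);
+      if (value < 0) return false;
+      out->push_back(static_cast<char>(value));
+      continue;
+    }
+    const std::size_t index = kNames.find(escape);
+    if (index == std::string::npos) return false;  // Includes octal \nnn.
+    out->push_back(kValues[index]);
+  }
+  return true;
+}
+```
+
+Because the un-braced forms consume a fixed number of digits, `\x41g` decodes
+unambiguously as `A` followed by `g`, and the C-style octal escape `\101` is
+rejected rather than silently reinterpreted.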
+
+The existing parameter system will need to be extended to allow default values,
+and to allow `external` types to accept parameters if they do not already.
+
+
+### String Field Methods (C++)
+
+#### C++ String Type Parameterization
+
+All methods that accept or return a string value should be templated on the C++
+type to use (`std::string`, `std::string_view`, `char *`, etc.).
+
+For methods that accept a string parameter (`Write`, etc.), the template
+argument should be inferred, and they can be called without specifying the type.
+
+For methods that only return a string value (`Read`, etc.), the template
+argument would need to be specified: `Read<std::string_view>()`.
+
+`char *` should not be accepted as a return type, due to problems with ensuring
+that there is actually a null byte at the end of the string.
+
+As an input type, `char *` is likely to need explicit specialization.
+
+In many (most? all?) cases, methods should have no problem with some types that
+are not really "string" types, such as `std::vector<char>`.
+
+String types that use `signed char` or `unsigned char` instead of `char` (e.g.,
+`std::basic_string<unsigned char>`) should be explicitly supported.
+
+If the `BackingStorage` is not `ContiguousBuffer` (or some equivalent), it seems
+that it might be easy to hit undefined behavior with something like
+`Read<std::string_view>()`, since the iterator type returned by `begin()` and
+`end()` would not correctly model `std::contiguous_iterator`.  The cautious
+approach would be to disable `Read()` and `UncheckedRead()` if the backing
+storage is not `ContiguousBuffer`; readout to something like `std::string` could
+still be explicitly performed using the `begin()`/`end()` iterators.
+Alternately, for non-`ContiguousBuffer` backing storage, `Read()` could be
+explicitly limited to a small set of known-good types, such as `std::string` and
+`std::vector<char>`.
+
+
+#### Methods
+
+`Read()`, `UncheckedRead()`, `Write()`, and `UncheckedWrite()` should be defined
+as one would expect.
+
+`ToString()` should be an alias for `Read()`, to ease conversion from
+`UInt:8[]`.
+
+`CouldWriteValue()` should be defined as specified in the previous section.
+
+`Ok()` should return `true` if the string has storage (though it could be
+zero-length storage) and the bytes match the requirements (e.g., if a terminator
+or padding byte is required, `Ok()` should only return `true` if such a byte is
+present).
+
+`Size()` should return the (logical) length of the string in bytes.
+
+`MaxSize()` should return `BackingStorage().SizeInBytes()` or
+`BackingStorage().SizeInBytes() - 1` if the string requires a padding or
+terminator byte.
+
+`begin()`, `end()`, `rbegin()`, `rend()` should be defined as expected for a
+C++ container type.
+
+`operator[]` should return the value of a single byte at the specified offset.
+
+
+#### `emboss::String` Type
+
+(This section should not be considered particularly authoritative; the actual
+implementation could differ greatly if another strategy turns out to be
+easier or less complex in practice.)
+
+Because values retrieved from the different string types can be used
+interchangeably at the expression layer (e.g., `let s = condition ? z_string :
+fix_string`), there must be a way for all views over strings to return a common
+type.  This is complicated by two requirements:
+
+1.  `emboss::String` should not allocate memory.
+2.  `emboss::String` needs to handle backing storage that is not
+    `ContiguousBuffer`.  It also needs to handle constant strings (`let x =
+    "string"`), and be able to assign `Storage`-based strings to constant
+    strings and vice versa.
+
+To satisfy the first requirement, `emboss::String` will need to hold a reference
+to the underlying storage, not actually copy bytes.
+
+One way to satisfy the second requirement would be to simply copy the string's
+bytes out to a new buffer, but that conflicts with the first requirement.
+Instead, it should be a sum type over a `Storage` type parameter and a constant
+string, like:
+
+```c++
+template <typename Storage>
+class String {
+ public:
+  String();
+  String(const char *data, int size);
+  String(Storage);
+  // ... operator= ...
+  constexpr int size() const;
+  constexpr char operator[](int index) const {
+    return storage_.Index() == 0 ? backports::Get<0>(storage_)[index]
+                                 : backports::Get<1>(storage_).data()[index];
+  }
+  // ... begin(), end(), etc. ...
+
+ private:
+  // TODO: replace backports::Variant with std::variant in 2027, when Emboss
+  // requires C++17.
+  backports::Variant<const char *, Storage> storage_;
+};
+```
+
+At least for now, `emboss::String` does not need to be exposed as a documented,
+supported API -- user code can use `Read<std::string_view>()` and similar
+operations as needed, with full knowledge of the underlying storage type.
+
+Comparisons and assignments between `emboss::String`s with different `Storage`
+type parameters do not need to be supported, since they cannot be generated by
+the code generator -- C++ codegen would only need those operations for
+`emboss::String`s that are derived from the same parent structure.
+
+
+### Handling in Other Languages
+
+C++ is unusual in that it does not differentiate at a language level between
+text strings and byte strings.  Most other languages have different types for
+byte strings and text strings.
+
+For all languages that differentiate, Emboss strings should be treated as byte
+strings or byte arrays (Python 3 `bytes`, Rust `Vec<u8>`, Proto `bytes`, etc.).
+
+Other than this caveat, Emboss string support should be straightforward in other
+languages.
+
+
+### Text Format
+
+Text format output should use the `"quoted string"` style.  Byte values outside
+the range 32 through 126 should be emitted as escapes.  Values with standard
+shorthand escapes (10 => `'\n'`, 0 => `'\0'`, etc.) should be emitted as such.
+For other values, hex escapes with exactly two digits (e.g., `\x06`, not `\x6`)
+should be emitted.  It may be desirable to allow some `[text_format]` control
+over the output in the future.
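+
+A minimal sketch of the output side, assuming a hypothetical helper named
+`ToQuotedText` that takes the raw bytes as a `std::string`.  Two details are
+assumptions beyond the text above: `"` and `\` are escaped so that the output
+can be read back, and byte 27 is emitted as `\x1b` rather than the
+non-standard `\e`.  Lowercase hex digits are also an arbitrary choice.
+
+```c++
+#include <string>
+
+// Hypothetical helper: renders raw bytes in the "quoted string" style.
+inline std::string ToQuotedText(const std::string &bytes) {
+  static const char kHex[] = "0123456789abcdef";
+  std::string out = "\"";
+  for (const char c : bytes) {
+    const unsigned char byte = static_cast<unsigned char>(c);
+    switch (byte) {
+      case 0:  out += "\\0";  break;
+      case 7:  out += "\\a";  break;
+      case 8:  out += "\\b";  break;
+      case 9:  out += "\\t";  break;
+      case 10: out += "\\n";  break;
+      case 11: out += "\\v";  break;
+      case 12: out += "\\f";  break;
+      case 13: out += "\\r";  break;
+      case 34: out += "\\\""; break;  // Assumption: needed for round trips.
+      case 92: out += "\\\\"; break;  // Assumption: needed for round trips.
+      default:
+        if (byte >= 32 && byte <= 126) {
+          out.push_back(c);  // Printable ASCII is emitted as-is.
+        } else {
+          out += "\\x";  // Always exactly two hex digits.
+          out.push_back(kHex[byte >> 4]);
+          out.push_back(kHex[byte & 15]);
+        }
+    }
+  }
+  out += "\"";
+  return out;
+}
+```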
+
+Text format input should allow both `"quoted string"` and list-of-bytes styles,
+with exactly the same rules as string constants in an `.emb` file, except that
+bytes > 126 might be allowed in a `"quoted string"`.
+
+
+### Expressions
+
+#### Type System Changes
+
+In order to facilitate `[requires]` on string types, the new types should have a
+new 'string' expression type.
+
+
+#### Runtime Representation
+
+In this proposal, no string manipulations are allowed, so temporary strings
+(which might require memory allocation) will not be necessary.
+
+
+#### String Attribute Representation
+
+Attribute values are currently represented by a special `AttributeValue` type
+which can hold either an `Expression` or a `String`.  With a string expression
+type, `AttributeValue` can be replaced by a plain `Expression`.  This will
+require changes to everything that touches `AttributeValue`.
+
+Alternately, `AttributeValue` could be left in the IR with only `Expression`,
+in which case only code that touches string attributes (`[byte_order]` and
+`[(cpp) namespace]`) needs to change.
+
+
+#### String Comparisons
+
+Comparison operations (`==`, `<`, `>`, `>=`, `<=`, `!=`) should be allowed,
+since these can be handled by passing references to existing memory.
+
+Equality and inequality (`==` and `!=`) should be defined in the expected way:
+two strings are equal iff they are the same length and the corresponding bytes
+in each string have the same value, and they are unequal if they are not equal.
+
+For ordering, strings should be compared lexicographically, using the binary
+value of each byte, with no regard for semantic collation.  That is,
+`"Z" < "a"`, since `'Z'` is 90 and `'a'` is 97.
+
+When one string is a strict prefix of another string, the shorter string should
+be "less than" the longer; e.g., `"abc" < "abcdef"`.  This is the same as the
+natural ordering for zero-terminated strings.
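+
+The ordering described above matches what `std::lexicographical_compare`
+produces when bytes are compared as unsigned values.  The following is only an
+illustration over `std::string`; the helper name `BytewiseLess` is
+hypothetical, and generated code would compare views rather than copies.
+
+```c++
+#include <algorithm>
+#include <string>
+
+// Hypothetical helper: returns true if `a` orders before `b` when compared
+// byte by byte.  The casts make bytes above 127 sort after ASCII even on
+// platforms where `char` is signed.
+inline bool BytewiseLess(const std::string &a, const std::string &b) {
+  return std::lexicographical_compare(
+      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
+        return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
+      });
+}
+```
+
+With this definition `BytewiseLess("Z", "a")` and
+`BytewiseLess("abc", "abcdef")` both hold, and no memory is allocated by the
+comparison itself.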
+
+
+#### Future String Operations
+
+It may be desirable, at some future point, to allow various string
+manipulations, such as concatenation or repetition, at least for compile-time
+strings.
+
+A substring operation should be possible without requiring memory allocation.
+
+Indexing into a string (`str[offset]`) should be allowed if/when indexing into
+an array is finally supported.
+
+
+### Arrays of Strings
+
+In some cases, it may be desirable to have an array of strings, like:
+
+```
+struct Foo:
+  0 [+100]  ZString[10]  list
+```
+
+Although somewhat awkward, the existing explicit-length syntax should work:
+
+```
+struct Foo:
+  0 [+100]  ZString:80[10]  list  # 10 10-byte (80-bit) strings
+```