Design Sketch for String Support (#82)
Add a design sketch for string support in Emboss.
diff --git a/doc/design_docs/strings.md b/doc/design_docs/strings.md
new file mode 100644
index 0000000..6f4867a
--- /dev/null
+++ b/doc/design_docs/strings.md
@@ -0,0 +1,390 @@
+# String Support for Emboss
+
+GitHub Issue [#28](https://github.com/google/emboss/issues/28)
+
+## Background
+
+It is somewhat common to embed short strings into binary structures; examples
+include serial numbers and firmware revisions, although in some cases even
+things like IP addresses are encoded as ASCII text embedded in a larger binary
+message.
+
+Historically, we have modeled such fields in Emboss by using `UInt:8[]`; that
+is, arrays of 8-bit uints. This is more-or-less functional, but can be awkward
+for things like text format output, and provides no way to add assertions to
+string fields.
+
+String support is complicated by the fact that there are several common ways of
+delimiting strings:
+
+1. Length determined by another field -- that is, the size of the string is
+ explicit.
+2. The string is *terminated* by a specific byte value, usually `'\0'`. In
+ this case, there may be additional "garbage" bytes after the terminator,
+ which should not be considered to be part of the string.
+3. The string is *padded* by a specific byte value, usually 32 (`' '`). In
+ this case, the "padding" character can usually occur inside the string,
+ and only trailing padding characters should be trimmed off.
+
+For both terminated and padded strings, some formats allow the string to run to
+the very end of its field, with no terminator/padding, and some require the
+terminator/padding. In general, it seems that terminated strings are more
+likely to require the terminator, while padded strings can usually be entered
+with no padding.
+
+There are, no doubt, other ways of delimiting strings. These seem to be rare
+and sui generis, and can often be handled by modeling them as length-determined
+strings, then applying the necessary logic in code.
+
+There are also multiple *encodings* for strings, such as ASCII, ISO/IEC 8859-1
+("Latin-1"), UTF-8, UTF-16, etc. UTF-16 seems to be rare outside of
+Windows-based software and Java. Hardware almost always appears to use ASCII
+(encoded as one character per byte, with the high bit always clear), although
+Java ME-based systems may use UTF-16.
+
+
+## Proposal
+
+### Bytestrings Only
+
+All strings in Emboss should be considered to be opaque blobs of bytes;
+interpretation as ASCII, Latin-1, UTF-8, etc. should be left to the application.
+
+UTF-16 strings are explicitly not handled by this proposal. In principle, one
+could add a "byte width" parameter to the string types, or use a prefix like `W`
+to indicate "wide string" types, but it does not seem important for now. This
+decision can be revisited later.
+
+
+### New Built-In Types
+
+Add three new types to the Prelude (names subject to change):
+
+1. `FixString`, a string whose contents should be the entire field containing
+ the `FixString`. When writing to a `FixString`, the value must be exactly
+ the same length as the field.
+
+ `CouldWriteValue()` should return `true` for all strings that are exactly
+ the correct length.
+
+ `FixString` is very close to a notional `Blob` type or the current
+ `UInt:8[]` type, except for differences in text format.
+
+2. `ZString`, a terminated string. A `ZString` with no arguments uses a null
+ byte (`'\0'`) as the terminator. An optional argument can be used to
+ specify the terminator -- a `ZString(36)`, for example, would be terminated
+ by `$`. When reading, the value returned is all bytes up to, but not
+ including, the first terminator byte. When writing, for compatibility, the
+ entire field should be written, using the terminator value for padding if
+ there is extra space. A second optional parameter can be used to specify
+ that the terminator is not required: `ZString(0, false)` can fill the
+ underlying field with no terminator.
+
+ `CouldWriteValue()` should return `true` if the value is no longer than the
+ field and the value does not *contain* any instances of the terminator
+ byte.
+
+3. `PaddedString`, a padded string. A `PaddedString` with no arguments uses
+ space (`' '`, 32) as the padding value. An optional argument can be used to
+ specify the padding -- a `PaddedString(0)`, for example, would be padded
+ with null bytes. When reading, the end of the string is discovered by
+ walking *backwards* from the end until a non-padding byte is found, then
+ returning all bytes from the start of the field through that byte. When
+ writing, any excess bytes will be filled with the padding value.
+
+ Although, technically, "at least one byte of padding" could be enforced by
+ making the `PaddedString` one byte shorter and following it with a one-byte
+ field whose value *must* be the padding byte, for convenience `PaddedString`
+ should take a second optional parameter to specify that the padding *is*
+ required: `PaddedString(32, true)` must have at least one space at the end.
+
+ `CouldWriteValue()` should return `true` if the value is no longer than the
+ field and the value does not *end with* the padding byte.
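+
+The read rules for `ZString` and `PaddedString` above can be sketched as two
+small helpers. This is only an illustration under stated assumptions: the
+names `TerminatedLength` and `PaddedLength` are hypothetical, and the real
+views would operate on Emboss backing storage rather than raw pointers.
+
+```c++
+#include <cstddef>
+
+// Hypothetical helper: logical length of a terminated string. Returns `size`
+// when no terminator is found; the caller decides whether that is an error
+// (terminator required) or a string that fills the whole field.
+inline std::size_t TerminatedLength(const unsigned char *data,
+                                    std::size_t size,
+                                    unsigned char terminator = 0) {
+  for (std::size_t i = 0; i < size; ++i) {
+    if (data[i] == terminator) return i;
+  }
+  return size;
+}
+
+// Hypothetical helper: logical length of a padded string, found by walking
+// backwards past trailing padding bytes. Padding bytes inside the string are
+// kept.
+inline std::size_t PaddedLength(const unsigned char *data, std::size_t size,
+                                unsigned char padding = ' ') {
+  while (size > 0 && data[size - 1] == padding) --size;
+  return size;
+}
+```
+
+Note the asymmetry: the terminated scan stops at the *first* terminator and
+ignores everything after it, while the padded scan only trims from the end.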
+
+
+### String Constants
+
+String constants (used in constructs such as `[requires: this == "abcd"]`) may
+take two forms:
+
+1. `"A quoted string using C-style escapes like \n"`
+
+ The standard C89 escapes (as interpreted by an ASCII Unix compiler) should
+ be allowed:
+
+ * `\0` => 0
+ * `\a` => 7
+ * `\b` => 8
+ * `\t` => 9
+ * `\n` => 10
+ * `\v` => 11
+ * `\f` => 12
+ * `\r` => 13
+ * `\"` => 34
+ * `\'` => 39
+ * `\?` => 63 (part of the C standard, but rarely used)
+ * `\\` => 92
+ * <code>\x*hh*</code> => 0x*hh*
+
+ The following non-C-standard escapes should be allowed:
+
+ * `\e` => 27 (not actually standard, but common)
+ * <code>\d*nnn*</code> => *nnn*
+ * <code>\x{*hh*}</code> => 0x*hh*
+ * <code>\d{*nnn*}</code> => *nnn*
+
+ Note that the standard C escape <code>\\*nnn*</code> is explicitly not
+ supported. C treats *nnn* as octal, which is often surprising, and modern
+ languages (the cutoff date appears to be about 1993 -- right between Python
+ 2 and Java) have largely dropped support for the octal escapes.
+
+ Based on a brief survey, only `\n`, `\t`, `\"`, `\\`, and `\'` appear to be
+ (nearly) universal among popular programming languages. <code>\x*hh*</code>
+ is very common, though not universal. <code>\u*nnnn*</code>, where *nnnn*
+ is a Unicode hex value to be encoded as UTF-8 or UTF-16, also appears to be
+ common, but only for text strings.
+
+ To avoid ambiguity, the un-braced <code>\x*hh*</code> escape should be
+ required to have exactly 2 hex digits, and the <code>\d*nnn*</code> escape
+ should be required to have exactly 3 decimal digits. The braced versions --
+ <code>\x{*hh*}</code> and <code>\d{*nnn*}</code> -- could have any number of
+ digits, but should be required to evaluate to a value in the range 0 to 255:
+ that is, `\d{000000100}` should be allowed, but `\d{256}` should not.
+
+ `\` characters should not be allowed outside of the escape sequences
+ specified here.
+
+ For now, only 7-bit ASCII printable characters (byte values 32 through 126)
+ should be allowed in `"quoted strings"`, even though `.emb` files generally
+ allow UTF-8. This requirement may be relaxed in the future.
+
+2. A list of bytes in `{}`, where each byte is either a single-quoted character
+ (`'a'`) or a numeric constant (e.g., `0x20` or `32`).
+
+ For ease of transition from existing `UInt:8[]` fields, explicit index
+ markers (`[8]:`) in the list should be allowed if the index exactly matches
+ the current cursor index; this matches output from the current Emboss text
+ format for `UInt:8[]`.
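+
+As a rough illustration of the quoted-string escape rules listed above, a
+decoder might look like the following. This is a sketch, not the Emboss
+implementation: `ParseByte` and `DecodeQuoted` are hypothetical names, and
+the input is assumed to be the text between the quotes.
+
+```c++
+#include <cstddef>
+#include <string>
+
+// Parses a byte value in `base` (10 or 16) starting at text[*pos]. A braced
+// form "{digits}" may have any number of digits; the un-braced form must
+// have exactly `count` digits. Values above 255 are rejected.
+inline bool ParseByte(const std::string &text, std::size_t *pos, int base,
+                      std::size_t count, unsigned char *out) {
+  const bool braced = *pos < text.size() && text[*pos] == '{';
+  if (braced) ++*pos;
+  unsigned value = 0;
+  std::size_t digits = 0;
+  while (*pos < text.size() && (braced || digits < count)) {
+    const char c = text[*pos];
+    int d = 0;
+    if (c >= '0' && c <= '9') {
+      d = c - '0';
+    } else if (base == 16 && c >= 'a' && c <= 'f') {
+      d = c - 'a' + 10;
+    } else if (base == 16 && c >= 'A' && c <= 'F') {
+      d = c - 'A' + 10;
+    } else {
+      break;
+    }
+    value = value * base + d;
+    if (value > 255) return false;
+    ++digits;
+    ++*pos;
+  }
+  if (braced) {
+    if (digits == 0 || *pos >= text.size() || text[*pos] != '}') return false;
+    ++*pos;
+  } else if (digits != count) {
+    return false;
+  }
+  *out = static_cast<unsigned char>(value);
+  return true;
+}
+
+// Decodes the body of a quoted string into raw bytes. Returns false for
+// non-printable input, stray backslashes, or unsupported escapes.
+inline bool DecodeQuoted(const std::string &text, std::string *out) {
+  out->clear();
+  std::size_t pos = 0;
+  while (pos < text.size()) {
+    const char c = text[pos++];
+    if (c < 32 || c > 126 || c == '"') return false;
+    if (c != '\\') {
+      out->push_back(c);
+      continue;
+    }
+    if (pos >= text.size()) return false;
+    unsigned char byte = 0;
+    switch (text[pos++]) {
+      case '0': byte = 0; break;
+      case 'a': byte = 7; break;
+      case 'b': byte = 8; break;
+      case 't': byte = 9; break;
+      case 'n': byte = 10; break;
+      case 'v': byte = 11; break;
+      case 'f': byte = 12; break;
+      case 'r': byte = 13; break;
+      case 'e': byte = 27; break;
+      case '"': byte = 34; break;
+      case '\'': byte = 39; break;
+      case '?': byte = 63; break;
+      case '\\': byte = 92; break;
+      case 'x':
+        if (!ParseByte(text, &pos, 16, 2, &byte)) return false;
+        break;
+      case 'd':
+        if (!ParseByte(text, &pos, 10, 3, &byte)) return false;
+        break;
+      default: return false;  // Includes the unsupported octal \nnn form.
+    }
+    out->push_back(static_cast<char>(byte));
+  }
+  return true;
+}
+```
+
+With this sketch, `\d{000000100}` decodes to byte 100, while `\d{256}`,
+`\x4`, and `\101` are all rejected.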
+
+The existing parameter system will need to be extended to allow default values,
+and to allow `external` types to accept parameters if they do not already.
+
+
+### String Field Methods (C++)
+
+#### C++ String Type Parameterization
+
+All methods that accept or return a string value should be templated on the C++
+type to use (`std::string`, `std::string_view`, `char *`, etc.).
+
+For methods that accept a string parameter (`Write`, etc.), the template
+argument should be inferred, and they can be called without specifying the type.
+
+For methods that only return a string value (`Read`, etc.), the template
+argument would need to be specified: `Read<std::string_view>()`.
+
+`char *` should not be accepted as a return type, due to problems with ensuring
+that there is actually a null byte at the end of the string.
+
+As an input type, `char *` is likely to need explicit specialization.
+
+In many (most? all?) cases, methods should have no problem with some types that
+are not really "string" types, such as `std::vector<char>`.
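+
+A minimal sketch of the templating described above, assuming a contiguous
+buffer and a fixed-length string. The class name `FixStringViewSketch` is
+hypothetical and is not the generated Emboss view; it only shows how the
+return type must be named for `Read` while the argument type is inferred for
+`Write`.
+
+```c++
+#include <algorithm>
+#include <cstddef>
+#include <string>
+#include <vector>
+
+class FixStringViewSketch {
+ public:
+  FixStringViewSketch(char *data, std::size_t size)
+      : data_(data), size_(size) {}
+
+  // The return type cannot be inferred, so callers write
+  // Read<std::string>(). Any type constructible from an iterator pair works.
+  template <typename T>
+  T Read() const {
+    return T(data_, data_ + size_);
+  }
+
+  // The argument type is inferred. A fixed-length string only accepts
+  // values of exactly the field size.
+  template <typename T>
+  bool Write(const T &value) {
+    if (value.size() != size_) return false;
+    std::copy(value.begin(), value.end(), data_);
+    return true;
+  }
+
+ private:
+  char *data_;
+  std::size_t size_;
+};
+```
+
+Because `Read` only relies on an iterator-pair constructor, it works with
+`std::string` and `std::vector<char>` alike.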
+
+String types that use `signed char` or `unsigned char` instead of `char` (e.g.,
+`std::basic_string<unsigned char>`) should be explicitly supported.
+
+If the `BackingStorage` is not `ContiguousBuffer` (or some equivalent), it seems
+that it might be easy to hit undefined behavior with something like
+`Read<std::string_view>()`, since the iterator type returned by `begin()` and
+`end()` would not correctly model `std::contiguous_iterator`. The cautious
+approach would be to disable `Read()` and `UncheckedRead()` if the backing
+storage is not `ContiguousBuffer`; readout to something like `std::string` could
+still be explicitly performed using the `begin()`/`end()` iterators.
+Alternately, for non-`ContiguousBuffer` backing storage, `Read()` could be
+explicitly limited to a small set of known-good types, such as `std::string` and
+`std::vector<char>`.
+
+
+#### Methods
+
+`Read()`, `UncheckedRead()`, `Write()`, and `UncheckedWrite()` should be defined
+as one would expect.
+
+`ToString()` should be an alias for `Read()`, to ease conversion from
+`UInt:8[]`.
+
+`CouldWriteValue()` should be defined as specified in the previous section.
+
+`Ok()` should return `true` if the string has storage (though it could be
+zero-length storage) and the bytes match the requirements (e.g., if a terminator
+or padding byte is required, `Ok()` should only return `true` if such a byte is
+present).
+
+`Size()` should return the (logical) length of the string in bytes.
+
+`MaxSize()` should return `BackingStorage().SizeInBytes()` or
+`BackingStorage().SizeInBytes() - 1` if the string requires a padding or
+terminator byte.
+
+`begin()`, `end()`, `rbegin()`, `rend()` should be defined as expected for a
+C++ container type.
+
+`operator[]` should return the value of a single byte at the specified offset.
+
+
+#### `emboss::String` Type
+
+(This section should not be considered particularly authoritative; the actual
+implementation could differ greatly if another strategy turns out to be
+easier or less complex in practice.)
+
+Because values retrieved from the different string types can be used
+interchangeably at the expression layer (e.g., `let s = condition ? z_string :
+fix_string`), there must be a way for all views over strings to return a common
+type. This is complicated by two requirements:
+
+1. `emboss::String` should not allocate memory.
+2. `emboss::String` needs to handle backing storage that is not
+ `ContiguousBuffer`. It also needs to handle constant strings (`let x =
+ "string"`), and be able to assign `Storage`-based strings to constant
+ strings and vice versa.
+
+To satisfy the first requirement, `emboss::String` will need to hold a reference
+to the underlying storage, not actually copy bytes.
+
+One way to satisfy the second requirement would be to simply copy the string's
+bytes out to a new buffer, but that conflicts with the first requirement.
+Instead, it should be a sum type over a `Storage` type parameter and a constant
+string, like:
+
+```c++
+template <typename Storage>
+class String {
+ public:
+  String();
+  // Wraps a constant string of `size` bytes.
+  String(const char *data, int size);
+  // Wraps a view over backing storage without copying bytes.
+  String(Storage);
+  // ... operator= ...
+  constexpr int size() const;
+  constexpr char operator[](int index) const {
+    // Alternative 0 is the constant string; alternative 1 is `Storage`.
+    return storage_.Index() == 0 ? backports::Get<0>(storage_)[index]
+                                 : backports::Get<1>(storage_).data()[index];
+  }
+  // ... begin(), end(), etc. ...
+
+ private:
+  // TODO: replace backports::Variant with std::variant in 2027, when Emboss
+  // requires C++17.
+  backports::Variant<const char *, Storage> storage_;
+};
+```
+
+At least for now, `emboss::String` does not need to be exposed as a documented,
+supported API -- user code can use `Read<std::string_view>()` and similar
+operations as needed, with full knowledge of the underlying storage type.
+
+Comparisons and assignments between `emboss::String`s with different `Storage`
+type parameters do not need to be supported, since they cannot be generated by
+the code generator -- C++ codegen would only need those operations for
+`emboss::String`s that are derived from the same parent structure.
+
+
+### Handling in Other Languages
+
+C++ is unusual in that it does not differentiate at a language level between
+text strings and byte strings. Most other languages have different types for
+byte strings and text strings.
+
+For all languages that differentiate, Emboss strings should be treated as byte
+strings or byte arrays (Python 3 `bytes`, Rust `Vec<u8>`, Proto `bytes`, etc.).
+
+Other than this caveat, Emboss string support should be straightforward in other
+languages.
+
+
+### Text Format
+
+Text format output should use the `"quoted string"` style. Byte values outside
+the range 32 through 126 should be emitted as escapes. Values with standard
+shorthand escapes (10 => `'\n'`, 0 => `'\0'`, etc.) should be emitted as such.
+For other values, hex escapes with exactly two digits (e.g., `\x06`, not `\x6`)
+should be emitted. It may be desirable to allow some `[text_format]` control
+over the output in the future.
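+
+The output rules above could be sketched as follows. `FormatQuoted` is a
+hypothetical name, and two details are assumptions rather than part of the
+proposal: hex digits are emitted in lowercase, and byte 27 is emitted as
+`\x1b` rather than the non-standard `\e`.
+
+```c++
+#include <string>
+
+// Formats raw bytes as a "quoted string" for text format output.
+inline std::string FormatQuoted(const std::string &bytes) {
+  static const char kHex[] = "0123456789abcdef";
+  std::string out = "\"";
+  for (char ch : bytes) {
+    const unsigned char b = static_cast<unsigned char>(ch);
+    switch (b) {
+      case 0: out += "\\0"; break;
+      case 7: out += "\\a"; break;
+      case 8: out += "\\b"; break;
+      case 9: out += "\\t"; break;
+      case 10: out += "\\n"; break;
+      case 11: out += "\\v"; break;
+      case 12: out += "\\f"; break;
+      case 13: out += "\\r"; break;
+      case 34: out += "\\\""; break;
+      case 92: out += "\\\\"; break;
+      default:
+        if (b >= 32 && b <= 126) {
+          out.push_back(ch);  // Printable ASCII is emitted as-is.
+        } else {
+          out += "\\x";  // Always exactly two hex digits.
+          out.push_back(kHex[b >> 4]);
+          out.push_back(kHex[b & 15]);
+        }
+    }
+  }
+  out.push_back('"');
+  return out;
+}
+```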
+
+Text format input should allow both `"quoted string"` and list-of-bytes styles,
+with exactly the same rules as string constants in an `.emb` file, except that
+bytes > 126 might be allowed in a `"quoted string"`.
+
+
+### Expressions
+
+#### Type System Changes
+
+In order to facilitate `[requires]` on string types, the new types should have a
+new 'string' expression type.
+
+
+#### Runtime Representation
+
+In this proposal, no string manipulations are allowed, so temporary strings
+(which might require memory allocation) will not be necessary.
+
+
+#### String Attribute Representation
+
+Attribute values are currently represented by a special `AttributeValue` type
+which can hold either an `Expression` or a `String`. With a string expression
+type, `AttributeValue` can be replaced by a plain `Expression`. This will
+require changes to everything that touches `AttributeValue`.
+
+Alternately, `AttributeValue` could be left in the IR with only `Expression`,
+in which case only code that touches string attributes (`[byte_order]` and
+`[(cpp) namespace]`) needs to change.
+
+
+#### String Comparisons
+
+Comparison operations (`==`, `<`, `>`, `>=`, `<=`, `!=`) should be allowed,
+since these can be handled by passing references to existing memory.
+
+Equality and inequality (`==` and `!=`) should be defined in the expected way:
+two strings are equal iff they are the same length and the corresponding bytes
+in each string have the same value, and they are unequal if they are not equal.
+
+For ordering, strings should be compared lexicographically, using the binary
+value of each byte, with no regard for semantic collation. That is,
+`"Z" < "a"`, since `'Z'` is 90 and `'a'` is 97.
+
+When one string is a strict prefix of another string, the shorter string should
+be "less than" the longer; e.g., `"abc" < "abcdef"`. This is the same as the
+natural ordering for zero-terminated strings.
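+
+The comparison rules above amount to a byte-wise lexicographic compare with a
+length tie-break. A minimal sketch follows; the name `CompareBytes` is
+hypothetical. Bytes are compared as `unsigned char` because plain `char` may
+be signed, which would otherwise order bytes 128 and above before ASCII.
+
+```c++
+#include <cstddef>
+
+// Returns a negative value if a < b, zero if a == b, and a positive value if
+// a > b, comparing by unsigned byte value and then by length.
+inline int CompareBytes(const char *a, std::size_t a_size, const char *b,
+                        std::size_t b_size) {
+  const std::size_t n = a_size < b_size ? a_size : b_size;
+  for (std::size_t i = 0; i < n; ++i) {
+    const unsigned char x = static_cast<unsigned char>(a[i]);
+    const unsigned char y = static_cast<unsigned char>(b[i]);
+    if (x != y) return x < y ? -1 : 1;
+  }
+  if (a_size == b_size) return 0;
+  return a_size < b_size ? -1 : 1;  // A strict prefix sorts first.
+}
+```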
+
+
+#### Future String Operations
+
+It may be desirable, at some future point, to allow various string
+manipulations, such as concatenation or repetition, at least for compile-time
+strings.
+
+A substring operation should be possible without requiring memory allocation.
+
+Indexing into a string (`str[offset]`) should be allowed if/when indexing into
+an array is finally supported.
+
+
+### Arrays of Strings
+
+In some cases, it may be desirable to have an array of strings, like:
+
+```
+struct Foo:
+ 0 [+100] ZString[10] list
+```
+
+Although somewhat awkward, the existing explicit-length syntax should work:
+
+```
+struct Foo:
+ 0 [+100] ZString:80[10] list # 10 10-byte (80-bit) strings
+```