Merge pull request #74 from reventlov/division_doc

Add design sketch for division operator.
diff --git a/doc/design_docs/division_and_modulus.md b/doc/design_docs/division_and_modulus.md
new file mode 100644
index 0000000..b546459
--- /dev/null
+++ b/doc/design_docs/division_and_modulus.md
@@ -0,0 +1,334 @@
+# Design Sketch: Integer Division and Modulus Operators
+
+## Overview
+
+Currently, Emboss does not support division, which limits what it can express
+in its expression language.  In particular, it is awkward to handle things like
+arrays of 16-bit values whose size is specified in bytes.
+
+The reason Emboss does not yet have support for division is that division is
+tricky to handle correctly.  This document attempts to explain the pitfalls and
+the proper solutions.
+
+Note that Emboss does not support non-integer arithmetic, so this document only
+talks about *integer* division.
+
+
+## Syntax
+
+### Symbol
+
+Many programming languages use `/` for all division operations, whether they
+are "integer" division, floating point division, rational arithmetic division,
+matrix division, or anything else.
+
+Very, very few languages (Haskell) use an operator other than `%` for
+modulus/remainder, and no popular language overloads `%` with a second meaning.
+
+(Note that Haskell defines `%` as *rational number division*.  `mod` and `rem`
+are used for modulus and remainder, respectively, and `div` and `quot` are used
+for truncating and flooring division.)
+
+`%` is technically modulus in some languages, and remainder in others, but in
+all cases the equality `x == (x / n * n) + x % n` holds, and does not seem to
+confuse programmers.
+
+
+#### C, C++, C#, Java, Go, Python 2, Rust, SQL, ...
+
+All of these languages use `/` for multiple kinds of division, including
+integer division.
+
+Python 2 deprecated `/` for integer division; in a Python 2 program with `from
+__future__ import division`, `/` is floating-point division.
+
+C++ and Python both allow `/` to be overloaded for user-defined types.
+
+
+#### JavaScript, TypeScript, Perl 5
+
+These languages use `/` for floating-point division, and do not directly
+support integer division; both operands and result must be cast to int:
+
+    ((l | 0) / (r | 0)) | 0  // JS, TS
+
+    int(int($l) / int($r))    # Perl
+
+
+#### OCaml, Haskell, Python 3
+
+These languages use separate operators for integer and non-integer division.
+
+Integer division:
+
+    l `div` r    -- Haskell
+    l / r        (* OCaml *)
+    l // r        # Python 3
+
+Non-integer division:
+
+    l / r        -- Haskell
+    l /. r       (* OCaml *)
+    l / r         # Python 3
+
+
+#### Emboss
+
+Emboss should use `//` for division, following Python 3's example.
+
+This is the only unambiguous integer division operator used by a TIOBE top-20
+language, *and* Python 3's `//` has the same division semantics (flooring
+division) as Emboss -- most other programming languages use truncating
+(round-toward-0) division.
+
+If Emboss ever gets support for floating-point arithmetic, it should probably
+use OCaml's `/.` operator for division on floats.
+
+Because `/` does different things in different popular languages (flooring
+integer division, truncating integer division, floating-point division, or
+multiple types, depending on operand types), Emboss should not use it.
+
+Emboss should use `%` for modulus, following the example of almost every other
+language.
+
+
+### Precedence
+
+In most programming languages, division is a left-associative operator with
+equal precedence to multiplication; i.e. this:
+
+    a * b / c * d
+
+is equivalent to this:
+
+    ((a * b) / c) * d
+
+However, a nontrivial percentage of programmers tend to read it as:
+
+    (a * b) / (c * d)
+
+This may be because, after grade school, most division is expressed in fraction
+form, like:
+
+    a * b
+    -----
+    c * d
+
+Interestingly, fewer programmers seem to mis-read extra division:
+
+    a * b / c / d
+
+is usually (correctly) parsed as:
+
+    ((a * b) / c) / d
+
+Given that confusion, in Emboss, the construct:
+
+    a * b / c * d
+
+should be a syntax error.
+
+The constructs:
+
+    a * b / c / d
+    a * b / c + d
+
+could be syntax errors, although the second one (`... + d`), in particular,
+seems to get parsed correctly by all programmers in my polling.
+
+This is in keeping with two general rules of Emboss design:
+
+1.  It is better to lock things down at first and gradually ease restrictions
+    than the other way around.
+2.  Emboss syntax should be as unambiguous as possible, even when it makes
+    implementation trickier or `.emb` files slightly wordier.  (See, for
+    example, the "separate-but-equal" precedence of the `&&` and `||`
+    operators in Emboss.)
+
+
+## Semantics
+
+### Flooring Division vs Truncating Division
+
+Roughly speaking, there are two common forms of "integer division:" *flooring*
+(sometimes called *Euclidean*) and *truncating* (sometimes called
+*round-toward-0*) division.
+
+| Operation  | Flooring Result  | Truncating Result |
+| ---------- | ---------------- | ----------------- |
+| +8 / +3    | 2                | 2                 |
+| +8 / -3    | -3               | -2                |
+| -8 / +3    | -3               | -2                |
+| -8 / -3    | -3               | 2                 |
+| +8 % +3    | 2                | 2                 |
+| +8 % -3    | -1               | 2                 |
+| -8 % +3    | 1                | -2                |
+| -8 % -3    | -2               | -2                |
+
+Most programming languages and most CPUs implement truncating division,
+[however, truncating division is irregular around
+0](https://dl.acm.org/doi/pdf/10.1145/128861.128862), which prevents some kinds
+of expression rewrites (e.g., `(x + 6) / 3` is not the same as `x / 3 + 2` when
+`x == -4`) and would force Emboss's bounds tracking to have wider bounds in
+some common cases, such as modulus by a known constant, where the dividend could
+be either positive or negative.
+
+Emboss should use flooring division.
+
+
+### Undefined Results
+
+Division is the first operation in Emboss's expression language which can have
+an undefined result: all other operations are total.
+
+Emboss does have some notion of "invalid value," which it uses at runtime to
+handle fields whose backing store does not exist or whose existence condition is
+false, however, "invalid value" is not fully propagated in the expression type
+system, and it has a somewhat different meaning than "undefined."
+
+Currently, integer types in the Emboss expression language are sets of the
+form:
+
+    {x ∈ ℤ | (min ≤ x ∨ min = -infinity) ∧
+             (x ≤ max ∨ max = infinity) ∧
+             x MOD m = n ∧
+             0 ≤ n < m ∧
+             min ∈ ℤ ∪ {-infinity} ∧
+             max ∈ ℤ ∪ {infinity} ∧
+             m ∈ ℤ ∪ {infinity} ∧
+             n ∈ ℤ}
+
+where `min`, `max`, `m`, and `n` are parameters.  The special values `infinity`
+and `-infinity` for `max` and `min` indicate that there is no known upper or
+lower bound, and the special value `infinity` for `m` indicates that `x` has a
+constant value equal to `n`.
+
+This will have to change to something like:
+
+    {x ∈ ℤ ∪ {undefined} | ((can_be_undefined ∧ x = undefined) ∨
+                            (min ≤ x ≤ max ∧ x MOD m = n)) ∧
+                           0 ≤ n < m ∧
+                           can_be_undefined ∈ 𝔹 ∧
+                           min ∈ ℤ ∪ {-infinity, undefined} ∧
+                           max ∈ ℤ ∪ {infinity, undefined} ∧
+                           m ∈ ℤ ∪ {infinity} ∧
+                           n ∈ ℤ ∪ {undefined}}
+
+where `can_be_undefined` is a new boolean parameter.
+
+This also leaks out into boolean types, which are currently either {true,
+false}, {true}, or {false}: they can now be {true, false}, {true}, {false},
+{true, false, undefined}, {true, undefined}, {false, undefined}, or
+{undefined}.
+
+
+### Expression Bound Calculations
+
+#### Division
+
+For an expression x of the form `l // r`, Emboss must figure out the five
+parameters of the type of x (`x_min`, `x_max`, `x_m`, `x_n`,
+`x_can_be_undefined`) from the five parameters of each of the types of `l` and
+`r`.
+
+    x_can_be_undefined = l_can_be_undefined ∨
+                         r_can_be_undefined ∨
+                         r_min ≤ 0 ≤ r_max
+
+That is, `x` can be undefined if either input is undefined, or if there could
+be division by zero.
+
+If `r_min = 0 = r_max`, then:
+
+    x_max = undefined
+    x_min = undefined
+    x_m = infinity
+    x_n = undefined
+
+Otherwise, the remaining parameters can be computed by parts, although for
+`x_max` and `x_min` it is simpler to simply check all pairs of {`l_min`, `-1`,
+`1`, `l_max`} // {`r_min`, `-1`, `1`, `r_max`} (removing zeroes from the
+right-hand side, and removing `-1` and `1` if they do not fall between the
+`min` and `max`).
+
+`x_m` is:
+
+    infinity      if l_m = r_m = infinity else
+    GCD(l_m, r_n) if r_m = infinity       else
+    1
+
+`x_n` is
+
+    l_n // r_n                     if l_m = r_m = infinity else
+    (l_n // r_n) MOD GCD(l_m, r_n) if r_m = infinity       else
+    0
+
+
+#### Modulus
+
+Modulus is somewhat simpler.
+
+    x_can_be_undefined = l_can_be_undefined ∨
+                         r_can_be_undefined ∨
+                         r_min ≤ 0 ≤ r_max
+
+If `l_m = r_m = infinity`, `x_m = infinity` and `x_min = x_max = x_n = l_n %
+r_n`.
+
+Otherwise:
+
+    x_max = max(0, r_max - 1)
+    x_min = min(0, r_min + 1)
+    x_m = 1
+    x_n = 0
+
+Note that, as with some other bounds calculations in Emboss, there are some
+special cases where `l` or `r` is a very small set where it is technically
+possible to find a narrower bound: for example, if 6 ≤ l ≤ 7, l % 4 could have
+the bounds `x_max = 3` and `x_min = 2`.  However, Emboss does not need to find
+the absolute tightest bound; it only needs to find a reasonable bound.
+
+
+## Other Notes
+
+### `IsComplete()`
+
+The *intention* of `IsComplete()` is that it should return `true` iff adding
+more bytes cannot change the result of `Ok()` -- that is, iff the structure is
+"complete" in the sense that there are enough bytes to hold the structure, even
+if the structure is broken in some other way.
+
+If the size of the structure is a dynamic value which involves division by
+zero, the second definition of `IsComplete()` becomes ill-defined, in that it
+becomes impossible to know if there are enough bytes for the structure:
+
+    struct Foo:
+      0 [+2]               UInt      divisor
+      2 [+512 // divisor]  UInt:8[]  payload
+
+If `divisor == 0`, the size of `payload` is undefined, so the structure's
+completeness is undefined.
+
+On the other hand, if `divisor == 0`, then the structure can never be `Ok()`,
+so in that sense it is 'complete': there is no way to add more bytes and get a
+structure that is functional.  There may be another way for the client software
+to recover, such as scanning an input stream for a new start-of-message marker.
+
+I (bolms@) think that the best option is:
+
+1.  Add a new method `Maybe<bool> IsUndefined()` to virtual field views (and
+    maybe to `UIntView`, `IntView`, and `FlagView`).
+2.  If `IntrinsicSizeInBytes().IsUndefined()`, `IsComplete()` should return
+    `true`.  (Note that `IntrinsicSizeInBytes()` is just the way the
+    `$size_in_bytes` implicit virtual field is spelled in C++.)
+    *   This will require a bunch of plumbing into the whole expression
+        generation logic -- probably changing `Maybe<T>` to have a 'why not'
+        reason when there is no `T`.
+3.  Otherwise, proceed with the current `IsComplete()` logic.
+4.  Update the documentation for `IntrinsicSizeInBytes()` and `IsComplete()` to
+    note this caveat.
+
+In practice, it is rare to have a dynamic value that could be zero as a divisor
+in a field position or length: such divisions are almost always either division
+by a constant or division by one of a small set of known divisors, often powers
+of 2 (and in many of those cases, a shift operation would fit better).