blob: b54645978950417c15ceb37f29233b1a729d5726 [file] [log] [blame] [view]
# Design Sketch: Integer Division and Modulus Operators
## Overview
Currently, Emboss does not support division, which limits what it can express
in its expression language. In particular, it is awkward to handle things like
arrays of 16-bit values whose size is specified in bytes.
The reason Emboss does not yet have support for division is that division is
tricky to handle correctly. This document attempts to explain the pitfalls and
the proper solutions.
Note that Emboss does not support non-integer arithmetic, so this document only
talks about *integer* division.
## Syntax
### Symbol
Many programming languages use `/` for all division operations, whether they
are "integer" division, floating point division, rational arithmetic division,
matrix division, or anything else.
Very, very few languages (Haskell) use an operator other than `%` for
modulus/remainder, and no popular language overloads `%` with a second meaning.
(Note that Haskell defines `%` as *rational number division*. `mod` and `rem`
are used for modulus and remainder, respectively, and `div` and `quot` are used
for truncating and flooring division.)
`%` is technically modulus in some languages, and remainder in others, but in
all cases the equality `x == (x / n * n) + x % n` holds, and does not seem to
confuse programmers.
#### C, C++, C#, Java, Go, Python 2, Rust, SQL, ...
All of these languages use `/` for multiple kinds of division, including
integer division.
Python 2 deprecated `/` for integer division; in a Python 2 program with `from
__future__ import division`, `/` is floating-point division.
C++ and Python both allow `/` to be overloaded for user-defined types.
#### JavaScript, TypeScript, Perl 5
These languages use `/` for floating-point division, and do not directly
support integer division; both operands and result must be cast to int:
((l | 0) / (r | 0)) | 0 // JS, TS
int(int($l) / int($r)) # Perl
#### OCaml, Haskell, Python 3
These languages use separate operators for integer and non-integer division.
Integer division:
l `div` r -- Haskell
l / r (* OCaml *)
l // r # Python 3
Non-integer division:
l / r -- Haskell
l /. r (* OCaml *)
l / r # Python 3
#### Emboss
Emboss should use `//` for division, following Python 3's example.
This is the only unambiguous integer division operator used by a TIOBE top-20
language, *and* Python 3's `//` has the same division semantics (flooring
division) as Emboss -- most other programming languages use truncating
(round-toward-0) division.
If Emboss ever gets support for floating-point arithmetic, it should probably
use OCaml's `/.` operator for division on floats.
Because `/` does different things in different popular languages (flooring
integer division, truncating integer division, floating-point division, or
multiple types, depending on operand types), Emboss should not use it.
Emboss should use `%` for modulus, following the example of almost every other
language.
### Precedence
In most programming languages, division is a left-associative operator with
equal precedence to multiplication; i.e. this:
a * b / c * d
is equivalent to this:
((a * b) / c) * d
However, a nontrivial percentage of programmers tend to read it as:
(a * b) / (c * d)
This may be because, after grade school, most division is expressed in fraction
form, like:
a * b
-----
c * d
Interestingly, fewer programmers seem to mis-read extra division:
a * b / c / d
is usually (correctly) parsed as:
((a * b) / c) / d
Given that confusion, in Emboss, the construct:
a * b / c * d
should be a syntax error.
The constructs:
a * b / c / d
a * b / c + d
could be syntax errors, although the second one (`... + d`), in particular,
seems to get parsed correctly by all programmers in my polling.
This is in keeping with two general rules of Emboss design:
1. It is better to lock things down at first and gradually ease restrictions
than the other way around.
2. Emboss syntax should be as unambiguous as possible, even when it makes
implementation trickier or `.emb` files slightly wordier. (See, for
example, the "separate-but-equal" precedence of the `&&` and `||`
operators in Emboss.)
## Semantics
### Flooring Division vs Truncating Division
Roughly speaking, there are two common forms of "integer division:" *flooring*
(sometimes called *Euclidean*) and *truncating* (sometimes called
*round-toward-0*) division.
| Operation | Flooring Result | Truncating Result |
| ---------- | ---------------- | ----------------- |
| +8 / +3 | 2 | 2 |
| +8 / -3 | -3 | -2 |
| -8 / +3 | -3 | -2 |
| -8 / -3 | -3 | 2 |
| +8 % +3 | 2 | 2 |
| +8 % -3 | -1 | 2 |
| -8 % +3 | 1 | -2 |
| -8 % -3 | -2 | -2 |
Most programming languages and most CPUs implement truncating division,
[however, truncating division is irregular around
0](https://dl.acm.org/doi/pdf/10.1145/128861.128862), which prevents some kinds
of expression rewrites (e.g., `(x + 6) / 3` is not the same as `x / 3 + 2` when
`x == -4`) and would force Emboss's bounds tracking to have wider bounds in
some common cases, such as modulus by a known constant, where the dividend could
be either positive or negative.
Emboss should use flooring division.
### Undefined Results
Division is the first operation in Emboss's expression language which can have
an undefined result: all other operations are total.
Emboss does have some notion of "invalid value," which it uses at runtime to
handle fields whose backing store does not exist or whose existence condition is
false, however, "invalid value" is not fully propagated in the expression type
system, and it has a somewhat different meaning than "undefined."
Currently, integer types in the Emboss expression language are sets of the
form:
{x ∈ ℤ | (min ≤ x ∨ min = -infinity) ∧
(x ≤ max ∨ max = infinity) ∧
x MOD m = n ∧
0 ≤ n < m ∧
min ∈ ℤ ∪ {-infinity} ∧
max ∈ ℤ ∪ {infinity} ∧
m ∈ ℤ ∪ {infinity} ∧
n ∈ ℤ}
where `min`, `max`, `m`, and `n` are parameters. The special values `infinity`
and `-infinity` for `max` and `min` indicate that there is no known upper or
lower bound, and the special value `infinity` for `m` indicates that `x` has a
constant value equal to `n`.
This will have to change to something like:
{x ∈ ℤ ∪ {undefined} | ((can_be_undefined ∧ x = undefined) ∨
(min ≤ x ≤ max ∧ x MOD m = n)) ∧
0 ≤ n < m ∧
can_be_undefined ∈ 𝔹 ∧
min ∈ ℤ ∪ {-infinity, undefined} ∧
max ∈ ℤ ∪ {infinity, undefined} ∧
m ∈ ℤ ∪ {infinity} ∧
n ∈ ℤ ∪ {undefined}}
where `can_be_undefined` is a new boolean parameter.
This also leaks out into boolean types, which are currently either {true,
false}, {true}, or {false}: they can now be {true, false}, {true}, {false},
{true, false, undefined}, {true, undefined}, {false, undefined}, or
{undefined}.
### Expression Bound Calculations
#### Division
For an expression x of the form `l // r`, Emboss must figure out the five
parameters of the type of x (`x_min`, `x_max`, `x_m`, `x_n`,
`x_can_be_undefined`) from the five parameters of each of the types of `l` and
`r`.
x_can_be_undefined = l_can_be_undefined ∨
r_can_be_undefined ∨
r_min ≤ 0 ≤ r_max
That is, `x` can be undefined if either input is undefined, or if there could
be division by zero.
If `r_min = 0 = r_max`, then:
x_max = undefined
x_min = undefined
x_m = infinity
x_n = undefined
Otherwise, the remaining parameters can be computed by parts, although for
`x_max` and `x_min` it is simpler to simply check all pairs of {`l_min`, `-1`,
`1`, `l_max`} // {`r_min`, `-1`, `1`, `r_max`} (removing zeroes from the
right-hand side, and removing `-1` and `1` if they do not fall between the
`min` and `max`).
`x_m` is:
infinity if l_m = r_m = infinity else
GCD(l_m, r_n) if r_m = infinity else
1
`x_n` is
l_n // r_n if l_m = r_m = infinity else
(l_n // r_n) MOD GCD(l_m, r_n) if r_m = infinity else
0
#### Modulus
Modulus is somewhat simpler.
x_can_be_undefined = l_can_be_undefined ∨
r_can_be_undefined ∨
r_min ≤ 0 ≤ r_max
If `l_m = r_m = infinity`, `x_m = infinity` and `x_min = x_max = x_n = l_n %
r_n`.
Otherwise:
x_max = max(0, r_max - 1)
x_min = min(0, r_min + 1)
x_m = 1
x_n = 0
Note that, as with some other bounds calculations in Emboss, there are some
special cases where `l` or `r` is a very small set where it is technically
possible to find a narrower bound: for example, if 6 ≤ l ≤ 7, l % 4 could have
the bounds `x_max = 3` and `x_min = 2`. However, Emboss does not need to find
the absolute tightest bound; it only needs to find a reasonable bound.
## Other Notes
### `IsComplete()`
The *intention* of `IsComplete()` is that it should return `true` iff adding
more bytes cannot change the result of `Ok()` -- that is, iff the structure is
"complete" in the sense that there are enough bytes to hold the structure, even
if the structure is broken in some other way.
If the size of the structure is a dynamic value which involves division by
zero, the second definition of `IsComplete()` becomes ill-defined, in that it
becomes impossible to know if there are enough bytes for the structure:
struct Foo:
0 [+2] UInt divisor
2 [+512 // divisor] UInt:8[] payload
If `divisor == 0`, the size of `payload` is undefined, so the structure's
completeness is undefined.
On the other hand, if `divisor == 0`, then the structure can never be `Ok()`,
so in that sense it is 'complete': there is no way to add more bytes and get a
structure that is functional. There may be another way for the client software
to recover, such as scanning an input stream for a new start-of-message marker.
I (bolms@) think that the best option is:
1. Add a new method `Maybe<bool> IsUndefined()` to virtual field views (and
maybe to `UIntView`, `IntView`, and `FlagView`).
2. If `IntrinsicSizeInBytes().IsUndefined()`, `IsComplete()` should return
`true`. (Note that `IntrinsicSizeInBytes()` is just the way the
`$size_in_bytes` implicit virtual field is spelled in C++.)
* This will require a bunch of plumbing into the whole expression
generation logic -- probably changing `Maybe<T>` to have a 'why not'
reason when there is no `T`.
3. Otherwise, proceed with the current `IsComplete()` logic.
4. Update the documentation for `IntrinsicSizeInBytes()` and `IsComplete()` to
note this caveat.
In practice, it is rare to have a dynamic value that could be zero as a divisor
in a field position or length: such divisions are almost always either division
by a constant or division by one of a small set of known divisors, often powers
of 2 (and in many of those cases, a shift operation would fit better).