Stricter Schemas with Editions

Author: @mcy

Approved: 2022-11-28

Overview

The Protobuf language is surprisingly lax in what it allows in some places, even though these corners of the syntax space are rarely exercised in real use, and which add complexity to backends and runtimes.

This document describes several such corners in the language, and how we might use Editions to fix them (spoiler: we'll add a feature for each one and then ratchet the features).

This is primarily a memo on a use-case for Editions, and not a design doc per se.

Potential Lints

Entity Names

Protobuf does not enforce any constraints on names other than the “ASCII identifier” rule: they must match the regex [A-Za-z_][A-Za-z0-9_]*. This results in problems for backends:

  • Backends need to be able to convert between PascalCase, camelCase, snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky.
  • Extraneous underscores, such as underscores in names that want to be PascalCase, trailing underscores, leading underscores, and repeated underscores create problems for case conversion and can clash with private names generated by backends.
  • Protobuf does not support non-ASCII identifiers, mostly out of inertia more than anything else. Because some languages (Java most prominent among them) do not support them, we can never support them, but we are not particularly clear on this point.

The Protobuf language should be as strict as possible in what patterns it accepts for identifiers, since these need to be transformed to many languages. Thus, we propose the following regexes for the three casings used in Protobuf:

  • ([A-Z][a-zA-Z0-9]*)+ for PascalCase. We require this case for:
    • Messages.
    • Enums.
    • Services.
    • Methods.
  • [a-z][a-z0-9]*(_[a-z0-9]+)* for snake_case. We require this case for:
    • Fields (including extensions).
    • Package components.
  • [A-Z][A-Z0-9]*(_[A-Z0-9]+)* for SHOUTY_CASE. We require this case for:
    • Enum values.

These patterns are intended to reject extraneous underscores, and to make casing of ASCII letters consistent. We explicitly only support ASCII for maximal portability to target languages. Note that option names are not included, since those are defined as fields in a proto, and would be subject to this rule automatically.

To migrate, we would introduce a bool feature feature.relax_identifier_rules, which can be applied to any entity. When set, it would cause the compiler to reject .proto files which contain identifiers that don't match the above constraints. It would default to true and would switch to false in a future edition.

Keywords as Identifiers

Currently, the Protobuf language allows using keywords as identifiers. This makes the parser somewhat more complicated than it has to be for minimal benefit, and shadowing behavior is not well-specified. For example, what does the following compile as?

message Foo {
  message int32 {}
  optional int32 foo = 1;
}

This is particularly fraught in places where either a keyword or a type name can follow. For example, optional foo = 1; is a proto3 non-optional with type optional, but the parser can't tell until it sees the = sign.

To avoid this and eventually stop supporting this in the parser, we make the following set of keywords true reserved names that cannot be used as identifiers:

bool      bytes    double  edition   enum      extend    extensions  fixed32
fixed64   float    group   import    int32     int64     map         max
message   oneof    option  optional  package   public    repeated    required
reserved  returns  rpc     service   sfixed32  sfixed64  sint32      sint64
stream    string   syntax  to        uint32    uint64    weak

Additionally, we introduce the syntax #optional for escaping a keyword as an identifier. This may only be used on keywords, and not non-keyword identifiers.

To migrate, we would introduce a bool feature feature.keywords_as_identifiers, which can be applied to any entity. When set, it would cause the compiler to reject .proto files which contain identifiers that use the names of keywords. It would migrate true->false in a future edition. The #optional syntax would not need to be feature-gated.

From time to time we may introduce new keywords. The best procedure for doing so is to add a feature.xxx_is_a_keyword feature, start it out as true, and then switch it to false in an edition, which would cause it to be treated as a keyword for the purposes of this check. There's nothing stopping us from starting to use it in the syntax without an edition if it would be relatively unambiguous (i.e., a “contextual” keyword). Rust provides guidance here: they really hate contextual keywords since it complicates the parser, so keywords start out as contextual and become properly reserved in the next Rust edition.

Nonempty Package

Right now, an empty package is technically permitted. We should remove this functionality from the language completely and require every file to declare a package.

We would introduce a feature like feature.allow_missing_package, start it out as true, and switch it to false.

Invalid Names in reserved

Currently, reserved "foo-bar"; is accepted. It is not a valid name for a field and thus should be rejected. Ideally we should remove this syntax altogether and only permit the use of identifiers in this position, such as reserved foo, bar;.

We would introduce a feature like feature.allow_strings_in_reserved, start it out as true, and then switch it to false.

Almost All Names are Fully Qualified

Right now, Protobuf defines a complicated name resolution scheme that involves matching subsets of names inspired by that of C++ (which is even more complicated than ours!). Instead, we should require that every name be either a single identifier OR fully-qualified. This is an attempt to move to Go-style name resolution, which is significantly simpler to implement and explain.

In particular, if a name is a single identifier, then:

  • It must be the name of a type defined at the top level of the current file.
  • If it is the name of a message or enum for a field's type, it may be the name of a type defined in the current message. This does not apply to extension fields.

Because any multi-component path must be fully qualified, we no longer need the .foo.Bar syntax anymore, except to refer to messages defined in files without a package. We forbid .-prefixed names except in that case.

We would introduce a feature like features.use_cpp_style_name_resolution, start it out as true, and then switch it to false.

Ideally, if we get strict identifier names, we can tell that Foo.Bar is rooted at a message, rather than a package. In that case, we could go as far as saying that “names that start with a lower-case letter are fully-qualified, otherwise they are relative to the current package, and will only find things defined in the current file.”

Unlike Go, we do not allow finding things in other packages without being fully-qualified; this mostly comes from doing source-diving in very large packages, like the Go runtime, where it is very hard to find where something is defined.

Unique Enum Values

Right now, we allow aliases in enums:

enum Foo {
  BAR = 5;
  BAZ = 5;
}

This results in significant complexity in some parts of the backend, and weird behavior in textproto and JSON. We should disallow this.

We would introduce a feature like features.allow_enum_aliases, which would switch from true to false.

Imports are Used

We should adopt the Go rule that all non-public imports are used (i.e, every import provides at least one type referred to in the file).

We would introduce a feature like features.allow_unused_imports, which would switch from true to false.

Next Field # is Reserved

There's a few idioms for this checked by linters, such as // Next ID: N. We should codify this in the language by rewriting that every message begin with reserved N to max;, with the intent that N is the next never-used field number. Because it is required to be the first production in the message, it can be

We could, additionally, require that every field number be either used or reserved, in addition to having a single N to max; reservation. Alternatively, we could require that every field number up to the largest one used be reserved; gaps between message numbers are usually a smell.

This applies equally to message fields and enum values.

We would introduce a feature like features.allow_unused_numbers, which we would switch from true to false.

Disallow Implicit String Concatenation

Protobuf will implicitly concatenate two adjacent strings in any place it allows quoted strings, e.g. option foo = "bar " "baz;. This has caused interesting problems around reserved in the past, if a comma is omitted: reserved "foo" "bar"; is reserved "foobar";.

We would introduce a feature like features.concatenate_adjacent_strings, which would switch from true to false.

Package Is First

The package declaration can appear anywhere in the file after syntax or edition. We should take cues from Go and require it to be the first thing in the file, after the edition.

We would introduce a feature like features.package_anywhere, which would switch from true to false.

Strict Boolean Options

Boolean options can use true, false, True, False, T, or F as a value: option my_bool = T;. We should restrict to only true and false.

We would introduce a feature like features.loose_bool_options, which would switch from true to false.

Decimal Field Numbers

We permit non-decimal integer literals for field numbers, e.g. optional int32 x = 0x01;. Thankfully(?) we do not already permit a leading + or -. We should require decimal literals, since there is very little reason to allow other literals and makes the Protobuf language harder to parse.

We would introduce a feature like features.non_decimal_field_numbers, which would switch from true to false.