Editions: Group Migration Issues

Summary

Address some unexpected issues in delimited encoding in edition 2023 before its OSS release.

Background

Joshua Humphries reported some well-timed issues discovered while experimenting with our early release of Edition 2023. He discovered that our new message encoding feature piggybacked a bit too much on the old group logic, and actually ended up being virtually useless in general.

None of our testing or migrations caught this because they were heavily focused on preserving old behavior (which is the primary goal of edition 2023). Delimited messages structured exactly like proto2 groups (e.g. message and field in the same scope with matching names) continued to work exactly as before, making it seem like everything was fine.

All of this is especially problematic in light of Submessages: In Pursuit of a More Perfect Encoding (not available externally yet), which intends to migrate the ecosystem to use delimited encoding everywhere. Releasing a semi-broken feature as a migration tool to eliminate a deprecated syntax is one thing, but trying to push the ecosystem to it is especially bad.

Overview

The problems here stem from the fact that before edition 2023, the field and type names of group fields were guaranteed to always be unique and intuitive. Proto2 splits groups into a synthetic nested message with a type name equivalent to the group specification (required to be capitalized), and a field name that's fully lowercased. For example,

optional group MyGroup = 1 { ... }

would become:

message MyGroup { ... }
optional MyGroup mygroup = 1;

The casing here is very important, since the transformation is irreversible. We can't recover the group name from the field name in general, only if the group is a single word.
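To make the irreversibility concrete, here is a minimal Python sketch. The helper names are ours for illustration only and are not part of protoc; it assumes the field name is simply the lowercased group name, as described above.

```python
def group_field_name(group_name: str) -> str:
    """Derive the proto2 field name for a group: the fully lowercased group name."""
    return group_name.lower()


def guess_group_name(field_name: str) -> str:
    """Best-effort inverse (hypothetical): capitalize only the first letter.

    This is only right when the group name is a single word, because the
    positions of any later capital letters were lost when lowercasing.
    """
    return field_name.capitalize()


# Two distinct group names collapse to the same field name.
same_field = group_field_name("MyGroup") == group_field_name("Mygroup")

# The round trip works for a single word but not for a multi-word name.
single_word_ok = guess_group_name(group_field_name("Foo")) == "Foo"
multi_word_ok = guess_group_name(group_field_name("MyGroup")) == "MyGroup"
```

Here `single_word_ok` is true while `multi_word_ok` is false, which is why tooling cannot reconstruct the type name from the field name in general.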

The problem under edition 2023 is that we've removed the generation of synchronized synthetic messages from the language. Users now explicitly define messages, and any message field can be marked DELIMITED. This means that anyone assuming that the type and field name are synchronized could now be broken.

Codegen

While using the field name for generated APIs required less special-casing in the generators, the field name ends up producing slightly-less-readable APIs for multi-word camelcased groups. The result is that we see a fairly random-seeming mix in different generators. Using protoc-explorer (not available externally), we find the following:

* This codegen difference was caught during the implementation and intentionally “fixed” in Edition 2023.
** This includes all upb-based runtimes as well (e.g. Ruby, Rust, etc.)
† Extensions use field

In the Dart V1 implementation, we decided to intentionally introduce a behavior change on editions upgrades. It was determined that this only affected a handful of protos in google3, and could probably be manually fixed as-needed. Java's handling changes the story significantly, since over 50% of protos in google3 produce generated Java code. Objective-C is also noteworthy since we open-source it, and Swift because it's widely used in OSS and we don't own it.

While the editions upgrade is still non-breaking, it means that the generated APIs could have very surprising spellings and may not be unique. For example, using the same type for two delimited fields in the same containing message will create two sets of generated APIs with the same name in some languages!
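The collision can be sketched with a simplified model. This is not any real generator's naming logic: `api_base_names` and the `Payload` type are hypothetical, and we only assume that a generator builds its accessor names from either the type name or the field name.

```python
from collections import Counter


def api_base_names(fields, use_type_name):
    """Return the name each field's generated accessors would be built from.

    `fields` is a list of (field_name, type_name) pairs for one message.
    """
    return [type_name if use_type_name else field_name
            for field_name, type_name in fields]


def duplicates(names):
    """Return the sorted names that appear more than once."""
    return sorted(n for n, count in Counter(names).items() if count > 1)


# Two delimited fields of the same message type in one containing message.
fields = [("first", "Payload"), ("second", "Payload")]

type_based = duplicates(api_base_names(fields, use_type_name=True))
field_based = duplicates(api_base_names(fields, use_type_name=False))
```

With type-based naming both fields map to `Payload`, so `type_based` is non-empty. Field-based naming stays unique because duplicate field names are disallowed.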

Text Format

Our “official” draft specification of text-format explicitly states that group messages are encoded by the message name, rather than the lowercased field name. A group MyGroup will be serialized as:

MyGroup {
  ...
}

In C++, we always serialize the message name and have special handling to only accept the message name in parsing. We also have conformance tests locking down the positive path here (i.e. using the message name round-trip). The negative path (i.e. failing to accept the field name) doesn't have a conformance test, but C++/Java/Python all agree and there's no known case that doesn't.

To make things even stranger, for extensions (group fields extending other messages), we always use the field name for groups. So as far as group extensions are concerned, there's no problem for editions.

There are a few problems with non-extension group fields in editions:

Refactoring the message name will change any text-format output
New delimited fields will have unexpected text-format output that could conflict with other fields
Text parsers will expect the message name, which is surprising and could be impossible to specify uniquely

Recommendation

Clearly the end-state we want is for the field name to be used in all generated APIs, and for text-format serialization/parsing. The only questions are: how do we get there and can/should we do it in time for the 2023 release in 27.0 next month?

We propose a combination of the alternatives listed below. Smooth Extension seems like the best short-term path forward to unblock the delimited migration. It mostly solves the problem and doesn't require any new features. The necessary changes for this approach have already been prepared, along with new conformance tests to lock down the behavior changes.

Global Feature is a good long-term mitigation for tech debt we're leaving behind with Smooth Extension. Ultimately we would like to remove any labeling of fields by their type, and editions provides a good mechanism to do this. Alternatively, we could implement aliases and use that to unify this old behavior and avoid a new feature. Either of these options will be the next step after the release of 2023, with aliases being preferred as long as the timing works out.

If we hit any unexpected delays, Nerf Delimited Encoding in 2023 (not available externally) is the quickest path forward to unblock the release of edition 2023. It has a lot of downsides though, and will block any migration towards delimited encoding until edition 2024 has started rolling out.

Alternatives

Smooth Extension

Instead of trying to change the existing behavior, we could expand the current spec to try to cover both proto2 and editions. We would define a “group-like” concept, which applies to all fields which:

1. Have DELIMITED encoding
2. Have a type corresponding to a nested message directly under its containing message
3. Have a name corresponding to its lowercased type name

Note that proto2 groups will always be “group-like.”
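The three conditions can be sketched as a predicate over a simplified field model. The `Field` class and its attribute names below are assumptions made for illustration and do not mirror any runtime's descriptor API. Extensions are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified stand-in for a field descriptor."""
    name: str                # field name, e.g. "mygroup"
    containing_message: str  # full name of the message declaring the field
    type_full_name: str      # full name of the field's message type
    delimited: bool          # True if the field uses DELIMITED encoding


def is_group_like(field: Field) -> bool:
    # Condition 1: the field uses DELIMITED encoding.
    if not field.delimited:
        return False
    # Condition 2: the type is nested directly under the containing message.
    parent, _, short_name = field.type_full_name.rpartition(".")
    if parent != field.containing_message:
        return False
    # Condition 3: the field name is the lowercased type name.
    return field.name == short_name.lower()


# Shaped exactly like a proto2 group, so it is group-like.
legacy = Field("mygroup", "pkg.Outer", "pkg.Outer.MyGroup", True)
# A new editions-style field with a snake_case name is not.
renamed = Field("my_group", "pkg.Outer", "pkg.Outer.MyGroup", True)
# A type declared outside the containing message is not.
sibling = Field("mygroup", "pkg.Outer", "pkg.MyGroup", True)
```

Only `legacy` satisfies all three conditions. `renamed` and `sibling` would be treated as regular fields.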

For any group-like field we will use the old proto2 semantics, whatever they are today. Otherwise, we will treat them as regular fields for both codegen and text-format. This means that most new cases of delimited encoding will have the desired behavior, while all old groups will continue to function. The main exception here is that users will see the unexpected proto2 behavior if they have message/field names that happen to match.

While the old behavior will result in some unexpected capitalization when it's hit, it's mostly safe. Because of 2 and 3 (and the fact that we disallow duplicate field names), we can guarantee that in both codegen and text encoding there will never be any conflicting symbols. There can never be two delimited fields of the same type using the old behavior, and no other messages or fields will exist with either spelling.

Additionally, we will update the text parsers to accept both the old message-based spelling and the new field-based spelling for group-like fields. This will at least prevent parsing failures if users hit this unexpected change in behavior.
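As a rough sketch of the resulting text-format rules, the function below returns the name a serializer would emit and the set of names a parser would accept. `text_format_names` is a hypothetical helper, not a real runtime API, and it takes the group-like decision as an input rather than computing it.

```python
def text_format_names(field_name, type_name, group_like):
    """Return (name_to_emit, names_to_accept) for a non-extension field.

    Assumes group-like fields keep the proto2 behavior of emitting the
    message name, while parsers accept either spelling.
    """
    if group_like:
        return type_name, {type_name, field_name}
    # Regular fields are always identified by the field name.
    return field_name, {field_name}


# A group-like field: emit "MyGroup", accept "MyGroup" or "mygroup".
legacy_emit, legacy_accept = text_format_names("mygroup", "MyGroup", True)

# A new delimited field that is not group-like: only the field name is used.
new_emit, new_accept = text_format_names("my_group", "MyGroup", False)
```

Accepting both spellings on the parsing side is what leaves room to later switch the emitted name without breaking existing readers.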

Pros

Fully supports old proto2 behavior
Treats most new editions fields correctly
Doesn't allow for any of the problematic cases we see today
By updating the parsers to accept both, we have a migration path to change the “wire”-format
Decoupled from editions launch (since it's a non-breaking change w/o a feature)

Cons

Requires coordinated changes in every editions-compatible runtime (and many generators)
Keeps the old proto2 behavior around indefinitely, with no path to remove it
Plants a surprising edge case for users if they happen to name their messages/fields a certain way

Global Feature

The simplest answer here is to introduce a new global message feature legacy_group_handling to control all the changes we'd like. This will only be applicable to group-like fields (see Smooth Extension). With this feature enabled, these fields will always use their message name for text-format. Each non-conformant language could also use this feature to gate the codegen rules.

Pros

Simple boolean to gate all the behavior changes
Doesn't require adding language features to a bunch of languages that don't have them yet
Uses editions to ratchet down the bad behavior

Cons

It's a little late in the game to be introducing new features to 2023 (go/edition-lifetimes)
Requires coordinated changes in every editions-compatible runtime (and many generators)
The migration story for users is unclear. Overriding the value of this feature is both a “wire”-breaking and API-breaking change they may not be able to do easily.
With the feature set, users will still see all of the problems we have today

Feature Suite

An extension of Global Feature would be to split the codegen changes out into separate per-language features.

Pros

Simple booleans to gate all the distinct behavior changes
Uses editions to ratchet down the bad behavior
Better migration story for users, since it separates API and “wire” breaking changes

Cons

Requires a whole slew of new language features, which typically have a difficult first-time setup
Requires coordinated changes in every editions-compatible runtime (and many generators)
Increases the complexity of edition 2023 significantly
With the features set, users will still see all of the problems we have today

Nerf Delimited Encoding in 2023

A quick fix to avoid releasing a bad feature would be to simply ban the case where the message and field names don't match. Adding this validation to protoc would cover the majority of cases, although we might want additional checks in every language that supports dynamic messages.
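A minimal sketch of such a check follows, assuming "match" means the field name equals the lowercased message name as in proto2 groups. The function name and error wording are hypothetical and do not reflect protoc's actual diagnostics.

```python
def check_delimited_field(field_name: str, type_name: str) -> None:
    """Reject a DELIMITED field whose name is not the lowercased type name."""
    expected = type_name.lower()
    if field_name != expected:
        raise ValueError(
            f"delimited field '{field_name}' of type '{type_name}' "
            f"must be named '{expected}'"
        )


# Accepted: structured exactly like a proto2 group.
check_delimited_field("mygroup", "MyGroup")

# Rejected: the names do not match.
try:
    check_delimited_field("my_group", "MyGroup")
    rejected = False
except ValueError:
    rejected = True
```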

This is a good fallback option if we can't implement anything better before 27.0 is released. It allows us to release editions in a reasonable state, where we can fix these issues and release a more functional DELIMITED feature in 2024.

Pros

Unblocks editions rollout
Easy and safe to implement
Avoids rushed implementation of a proper fix
Avoids runtime issues with text format
Avoids unexpected build breakages post-editions (e.g. renaming the nested message)

Cons

We'd still be releasing a really bad feature. Instead of opening up new possibilities, it's just “like groups but worse”
We couldn't fix this in 2023 without potential version skew from third party plugins. We'd likely have to wait until edition 2024
Might require coordinated changes in a lot of runtimes
Doesn't unblock our effort to roll out delimited

Rename Fields in Editions

While it might be tempting to leverage the edition 2023 upgrade as a place where we can just rename the group field (e.g. rename mygroup to my_group), that doesn't actually work. Because so many runtimes already use the field name in generated APIs, they would break under this transformation.

Pros

Works really well for text-format and some languages

Cons

Turns 2023 upgrade into a breaking change for many languages

Aliases

We've discussed aliases a lot, mostly in the context of Any, but they would be useful for any encoding scheme that locks down field/message names. If we had a fully implemented alias system in place, it would be the perfect mitigation here. Unfortunately, we don't have one yet, and the timeline here is probably too tight to implement one.

Pros

Fixes all of the problems mentioned above
Allows us to specify the old behavior using the proto language, which allows it to be handled by Prototiller

Cons

We want this to be a real fully thought-out feature, not a hack rushed into a tight timeline

Do Nothing

Doing nothing doesn't actually break anyone, but it is embarrassing.

Pros

Easy to do

Cons

Releases a horrible feature full of foot-guns in our first edition
Doesn't unblock our effort to roll out delimited