Authors: @mkruskal-google
Address some unexpected issues in delimited encoding in edition 2023 before its OSS release.
Joshua Humphries reported some well-timed issues discovered while experimenting with our early release of Edition 2023. He discovered that our new message encoding feature piggybacked a bit too much on the old group logic, and actually ended up being virtually useless in general.
None of our testing or migrations caught this because they were heavily focused on preserving old behavior (which is the primary goal of edition 2023). Delimited messages structured exactly like proto2 groups (e.g. message and field in the same scope with matching names) continued to work exactly as before, making it seem like everything was fine.
All of this is especially problematic in light of Submessages: In Pursuit of a More Perfect Encoding (not available externally yet), which intends to migrate the ecosystem to use delimited encoding everywhere. Releasing a semi-broken feature as a migration tool to eliminate a deprecated syntax is one thing, but trying to push the ecosystem to it is especially bad.
The problems here stem from the fact that before edition 2023, the field and type names of group fields were guaranteed to always be unique and intuitive. Proto2 splits groups into a synthetic nested message with a type name equivalent to the group specification (required to be capitalized), and a field name that's fully lowercased. For example,
optional group MyGroup = 1 { ... }
would become:
message MyGroup { ... }
optional MyGroup mygroup = 1;
The casing here is very important, since the transformation is irreversible. We can't recover the group name from the field name in general, only if the group is a single word.
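The irreversibility can be illustrated with a minimal Python sketch. The helper names below are hypothetical and are not part of protoc; they only model the lowercasing rule described above.

```python
def proto2_group_field_name(group_name: str) -> str:
    """Field name proto2 synthesizes for a group: the group name, fully lowercased."""
    return group_name.lower()


def recover_single_word_group_name(field_name: str) -> str:
    """Best-effort inverse: only correct when the group name is one capitalized word."""
    return field_name[:1].upper() + field_name[1:]


# Two different group names collapse onto the same field name.
assert proto2_group_field_name("MyGroup") == "mygroup"
assert proto2_group_field_name("Mygroup") == "mygroup"

# The inverse works for a single-word group name...
assert recover_single_word_group_name(proto2_group_field_name("Group")) == "Group"
# ...but loses the inner capital of a multi-word name.
assert recover_single_word_group_name(proto2_group_field_name("MyGroup")) == "Mygroup"
```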
The problem under edition 2023 is that we've removed the generation of synchronized synthetic messages from the language. Users now explicitly define messages, and any message field can be marked DELIMITED. This means that anyone assuming that the type and field name are synchronized could now be broken.
While using the field name for generated APIs required less special-casing in the generators, the field name ends up producing slightly-less-readable APIs for multi-word camelcased groups. The result is that we see a fairly random-seeming mix in different generators. Using protoc-explorer (not available externally), we find the following:
* This codegen difference was caught during the implementation and intentionally “fixed” in Edition 2023.
** This includes all upb-based runtimes as well (e.g. Ruby, Rust, etc.)
† Extensions use field
In the Dart V1 implementation, we decided to intentionally introduce a behavior change on editions upgrades. It was determined that this only affected a handful of protos in google3, and could probably be manually fixed as-needed. Java's handling changes the story significantly, since over 50% of protos in google3 produce generated Java code. Objective-C is also noteworthy since we open-source it, and Swift because it's widely used in OSS and we don't own it.
While the editions upgrade is still non-breaking, it means that the generated APIs could have very surprising spellings and may not be unique. For example, using the same type for two delimited fields in the same containing message will create two sets of generated APIs with the same name in some languages!
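As a rough illustration of the collision, the following Python sketch names accessors either by message type or by field name. The `get_` prefix and both helper functions are hypothetical and do not mirror any real generator's output; the message and field names are made up for the example.

```python
from collections import Counter

# (field name, message type name) pairs for two delimited fields that share
# one message type inside the same containing message.
fields = [("first", "Payload"), ("second", "Payload")]


def accessor_names(fields, use_type_name):
    """Hypothetical accessor naming: 'get_' plus the type name or the field name."""
    return ["get_" + (type_name if use_type_name else field_name)
            for field_name, type_name in fields]


def duplicates(names):
    """Return the names that occur more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Naming by type produces two accessors with the same name.
assert duplicates(accessor_names(fields, use_type_name=True)) == ["get_Payload"]
# Naming by field stays unique, because field names are unique within a message.
assert duplicates(accessor_names(fields, use_type_name=False)) == []
```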
Our “official” draft specification of text-format explicitly states that group messages are encoded by the message name, rather than the lowercased field name. A group MyGroup will be serialized as:
MyGroup { ... }
In C++, we always serialize the message name and have special handling to only accept the message name in parsing. We also have conformance tests locking down the positive path here (i.e. using the message name round-trip). The negative path (i.e. failing to accept the field name) doesn't have a conformance test, but C++/Java/Python all agree and there's no known case that doesn't.
To make things even stranger, for extensions (group fields extending other messages), we always use the field name for groups. So as far as group extensions are concerned, there's no problem for editions.
There are a few problems with non-extension group fields in editions:
Clearly the end-state we want is for the field name to be used in all generated APIs, and for text-format serialization/parsing. The only questions are: how do we get there and can/should we do it in time for the 2023 release in 27.0 next month?
We propose a combination of the alternatives listed below. Smooth Extension seems like the best short-term path forward to unblock the delimited migration. It mostly solves the problem and doesn't require any new features. The necessary changes for this approach have already been prepared, along with new conformance tests to lock down the behavior changes.
Global Feature is a good long-term mitigation for tech debt we're leaving behind with Smooth Extension. Ultimately we would like to remove any labeling of fields by their type, and editions provides a good mechanism to do this. Alternatively, we could implement aliases and use that to unify this old behavior and avoid a new feature. Either of these options will be the next step after the release of 2023, with aliases being preferred as long as the timing works out.
If we hit any unexpected delays, Nerf Delimited Encoding in 2023 (not available externally) is the quickest path forward to unblock the release of edition 2023. It has a lot of downsides though, and will block any migration towards delimited encoding until edition 2024 has started rolling out.
Instead of trying to change the existing behavior, we could expand the current spec to try to cover both proto2 and editions. We would define a “group-like” concept, which applies to all fields which:
have DELIMITED encoding
Note that proto2 groups will always be “group-like.”
For any group-like field we will use the old proto2 semantics, whatever they are today. Otherwise, we will treat them as regular fields for both codegen and text-format. This means that most new cases of delimited encoding will have the desired behavior, while all old groups will continue to function. The main exception here is that users will see the unexpected proto2 behavior if they have message/field names that happen to match.
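A minimal Python sketch of a “group-like” check follows. Only the DELIMITED criterion is stated explicitly in this text; the other two conditions (message declared in the same scope as the field, field name equal to the lowercased message name) are assumptions taken from the earlier description of how proto2 groups are structured. The `Field` type and its attributes are hypothetical, not a real descriptor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str            # field name as written in the .proto file
    message_name: str    # short name of the field's message type
    message_scope: str   # scope in which the message type is declared
    field_scope: str     # scope in which the field is declared
    delimited: bool      # True if the field uses DELIMITED encoding


def is_group_like(field: Field) -> bool:
    """Assumed criteria: delimited, same scope, name == lowercased type name."""
    return (
        field.delimited
        and field.message_scope == field.field_scope
        and field.name == field.message_name.lower()
    )


# Shaped exactly like a proto2 group: keeps the old proto2 semantics.
legacy = Field("mygroup", "MyGroup", "pkg.Outer", "pkg.Outer", True)
# Delimited, but the field name does not match: treated as a regular field.
renamed = Field("my_group", "MyGroup", "pkg.Outer", "pkg.Outer", True)

assert is_group_like(legacy)
assert not is_group_like(renamed)
```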
While the old behavior will result in some unexpected capitalization when it's hit, it's mostly safe. Because of 2 and 3 (and the fact that we disallow duplicate field names), we can guarantee that in both codegen and text encoding there will never be any conflicting symbols. There can never be two delimited fields of the same type using the old behavior, and no other messages or fields will exist with either spelling.
Additionally, we will update the text parsers to accept both the old message-based spelling and the new field-based spelling for group-like fields. This will at least prevent parsing failures if users hit this unexpected change in behavior.
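The resulting text-format naming rule can be sketched in Python as below. The function is hypothetical and ignores extensions, which (as noted earlier) already use the field name; it assumes the caller has already decided whether the field is group-like.

```python
def text_format_names(field_name, message_name, group_like):
    """Return (name written when serializing, set of names accepted when parsing)."""
    if group_like:
        # Old proto2 spelling on output, both spellings accepted on input.
        return message_name, {message_name, field_name}
    # Regular fields, including non-group-like delimited fields, use the field name only.
    return field_name, {field_name}


written, accepted = text_format_names("mygroup", "MyGroup", group_like=True)
assert written == "MyGroup"
assert accepted == {"MyGroup", "mygroup"}

written, accepted = text_format_names("my_group", "MyGroup", group_like=False)
assert written == "my_group"
assert accepted == {"my_group"}
```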
The simplest answer here is to introduce a new global message feature legacy_group_handling to control all the changes we'd like. This will only be applicable to group-like fields (see Smooth Extension). With this feature enabled, these fields will always use their message name for text-format. Each non-conformant language could also use this feature to gate the codegen rules.
An extension of Global feature would be to split the codegen changes out into separate per-language features.
A quick fix to avoid releasing a bad feature would be to simply ban the case where the message and field names don't match. Adding this validation to protoc would cover the majority of cases, although we might want additional checks in every language that supports dynamic messages.
This is a good fallback option if we can't implement anything better before 27.0 is released. It allows us to release editions in a reasonable state, where we can fix these issues and release a more functional DELIMITED feature in 2024.
While it might be tempting to leverage the edition 2023 upgrade as a place we can just rename the group field (e.g. rename mygroup to my_group), that doesn't actually work. Because so many runtimes already use the field name in generated APIs, they would break under this transformation.
We've discussed aliases a lot, mostly in the context of Any, but they would be useful for any encoding scheme that locks down field/message names. If we had a fully implemented alias system in place, it would be the perfect mitigation here. Unfortunately, we don't yet, and the timeline here is probably too tight to implement one.
Doing nothing doesn't actually break anyone, but it is embarrassing.