Protobuf Tip #7: Scoping It Out

You’d need a very specialized electron microscope to get down to the level to actually see a single strand of DNA. – Craig Venter

TL;DR: buf convert is a powerful tool for examining wire format dumps: it converts them to JSON, so you can use existing JSON analysis tooling. protoscope can be used for lower-level analysis, such as debugging messages that have been corrupted.

I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think they fit with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.

JSON from Protobuf?

JSON’s human-readable syntax is a big reason why it’s so popular, possibly second only to built-in support in browsers and many languages. It’s easy to examine any JSON document using tools like online prettifiers and the inimitable jq.

But Protobuf is a binary format! This means that you can’t easily use jq-like tools with it…or can you?

Transcoding with buf convert

The Buf CLI offers a utility for transcoding messages between the three Protobuf encoding formats: the wire format, JSON, and textproto; it also supports YAML. This is buf convert, and it’s very powerful.

To perform a conversion, we need four inputs:

  1. A Protobuf source to get types out of. This can be a local .proto file, an encoded FileDescriptorSet, or a remote BSR module.
    • If not provided, and buf convert is run in a directory within a local Buf module, that module will be used as the Protobuf type source.
  2. The name of the top-level type for the message we want to transcode, via the --type flag.
  3. The input message, via the --from flag.
  4. A location to output to, via the --to flag.

buf convert supports input and output redirection, making it usable as part of a shell pipeline. For example, consider the following Protobuf code in our local Buf module:

// my_api.proto
syntax = "proto3";
package my.api.v1;

message Cart {
  int32 user_id = 1;
  repeated Order orders = 2;
}

message Order {
  fixed64 sku = 1;
  string sku_name = 2;
  int64 count = 3;
}

Then, let’s say we’ve dumped a message of type my.api.v1.Cart from a service to debug it. And let’s say…well—you can’t just cat it.

$ cat dump.pb | xxd -ps
08a946121b097ac8e80400000000120e76616375756d20636c65616e6572
18011220096709b519000000001213686570612066696c7465722c203220
7061636b1806122c093aa8188900000000121f69736f70726f70796c2061
6c636f686f6c203730252c20312067616c6c6f6e1802

However, we can use buf convert to turn it into some nice JSON. We can then pipe it into jq to format it.

$ buf convert --type my.api.v1.Cart --from dump.pb --to -#format=json | jq
{
  "userId": 9001,
  "orders": [
    {
      "sku": "82364538",
      "skuName": "vacuum cleaner",
      "count": "1"
    },
    {
      "sku": "431294823",
      "skuName": "hepa filter, 2 pack",
      "count": "6"
    },
    {
	    "sku": "2300094522",
      "skuName": "isopropyl alcohol 70%, 1 gallon",
      "count": "2"
    }
  ]
}

Now you have the full expressivity of jq at your disposal. For example, we could pull out the user ID for the cart:

$ function buf-jq() { buf convert --type "$1" --from "$2" --to -#format=json | jq "$3"; }
$ buf-jq my.api.v1.Cart dump.pb '.userId'
9001

Or we can extract all of the SKUs that appear in the cart:

$ buf-jq my.api.v1.Cart dump.pb '[.orders[].sku]'
[
  "82364538",
  "431294823",
  "2300094522"
]

Or we could try calculating how many items are in the cart, total:

$ buf-jq my.api.v1.Cart dump.pb '[.orders[].count] | add'
"162"

Wait. That’s wrong. The answer should be 9. This illustrates one pitfall to keep in mind when using jq with Protobuf: Protobuf will sometimes serialize numbers as quoted strings (the C++ reference implementation only does this when they’re integers outside of the IEEE 754 representable range, but Go is somewhat lazier, and does it for all 64-bit values).

You can test whether an int64 x is in the representable float range with this very simple check: int64(float64(x)) == x. See https://go.dev/play/p/T81SbbFg3br. The equivalent check in C++ is much more complicated.
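
In Go, that check might look like this (a self-contained sketch; the helper name is mine):

package main

import "fmt"

// representable reports whether x survives a round trip through float64,
// i.e. whether a double can hold x exactly.
func representable(x int64) bool {
  return int64(float64(x)) == x
}

func main() {
  fmt.Println(representable(9001))      // true
  fmt.Println(representable(1 << 53))   // true: 2^53 fits in a float64
  fmt.Println(representable(1<<53 + 1)) // false: rounds to 2^53
}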

This means we need to use the tonumber conversion function:

$ buf-jq my.api.v1.Cart dump.pb '[.orders[].count | tonumber] | add'
9

jq’s whole deal is JSON, so it brings with it all of JSON’s pitfalls. This is notable for Protobuf when trying to do arithmetic on 64-bit values. As we saw above, Protobuf serializes integers outside of the 64-bit float representable range as strings (and in some runtimes, some integers inside it, too).

For example, if you have a repeated int64 field that you want to sum over, the result may be incorrect due to floating-point rounding. For notes on conversions in jq, see https://jqlang.org/manual/#identity.
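
To see the failure mode concretely, here’s a small Go sketch that mimics jq’s everything-is-a-double arithmetic (the values are chosen to sit at the edge of the representable range):

package main

import "fmt"

func main() {
  // Two values whose exact sum is 2^53 + 2.
  vals := []int64{1<<53 + 1, 1}

  var exact int64
  var viaDouble float64 // jq-style arithmetic: every number is a double
  for _, v := range vals {
    exact += v
    viaDouble += float64(v)
  }

  fmt.Println(exact)            // 9007199254740994
  fmt.Println(int64(viaDouble)) // 9007199254740992: silently off by two
}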

Disassembling with protoscope

protoscope is a tool provided by the Protobuf team (which I originally wrote!) for decoding arbitrary data as if it were encoded in the Protobuf wire format. This process is called disassembly. It’s designed to work without a schema available, although it doesn’t produce especially clean output.

$ go install github.com/protocolbuffers/protoscope/cmd/protoscope...@latest
$ protoscope dump.pb
1: 9001
2: {
  1: 82364538i64
  2: {"vacuum cleaner"}
  3: 1
}
2: {
  1: 431294823i64
  2: {
    13: 101
    14: 97
    4: 102
    13: 1.3518748403899336e-153   # 0x2032202c7265746ci64
    14: 97
    12:SGROUP
    13:SGROUP
  }
  3: 6
}
2: {
  1: 2300094522i64
  2: {"isopropyl alcohol 70%, 1 gallon"}
  3: 2
}

The field names are gone; only field numbers are shown. This example also reveals an especially glaring limitation of protoscope: it can’t tell the difference between string and message fields, so it guesses according to some heuristics. It managed to grok the first and third orders’ names as strings, but for orders[1].sku_name, it incorrectly guessed that it was a message and produced garbage.

The tradeoff is that not only does protoscope not need a schema, it also tolerates almost any error, making it possible to analyze messages that have been partly corrupted. If we flip a random bit somewhere in orders[0], disassembling the message still succeeds:

$ protoscope dump.pb
1: 9001
2: {`0f7ac8e80400000000120e76616375756d20636c65616e65721801`}
2: {
  1: 431294823i64
  2: {
    13: 101
    14: 97
    4: 102
    13: 1.3518748403899336e-153   # 0x2032202c7265746ci64
    14: 97
    12:SGROUP
    13:SGROUP
  }
  3: 6
}
2: {
  1: 2300094522i64
  2: {"isopropyl alcohol 70%, 1 gallon"}
  3: 2
}

Although protoscope did give up on disassembling the corrupted submessage, it still made it through the rest of the dump.

As with buf convert, we can give protoscope a FileDescriptorSet to make its heuristic a little smarter.

$ protoscope \
  --descriptor-set <(buf build -o -) \
  --message-type my.api.v1.Cart \
  --print-field-names \
  dump.pb
1: 9001                   # user_id
2: {                      # orders
  1: 82364538i64          # sku
  2: {"vacuum cleaner"}   # sku_name
  3: 1                    # count
}
2: {                          # orders
  1: 431294823i64             # sku
  2: {"hepa filter, 2 pack"}  # sku_name
  3: 6                        # count
}
2: {                                      # orders
  1: 2300094522i64                        # sku
  2: {"isopropyl alcohol 70%, 1 gallon"}  # sku_name
  3: 2                                    # count
}

Not only is the second order decoded correctly now, but protoscope shows the name of each field (via --print-field-names). In this mode, protoscope still decodes partially-valid messages.

protoscope also provides a number of other flags for customizing its heuristic in the absence of a FileDescriptorSet. This enables it to be used as a forensic tool for debugging messy data corruption bugs.

Protobuf Tip #6: The Subtle Dangers of Enum Aliases

I’ve been very fortunate to dodge a nickname throughout my entire career. I’ve never had one. – Jimmie Johnson

TL;DR: Enum values can have aliases. This feature is poorly designed and shouldn’t be used. The ENUM_NO_ALLOW_ALIAS Buf lint rule prevents you from using them by default.

I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think they fit with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.

Confusion and Breakage

Protobuf permits multiple enum values to have the same number. Such enum values are said to be aliases of each other. Protobuf used to allow this by default, but now you have to set a special option, allow_alias, to keep the compiler from rejecting it.

This can be used to effectively rename values without breaking existing code:

package myapi.v1;

enum MyEnum {
  option allow_alias = true;
  MY_ENUM_UNSPECIFIED = 0;
  MY_ENUM_BAD = 1 [deprecated = true];
  MY_ENUM_MORE_SPECIFIC = 1;
}

This works perfectly fine, and is fully wire-compatible! And unlike renaming a field (see TotW #1), it won’t result in source code breakages.

But if you use either reflection or JSON, or a runtime like Java that doesn’t cleanly allow enums with multiple names, you’ll be in for a nasty surprise.

For example, if you request an enum value from an enum using reflection, such as with protoreflect.EnumValueDescriptors.ByNumber(), the value you’ll get is the one that appears first in the file, lexically. In fact, both myapipb.MyEnum_MY_ENUM_BAD.String() and myapipb.MyEnum_MY_ENUM_MORE_SPECIFIC.String() return the same value, leading to potential confusion, as the old “bad” value will be used in printed output like logs.
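
Here’s a short Go sketch of that behavior against the enum above; myapipb stands in for the hypothetical generated package:

package main

import (
  "fmt"

  myapipb "example.com/gen/myapi/v1" // hypothetical generated package
)

func main() {
  // ByNumber returns the value that appears first in the file for
  // number 1, which is the deprecated MY_ENUM_BAD.
  values := myapipb.MyEnum(0).Descriptor().Values()
  fmt.Println(values.ByNumber(1).Name()) // MY_ENUM_BAD

  // Both generated constants stringify to the same, old name.
  fmt.Println(myapipb.MyEnum_MY_ENUM_BAD.String())           // MY_ENUM_BAD
  fmt.Println(myapipb.MyEnum_MY_ENUM_MORE_SPECIFIC.String()) // MY_ENUM_BAD
}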

You might think, “oh, I’ll switch the order of the aliases”. But that would be an actual wire format break. Not for the binary format, but for JSON. That’s because JSON preferentially stringifies enum values by using their declared name (if the value is in range). So, reordering the values means that what once serialized as {"my_field": "MY_ENUM_BAD"} now serializes as {"my_field": "MY_ENUM_MORE_SPECIFIC"}.

If an old binary that hasn’t had the new enum value added sees this JSON document, it won’t parse correctly, and you’ll be in for a bad time.

You can argue that this is a language bug, and it kind of is. Protobuf should include an equivalent of json_name for enum values, or mandate that JSON should serialize enum values with multiple names as a number, rather than an arbitrarily chosen enum name. The feature is intended to allow renaming of enum values, but unfortunately Protobuf hobbled it enough that it’s pretty dangerous.

What To Do

Instead, if you really need to rename an enum value for usability or compliance reasons (ideally, not just aesthetics), you’re better off making a new enum type in a new version of your API. As long as the enum value numbers are the same, it’ll be binary-compatible, and it somewhat reduces the risk of the above JSON confusion.

Buf provides a lint rule against this feature, ENUM_NO_ALLOW_ALIAS, and Protobuf requires that you specify a magic option to enable this behavior, so in practice you don’t need to worry about this. But remember, the consequences of enum aliases go much further than JSON—they affect anything that uses reflection. So even if you don’t use JSON, you can still get burned.

Protobuf Tip #5: Avoid import public/weak

My dad had a guitar but it was acoustic, so I smashed a mirror and glued broken glass to it to make it look more metal. It looked ridiculous! – Max Cavalera

TL;DR: Avoid import public and import weak. The Buf lint rules IMPORT_NO_PUBLIC and IMPORT_NO_WEAK enforce this for you by default.

I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think they fit with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.

Protobuf imports allow you to specify two special modes: import public and import weak. The Buf CLI lints against these by default, but you might be tempted to try using them anyway, especially because some GCP APIs use import public. What are these modes, and why do they exist?

Import Visibility

Protobuf imports are by file path, a fact that is very strongly baked into the language and its reflection model.

import "my/other/api.proto";

Importing a file dumps all of its symbols into the current file. For the purposes of name resolution, it’s as if all of the declarations in that file have been pasted into the current file. However, this isn’t transitive. If:

  • a.proto imports b.proto
  • and b.proto imports c.proto
  • and c.proto defines foo.Bar
  • then, a.proto must import c.proto to refer to foo.Bar, even though b.proto imports it.

This is similar to how importing a package as . works in Go. When you write import . "strings", it dumps all of the declarations from the strings package into the current file, but not those of any files that "strings" imports.
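
For the Go-minded, a toy sketch of that analogy (nothing Protobuf-specific here):

package main

import (
  "fmt"
  . "strings" // dot import: strings' exported names land in this file's scope
)

func main() {
  fmt.Println(ToUpper("protobuf")) // ToUpper comes from the dot import
  // unicode.IsUpper is NOT visible here, even though strings imports unicode.
}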

Now, what’s nice about Go is that packages can be broken up into files in a way that is transparent to users; users of a package import the package, not the files of that package. Unfortunately, Protobuf is not like that, so the file structure of a package leaks to its callers.

import public was intended as a mechanism for allowing API writers to break up files that were getting out of control. You can define a new file new.proto for some of the definitions in big.proto, move them to the new file, and then add import public "new.proto"; to big.proto. Existing imports of big.proto won’t be broken, hooray!

Except this feature was designed for C++. In C++, each .proto file maps to a .proto.h header, which you #include in your application code. In C++, #include behaves like import public, so marking an import as public only changes name resolution in Protobuf—the C++ backend doesn’t have to do anything to maintain source compatibility when an import is changed to public.

But other backends, like Go, do not work this way: import in Go doesn’t pull in symbols transitively, so Go would need to explicitly add aliases for all of the symbols that come in through a public import. That is, if you had:

// foo.proto
package myapi.v1;
message Foo { ... }

// bar.proto
package myotherapi.v1;
import public "foo.proto";

Then the Go backend has to generate a type Foo = foopb.Foo in bar.pb.go to emulate this behavior (in fact, I was surprised to learn Go Protobuf implements this at all). Go happens to implement public imports correctly, but not all backends are as careful, because this feature is obscure.
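
That generated alias might look something like this (a simplified sketch with made-up import paths, not the actual protoc-gen-go output):

// Inside the generated bar.pb.go.
package myotherapipb

import foopb "example.com/gen/myapi/v1" // made-up path for foo.proto's generated package

// Foo is re-exported so that importers of bar.pb.go can see the symbols
// that bar.proto pulls in via import public "foo.proto".
type Foo = foopb.Foo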

The spanner.proto example of an import public isn’t even used for breaking up an existing file; instead, it’s used to avoid making a huge file bigger while sparing callers from adding an additional import. This is a bad use of a bad feature!

Using import public to effectively “hide” imports makes it harder to understand what a .proto file is pulling in. If Protobuf imports were at the package/symbol level, like Go or Java, this feature would not need to exist. Unfortunately, Protobuf is closely tailored for C++, and this is one of the consequences.

Instead of using import public to break up a file, simply plan to break up the file in the next version of the API.

The IMPORT_NO_PUBLIC Buf lint rule enforces that no one uses this feature by default. It’s tempting, but the footguns aren’t worth it.

Weak Imports

Public imports have a good, if flawed, reason to exist. Their implementation details are the main thing that kneecaps them.

Weak imports, however, simply should not exist. They were added to the language to make it easier for some of Google’s enormous binaries to avoid running out of linker memory, by making it so that message types could be dropped if they weren’t accessed. This means that weak imports are “optional”—if the corresponding descriptors are missing at runtime, the C++ runtime can handle it gracefully.

This leads to all kinds of implementation complexity and subtle behavior differences across runtimes. Most runtimes implement (or implemented, in the case of those that removed support) import weak in a buggy or inconsistent way. It’s unlikely the feature will ever be truly removed, even though Google has tried.

Don’t use import weak. It should be treated as completely non-functional. The IMPORT_NO_WEAK Buf lint rule takes care of this for you.