Go Size Semantics

Explains how (not) to use proto.Size

The proto.Size function returns the size in bytes of the wire-format encoding of a proto.Message by traversing all its fields (including submessages).

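In particular, it returns the size of the encoding that Go Protobuf would produce for the message, which is not necessarily the size of the bytes the message was originally decoded from.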

Typical usages

Identifying empty messages

Checking if proto.Size returns 0 is an easy way to recognize empty messages:

if proto.Size(m) == 0 {
    // No fields set (or, in proto3, all fields matching the default);
    // skip processing this message, or return an error, or similar.
}

Size-limiting program output

Let’s say you’re writing a batch processing pipeline which produces work tasks for another system that we’ll call “downstream system” in this example. The downstream system is provisioned for handling small to medium-sized tasks, but load testing has shown that the system runs into a cascading failure when presented with a work task of over 500 MB.

The best fix is to add protection to the downstream system (see https://cloud.google.com/blog/products/gcp/using-load-shedding-to-survive-a-success-disaster-cre-life-lessons), but when implementing load-shedding is infeasible, you could decide to add a quick fix to your pipeline:

func (*beamFn) ProcessElement(key string, value []byte, emit func(proto.Message)) {
  task := produceWorkTask(value)
  if proto.Size(task) > 500 * 1024 * 1024 {
    // Skip every work task over 500 MB to not overwhelm
    // the brittle downstream system.
    return
  }
  emit(task)
}

Incorrect usage: no relation to Unmarshal

Because proto.Size returns the number of bytes for how Go Protobuf will encode the message, it is not safe to use proto.Size when unmarshaling (decoding) a stream of incoming Protobuf messages:

func bytesToSubscriptionList(data []byte) ([]*vpb.EventSubscription, error) {
    subList := []*vpb.EventSubscription{}
    for len(data) > 0 {
        subscription := &vpb.EventSubscription{}
        if err := proto.Unmarshal(data, subscription); err != nil {
            return nil, err
        }
        subList = append(subList, subscription)
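        // Incorrect: proto.Size reports the size Go Protobuf would encode,
        // not the number of input bytes this message occupied in data.
        data = data[:len(data)-proto.Size(subscription)]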
    }
    return subList, nil
}

When data contains a message in non-minimal wire format, proto.Size can return a different size than was actually unmarshaled, resulting in a parsing error in the best case or incorrectly parsed data in the worst case.

Hence, this example only works reliably as long as all input messages are generated by (the same version of) Go Protobuf. That restriction is surprising and likely not intended.

Tip: Use the protodelim package instead to read/write size-delimited streams of Protobuf messages.

Advanced usage: pre-sizing buffers

An advanced usage of proto.Size is to determine the required size for a buffer before marshaling:

opts := proto.MarshalOptions{
    // Possibly avoid an extra proto.Size in Marshal itself (see docs):
    UseCachedSize: true,
}
// DO NOT SUBMIT without implementing this optimization opportunity:
// instead of allocating, grab a sufficiently-sized buffer from a pool.
// Knowing the size of the buffer means we can discard
// outliers from the pool to prevent uncontrolled
// memory growth in long-running RPC services.
buf := make([]byte, 0, opts.Size(m))
var err error
buf, err = opts.MarshalAppend(buf, m) // does not allocate
// Note that len(buf) might be less than cap(buf)! Read below:

Note that when lazy decoding is enabled, proto.Size might return more bytes than proto.Marshal (and variants like proto.MarshalAppend) will write! So when you are placing encoded bytes on the wire (or on disk), be sure to work with len(buf) and discard any previous proto.Size results.

Specifically, a (sub-)message can “shrink” between proto.Size and proto.Marshal when:

  1. Lazy decoding is enabled
  2. and the message arrived in non-minimal wire format
  3. and the message is not accessed before proto.Size is called, meaning it is not decoded yet
  4. and the message is accessed after proto.Size (but before proto.Marshal), causing it to be lazily decoded

Once the message has been decoded, any subsequent proto.Marshal call encodes it from scratch (as opposed to merely copying its original wire format). This implicitly normalizes the message to how Go encodes messages, which is currently the minimal wire format (but don’t rely on that!).

As you can see, the scenario is rather specific, but nevertheless it is best practice to treat proto.Size results as an upper bound and never assume that the result matches the actually encoded message size.
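The buffer pool hinted at in the pre-sizing snippet could look roughly like the following sketch. It uses only the standard library; the names bufPool, getBuf, putBuf and the 1 MiB cut-off maxPooledCap are assumptions made for illustration, and the size hint and the appended bytes stand in for opts.Size(m) and opts.MarshalAppend(buf, m).

```go
package main

import (
	"fmt"
	"sync"
)

// maxPooledCap is a hypothetical cut-off: buffers with a larger capacity are
// treated as outliers and left to the garbage collector.
const maxPooledCap = 1 << 20 // 1 MiB

// bufPool stores *[]byte to avoid an extra allocation on every Put.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, 4096)
		return &b
	},
}

// getBuf returns an empty buffer whose capacity is at least size bytes.
func getBuf(size int) *[]byte {
	bp := bufPool.Get().(*[]byte)
	if cap(*bp) < size {
		*bp = make([]byte, 0, size)
	}
	*bp = (*bp)[:0]
	return bp
}

// putBuf hands the buffer back to the pool unless it is an outlier.
// It reports whether the buffer was kept.
func putBuf(bp *[]byte) bool {
	if cap(*bp) > maxPooledCap {
		return false
	}
	bufPool.Put(bp)
	return true
}

func main() {
	sizeHint := 64 // stands in for opts.Size(m); treat it as an upper bound
	bp := getBuf(sizeHint)

	// Stands in for: *bp, err = opts.MarshalAppend(*bp, m)
	*bp = append(*bp, "encoded message bytes"...)

	// Always work with len(*bp), never with the earlier size hint.
	fmt.Printf("wrote %d bytes into a buffer of capacity %d\n", len(*bp), cap(*bp))
	putBuf(bp)
}
```

Because the size is known before marshaling, the decision to keep or drop a buffer can be made on its capacity alone, which keeps the pool from retaining a few very large buffers indefinitely.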

Background: Non-minimal wire format

When encoding Protobuf messages, there is one minimal wire format encoding and any number of larger, non-minimal encodings that decode to the same message.

Non-minimal wire format (sometimes also called “denormalized wire format”) refers to scenarios like non-repeated fields appearing multiple times, non-optimal varint encoding, packed repeated fields that appear non-packed on the wire, and others.

We can encounter non-minimal wire format in different scenarios:

  • Intentionally. Protobuf supports concatenating messages by concatenating their wire format.
  • Accidentally. A (possibly third-party) Protobuf encoder does not encode ideally (e.g. uses more space than necessary when encoding a varint).
  • Maliciously. An attacker could craft Protobuf messages specifically to trigger crashes over the network.
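To make these scenarios concrete, the following standard-library sketch hand-decodes three byte sequences for a hypothetical message with a single varint field number 1 (for example "int64 v = 1;"). The helpers decodeField1 and minimalEncode are illustrative only and are not part of Go Protobuf; minimalEncode always emits the field and ignores proto3 default elision for simplicity.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// field1Tag is the wire-format tag for field number 1, wire type 0 (varint).
const field1Tag = 0x08

// decodeField1 parses a message that consists only of occurrences of field 1.
// For a non-repeated field, the last occurrence on the wire wins.
func decodeField1(data []byte) (uint64, error) {
	var v uint64
	for len(data) > 0 {
		tag, n := binary.Uvarint(data)
		if n <= 0 || tag != field1Tag {
			return 0, errors.New("unsupported or malformed tag")
		}
		data = data[n:]
		val, n := binary.Uvarint(data)
		if n <= 0 {
			return 0, errors.New("malformed varint")
		}
		data = data[n:]
		v = val
	}
	return v, nil
}

// minimalEncode encodes field 1 with value v using the shortest varint.
func minimalEncode(v uint64) []byte {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], v)
	return append([]byte{field1Tag}, buf[:n]...)
}

func main() {
	inputs := [][]byte{
		{0x08, 0x01},             // minimal encoding of v = 1
		{0x08, 0x81, 0x00},       // non-optimal varint: 1 padded to two bytes
		{0x08, 0x07, 0x08, 0x01}, // field appears twice (concatenated messages)
	}
	for _, in := range inputs {
		v, err := decodeField1(in)
		if err != nil {
			panic(err)
		}
		fmt.Printf("% x: value=%d, received %d bytes, minimal re-encoding %d bytes\n",
			in, v, len(in), len(minimalEncode(v)))
	}
}
```

All three inputs decode to the same value, yet they occupy 2, 3 and 4 bytes on the wire while the minimal re-encoding is 2 bytes. This gap between received bytes and re-encoded size is what makes proto.Size unsuitable for measuring how much input a message consumed.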