Go Size Semantics
The proto.Size function returns the size in bytes of the wire-format encoding of a proto.Message by traversing all its fields (including submessages).
In particular, it returns the size of how Go Protobuf will encode the message.
Typical usages
Identifying empty messages
Checking if proto.Size returns 0 is an easy way to recognize empty messages:
if proto.Size(m) == 0 {
	// No fields set (or, in proto3, all fields matching the default);
	// skip processing this message, or return an error, or similar.
}
Size-limiting program output
Let’s say you’re writing a batch processing pipeline which produces work tasks for another system that we’ll call “downstream system” in this example. The downstream system is provisioned for handling small to medium-sized tasks, but load testing has shown that the system runs into a cascading failure when presented with a work task of over 500 MB.
The best fix is to add protection to the downstream system (see https://cloud.google.com/blog/products/gcp/using-load-shedding-to-survive-a-success-disaster-cre-life-lessons), but when implementing load-shedding is infeasible, you could decide to add a quick fix to your pipeline:
func (*beamFn) ProcessElement(key string, value []byte, emit func(proto.Message)) {
	task := produceWorkTask(value)
	if proto.Size(task) > 500*1000*1000 {
		// Skip every work task over 500 MB to not overwhelm
		// the brittle downstream system.
		return
	}
	emit(task)
}
Incorrect usage: no relation to Unmarshal
Because proto.Size returns the number of bytes for how Go Protobuf will encode the message, it is not safe to use proto.Size when unmarshaling (decoding) a stream of incoming Protobuf messages:
func bytesToSubscriptionList(data []byte) ([]*vpb.EventSubscription, error) {
	subList := []*vpb.EventSubscription{}
	for len(data) > 0 {
		subscription := &vpb.EventSubscription{}
		if err := proto.Unmarshal(data, subscription); err != nil {
			return nil, err
		}
		subList = append(subList, subscription)
		// Incorrect: proto.Size reports how Go Protobuf would re-encode the
		// message, not how many input bytes Unmarshal actually consumed.
		data = data[:len(data)-proto.Size(subscription)]
	}
	return subList, nil
}
When data contains a message in non-minimal wire format, proto.Size can return a different size than was actually unmarshaled, resulting in a parsing error (best case) or incorrectly parsed data in the worst case.
Hence, this example only works reliably as long as all input messages are generated by (the same version of) Go Protobuf. This is surprising and likely not intended.
Tip: Use the protodelim package instead to read/write size-delimited streams of Protobuf messages.
Advanced usage: pre-sizing buffers
An advanced usage of proto.Size is to determine the required size for a buffer before marshaling:
opts := proto.MarshalOptions{
	// Possibly avoid an extra proto.Size in Marshal itself (see docs):
	UseCachedSize: true,
}
// DO NOT SUBMIT without implementing this Optimization opportunity:
// instead of allocating, grab a sufficiently-sized buffer from a pool.
// Knowing the size of the buffer means we can discard
// outliers from the pool to prevent uncontrolled
// memory growth in long-running RPC services.
buf := make([]byte, 0, opts.Size(m))
var err error
buf, err = opts.MarshalAppend(buf, m) // does not allocate
// Note that len(buf) might be less than cap(buf)! Read below:
Note that when lazy decoding is enabled, proto.Size might return more bytes than proto.Marshal (and variants like proto.MarshalOptions.MarshalAppend) will write! So when you are placing encoded bytes on the wire (or on disk), be sure to work with len(buf) and discard any previous proto.Size results.
Specifically, a (sub-)message can “shrink” between proto.Size and proto.Marshal when:
- Lazy decoding is enabled
- and the message arrived in non-minimal wire format
- and the message is not accessed before proto.Size is called, meaning it is not decoded yet
- and the message is accessed after proto.Size (but before proto.Marshal), causing it to be lazily decoded
The decoding results in any subsequent proto.Marshal calls encoding the message (as opposed to merely copying its wire format), which results in implicit normalization to how Go encodes messages, which is currently in minimal wire format (but don’t rely on that!).
As you can see, the scenario is rather specific, but nevertheless it is best practice to treat proto.Size results as an upper bound and never assume that the result matches the actually encoded message size.
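The pooling idea mentioned in the code comment of the pre-sizing example could be sketched roughly as follows, using only the standard library. The names getBuf, putBuf, and maxPooledCap are hypothetical, and the 1 MiB cutoff is an arbitrary assumption. In the usage part, sizeHint stands in for opts.Size(m) and the append call stands in for opts.MarshalAppend(buf, m).

```go
package main

import (
	"fmt"
	"sync"
)

// maxPooledCap is a hypothetical cutoff: larger buffers are treated as
// outliers and are not returned to the pool.
const maxPooledCap = 1 << 20 // 1 MiB

var bufPool = sync.Pool{
	New: func() any { return new([]byte) },
}

// getBuf returns an empty slice whose capacity is at least sizeHint.
func getBuf(sizeHint int) []byte {
	bp := bufPool.Get().(*[]byte)
	if cap(*bp) < sizeHint {
		// The pooled buffer is too small; allocate a fitting one instead.
		return make([]byte, 0, sizeHint)
	}
	return (*bp)[:0]
}

// putBuf returns buf to the pool unless it is an outlier.
func putBuf(buf []byte) {
	if cap(buf) > maxPooledCap {
		return // let the garbage collector reclaim oversized buffers
	}
	buf = buf[:0]
	bufPool.Put(&buf)
}

func main() {
	sizeHint := 64 // stand-in for opts.Size(m): treat it as an upper bound
	buf := getBuf(sizeHint)
	buf = append(buf, "encoded message"...) // stand-in for MarshalAppend

	// Always send len(buf) bytes, never sizeHint bytes.
	fmt.Printf("hint=%d written=%d\n", sizeHint, len(buf))
	putBuf(buf)
}
```

The size hint only decides which buffer to pick; the number of bytes actually written is always taken from len(buf), in line with the upper-bound advice above.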
Background: Non-minimal wire format
When encoding Protobuf messages, there is one minimal wire format size and a number of larger non-minimal wire formats that decode to the same message.
Non-minimal wire format (also called “denormalized wire format” sometimes) refers to scenarios like non-repeated fields appearing multiple times, non-optimal varint encoding, packed repeated fields that appear non-packed on the wire and others.
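As a small illustration of non-optimal varint encoding, the following sketch uses Go's encoding/binary package, whose varint format matches the base-128 varints used on the Protobuf wire. The helper name normalizeVarint is hypothetical. The bytes 0x81 0x00 decode to the value 1, but a minimal encoder needs only one byte for that value.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// normalizeVarint decodes one varint from wire and re-encodes it minimally.
// It reports the decoded value, the number of input bytes consumed, and the
// minimal encoding.
func normalizeVarint(wire []byte) (value uint64, consumed int, minimal []byte, err error) {
	value, consumed = binary.Uvarint(wire)
	if consumed <= 0 {
		return 0, 0, nil, errors.New("truncated or overlong varint")
	}
	minimal = binary.AppendUvarint(nil, value)
	return value, consumed, minimal, nil
}

func main() {
	// 0x81 has the continuation bit set and carries the low bits (1);
	// 0x00 terminates the varint while contributing nothing.
	nonMinimal := []byte{0x81, 0x00}

	v, n, minimal, err := normalizeVarint(nonMinimal)
	if err != nil {
		panic(err)
	}
	fmt.Printf("value=%d consumed=%d re-encoded=%d\n", v, n, len(minimal))
	// prints: value=1 consumed=2 re-encoded=1
}
```

The gap between the bytes consumed and the bytes a minimal encoder would produce is the kind of mismatch that makes a re-encoded size unsuitable for advancing through input.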
We can encounter non-minimal wire format in different scenarios:
- Intentionally. Protobuf supports concatenating messages by concatenating their wire format.
- Accidentally. A (possibly third-party) Protobuf encoder does not encode ideally (e.g. uses more space than necessary when encoding a varint).
- Maliciously. An attacker could craft Protobuf messages specifically to trigger crashes over the network.
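The concatenation case can be sketched with a hand-rolled decoder for a hypothetical schema, message Task { uint64 id = 1; }, written against the standard library only. This is not how Go Protobuf is implemented, and decodeID and encodeID are names invented for this example. It relies on the wire-format rule that for a non-repeated scalar field the last occurrence wins.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// idTag is the wire tag for field number 1 with wire type 0 (varint).
const idTag = 0x08

// encodeID produces the minimal wire format for a Task with the given id.
func encodeID(id uint64) []byte {
	return binary.AppendUvarint([]byte{idTag}, id)
}

// decodeID parses the wire format of the hypothetical Task message.
// If the field appears several times, the last occurrence wins.
func decodeID(wire []byte) (uint64, error) {
	var id uint64
	for len(wire) > 0 {
		tag, n := binary.Uvarint(wire)
		if n <= 0 {
			return 0, errors.New("invalid tag")
		}
		wire = wire[n:]
		if tag != idTag {
			return 0, fmt.Errorf("unsupported tag %d in this sketch", tag)
		}
		v, n := binary.Uvarint(wire)
		if n <= 0 {
			return 0, errors.New("invalid varint value")
		}
		wire = wire[n:]
		id = v
	}
	return id, nil
}

func main() {
	// Concatenating two encoded messages yields one valid, non-minimal message.
	concatenated := append(encodeID(1), encodeID(2)...)

	id, err := decodeID(concatenated)
	if err != nil {
		panic(err)
	}
	fmt.Printf("input=%d bytes, id=%d, re-encoded=%d bytes\n",
		len(concatenated), id, len(encodeID(id)))
}
```

The input occupies four bytes, while re-encoding the decoded message takes two, which mirrors how a size computed from a decoded message can be smaller than the bytes that were actually read.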