Overview

Protocol Buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.

It’s like JSON, except it’s smaller and faster, and it generates native language bindings. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

Protocol buffers are a combination of the definition language (created in .proto files), the code that the proto compiler generates to interface with data, language-specific runtime libraries, the serialization format for data that is written to a file (or sent across a network connection), and the serialized data.

What Problems do Protocol Buffers Solve?

Protocol buffers provide a serialization format for packets of typed, structured data that are up to a few megabytes in size. The format is suitable for both ephemeral network traffic and long-term data storage. Protocol buffers can be extended with new information without invalidating existing data or requiring code to be updated.

Protocol buffers are the most commonly-used data format at Google. They are used extensively in inter-server communications as well as for archival storage of data on disk. Protocol buffer messages and services are described by engineer-authored .proto files. The following shows an example message:

message Person {
  optional string name = 1;
  optional int32 id = 2;
  optional string email = 3;
}

The proto compiler is invoked at build time on .proto files to generate code in various programming languages (covered in Cross-language Compatibility later in this topic) to manipulate the corresponding protocol buffer. Each generated class contains simple accessors for each field and methods to serialize and parse the whole structure to and from raw bytes. The following shows you an example that uses those generated methods:

Person john = Person.newBuilder()
    .setId(1234)
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
output = new FileOutputStream(args[0]);
john.writeTo(output);

Because protocol buffers are used extensively across all manner of services at Google and data within them may persist for some time, maintaining backwards compatibility is crucial. Protocol buffers allow for the seamless support of changes, including the addition of new fields and the deletion of existing fields, to any protocol buffer without breaking existing services. For more on this topic, see Updating Proto Definitions Without Updating Code, later in this topic.

What are the Benefits of Using Protocol Buffers?

Protocol buffers are ideal for any situation in which you need to serialize structured, record-like, typed data in a language-neutral, platform-neutral, extensible manner. They are most often used for defining communications protocols (together with gRPC) and for data storage.

Some of the advantages of using protocol buffers include:

  • Compact data storage
  • Fast parsing
  • Availability in many programming languages
  • Optimized functionality through automatically-generated classes

Cross-language Compatibility

The same messages can be read by code written in any supported programming language. You can have a Java program on one platform capture data from one software system, serialize it based on a .proto definition, and then extract specific values from that serialized data in a separate Python application running on another platform.

The following languages are supported directly in the protocol buffers compiler, protoc:

The following languages are supported by Google, but the projects’ source code resides in GitHub repositories. The protoc compiler uses plugins for these languages:

Additional languages are not directly supported by Google, but rather by other GitHub projects. These languages are covered in Third-Party Add-ons for Protocol Buffers.

Cross-project Support

You can use protocol buffers across projects by defining message types in .proto files that reside outside of a specific project’s code base. If you’re defining message types or enums that you anticipate will be widely used outside of your immediate team, you can put them in their own file with no dependencies.

A couple of examples of proto definitions widely-used within Google are timestamp.proto and status.proto.

Updating Proto Definitions Without Updating Code

It’s standard for software products to be backward compatible, but it is less common for them to be forward compatible. As long as you follow some simple practices when updating .proto definitions, old code will read new messages without issues, ignoring any newly added fields. To the old code, fields that were deleted will have their default value, and deleted repeated fields will be empty. For information on what “repeated” fields are, see Protocol Buffers Definition Syntax later in this topic.

New code will also transparently read old messages. New fields will not be present in old messages; in these cases protocol buffers provide a reasonable default value.

When are Protocol Buffers not a Good Fit?

Protocol buffers do not fit all data. In particular:

  • Protocol buffers tend to assume that entire messages can be loaded into memory at once and are not larger than an object graph. For data that exceeds a few megabytes, consider a different solution; when working with larger data, you may effectively end up with several copies of the data due to serialized copies, which can cause surprising spikes in memory usage.
  • When protocol buffers are serialized, the same data can have many different binary serializations. You cannot compare two messages for equality without fully parsing them.
  • Messages are not compressed. While messages can be zipped or gzipped like any other file, special-purpose compression algorithms like the ones used by JPEG and PNG will produce much smaller files for data of the appropriate type.
  • Protocol buffer messages are less than maximally efficient in both size and speed for many scientific and engineering uses that involve large, multi-dimensional arrays of floating point numbers. For these applications, FITS and similar formats have less overhead.
  • Protocol buffers are not well supported in non-object-oriented languages popular in scientific computing, such as Fortran and IDL.
  • Protocol buffer messages don’t inherently self-describe their data, but they have a fully reflective schema that you can use to implement self-description. That is, you cannot fully interpret one without access to its corresponding .proto file.
  • Protocol buffers are not a formal standard of any organization. This makes them unsuitable for use in environments with legal or other requirements to build on top of standards.

Who Uses Protocol Buffers?

Many projects use protocol buffers, including the following:

How do Protocol Buffers Work?

The following diagram shows how you use protocol buffers to work with your data.

Compilation workflow showing the creation of a proto file, generated code, and compiled classes
Figure 1. Protocol buffers workflow

The code generated by protocol buffers provides utility methods to retrieve data from files and streams, extract individual values from the data, check if data exists, serialize data back to a file or stream, and other useful functions.

The following code samples show you an example of this flow in Java. As shown earlier, this is a .proto definition:

message Person {
  optional string name = 1;
  optional int32 id = 2;
  optional string email = 3;
}

Compiling this .proto file creates a Builder class that you can use to create new instances, as in the following Java code:

Person john = Person.newBuilder()
    .setId(1234)
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
output = new FileOutputStream(args[0]);
john.writeTo(output);

You can then deserialize data using the methods protocol buffers creates in other languages, like C++:

Person john;
fstream input(argv[1], ios::in | ios::binary);
john.ParseFromIstream(&input);
int id = john.id();
std::string name = john.name();
std::string email = john.email();

Protocol Buffers Definition Syntax

When defining .proto files, you can specify that a field is either optional or repeated (proto2 and proto3) or leave it set to the default, implicit presence, in proto3. (The option to set a field to required is absent in proto3 and strongly discouraged in proto2. For more on this, see “Required is Forever” in Specifying Field Rules.)

After setting the optionality/repeatability of a field, you specify the data type. Protocol buffers support the usual primitive data types, such as integers, booleans, and floats. For the full list, see Scalar Value Types.

A field can also be of:

  • A message type, so that you can nest parts of the definition, such as for repeating sets of data.
  • An enum type, so you can specify a set of values to choose from.
  • A oneof type, which you can use when a message has many optional fields and at most one field will be set at the same time.
  • A map type, to add key-value pairs to your definition.

In proto2, messages can allow extensions to define fields outside of the message, itself. For example, the protobuf library’s internal message schema allows extensions for custom, usage-specific options.

For more information about the options available, see the language guide for proto2 or proto3.

After setting optionality and field type, you choose a name for the field. There are some things to keep in mind when setting field names:

  • It can sometimes be difficult, or even impossible, to change field names after they’ve been used in production.
  • Field names cannot contain dashes. For more on field name syntax, see Message and Field Names.
  • Use pluralized names for repeated fields.

After assigning a name to the field, you assign a field number. Field numbers cannot be repurposed or reused. If you delete a field, you should reserve its field number to prevent someone from accidentally reusing the number.

Additional Data Type Support

Protocol buffers support many scalar value types, including integers that use both variable-length encoding and fixed sizes. You can also create your own composite data types by defining messages that are, themselves, data types that you can assign to a field. In addition to the simple and composite value types, several common types are published.

History

To read about the history of the protocol buffers project, see History of Protocol Buffers.

Protocol Buffers Open Source Philosophy

Protocol buffers were open sourced in 2008 as a way to provide developers outside of Google with the same benefits that we derive from them internally. We support the open source community through regular updates to the language as we make those changes to support our internal requirements. While we accept select pull requests from external developers, we cannot always prioritize feature requests and bug fixes that don’t conform to Google’s specific needs.

Developer Community

To be alerted to upcoming changes in Protocol Buffers and to connect with protobuf developers and users, join the Google Group.

Additional Resources