
Serialization for Embedded Systems

Published On:
Apr 24, 2024
Last Updated:
Sep 1, 2024

This page details a way (one of many!) of serializing messages for embedded systems. The method described here is based on Google’s protobuf serialization format along with additional escaping and framing to allow for the data to be streamed across a communication channel like UART, SPI, a WebSocket or any other “byte stream” style communication.

Schema vs. Schemaless Serialization

Is it better to use a schema-based serialization format or a schemaless one for embedded systems? I generally think formats with schemas are a better fit for embedded systems, because:

  1. A schema allows for the creation of structs or classes in the embedded code that match the message definitions, which can then be used to serialize and deserialize messages. Without these schema-generated structs it can be hard to represent the data without dynamic memory allocation (which is avoided at runtime in many embedded systems).
  2. The schema provides a way of defining and documenting your “API” for the messages that are sent between devices. In typed languages this can improve the developer experience because your message types can be objects containing the message fields. The types, variable names and comments in the .proto file(s) can also serve as the documentation of the API (you can also create proper documentation from .proto files, e.g. see the GitHub repo pseudomuto/protoc-gen-doc).

Schema based serialization formats include protobuf, Cap’n Proto, and FlatBuffers. Schemaless serialization formats include JSON, CBOR, and MessagePack.

protobuf

We are going to use protobuf to convert objects in computer memory into a stream of bytes (this is what serialization is).

protobuf is a schema-based binary serialization format developed by Google. It is one of the most popular serialization formats. The schema (the message definitions) is defined in a .proto file. protobuf is officially supported in C++, Java, Python and other languages. One language that is not officially supported is C! Thankfully, there is a popular 3rd party tool called Nanopb which provides support for C, and is well-suited for use on embedded systems. For each language, protobuf comes in two parts:

  1. A compiler that takes the .proto file and generates objects representing each message in the target language.
  2. A runtime library that provides functions to serialize and deserialize messages.

Protobuf serializes each variable in a message as a tag (a unique field ID) followed by the variable's value. Providing a tag for each variable might seem like a waste. Why not just send the values in order and get rid of all the bytes the tags take up? The reason is to allow for both forwards and backwards compatibility when the message format changes. The tags allow newer code to receive an old message and create default values for variables that are not present, and older code to receive a new message and simply ignore variables it doesn't know about. They also allow for optional variables, and for the ability to not send a variable at all if it's set to its default value.

Let’s look at a basic example to see how protobuf works. Here is a simple .proto file:

message myMessage
{
    uint32 myInt = 1;
}

We’ll set myInt to 7. This encodes to (I’m using https://www.protobufpal.com/ to generate the encoded message):

08 07

The first byte 0x08 is the field key: the tag shifted left by 3 bits, OR'd with the wire type in the lower 3 bits ((1 << 3) | 0, where wire type 0 means varint). You can see this by changing the tag to 2 and seeing that the first byte changes to 0x10 (tag 3 gives 0x18, and so on). The second byte 0x07 is the value of the variable. uint32 is a variable-width type (there is no uint8 or uint16). Any uint32 value up to 127 encodes to 2 bytes in total (1 byte for the field key, 1 byte for the value); 128 needs 3 bytes:

Value   Encoded Msg.
7       08 07
127     08 7F
128     08 80 01
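
To make the table above concrete, here is a minimal C sketch (not part of any protobuf library, just for illustration) of how a single uint32 field is encoded as a field key byte followed by a varint:

#include <stdint.h>
#include <stdio.h>

// Encode a uint32 as a protobuf varint (little-endian base-128, MSB set on all
// bytes except the last). Returns the number of bytes written.
static size_t encode_varint(uint32_t value, uint8_t *out) {
    size_t i = 0;
    while (value >= 0x80) {
        out[i++] = (uint8_t)(value & 0x7F) | 0x80;
        value >>= 7;
    }
    out[i++] = (uint8_t)value;
    return i;
}

int main(void) {
    uint8_t buf[8];
    // Field key = (field number << 3) | wire type. Wire type 0 = varint.
    buf[0] = (1 << 3) | 0;                        // 0x08 for field 1
    size_t len = 1 + encode_varint(128, &buf[1]); // value 128 -> 0x80 0x01
    for (size_t i = 0; i < len; i++) {
        printf("%02X ", buf[i]);                  // prints: 08 80 01
    }
    printf("\n");
    return 0;
}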

protobuf supports a number of useful types: string is handy in the embedded world for serializing arbitrary-length char * data, bytes is useful for serializing binary data, and repeated is useful for serializing arbitrary-length arrays of other types.

One of it’s downsides for embedded systems is the lack of small variable types. It provides uint32 and bytes type (array of bytes), but no uint8, uint16 or bit fields! It does do clever serialization though and will encode small uint32 numbers in less than 4 bytes (as we saw above). However, we still lack the expressiveness of smaller types. And when using Nanopb, by default uint32s, will all map to uint32_t in the generated structs, which may be a memory issue for some users. Luckily, Nanopb provides a way of specifying smaller types in the .proto file, so that the resulting structs use smaller type (presumably an error is thrown during decoding if Nanopb received an encoded number that doesn’t fit in the smaller type).

If the lack of smaller types is a show stopper for you, have a look at bitproto (more on this in the Alternatives section).

protobuf knows how to decode a sequence of received bytes into a message as long as:

  1. You know the message type of the message.
  2. You provide it with the exact bytes that were used to create the encoded message.

This leaves you with two problems if you are sending the encoded bytes in a stream-like fashion, e.g. across a UART:

  1. How do you know the type of the message, if you want to send more than one type of message?
  2. How do you know when one message ends and the next one starts?

Protobuf Defaults

protobuf does not send message fields that are set to their default value. In proto3 there is no ability to set a user-defined default value (there were good reasons for this, relating to portability across multiple languages). Instead, protobuf uses the following default values:

  • uint32, int32, sint32, fixed32, sfixed32: 0
  • bool: false
  • string, bytes, repeated: Empty (a zero-length string/array)
  • Enum: First value in the enum (which is always 0 in proto3).
  • Message: Language-specific representation of null

Because of this, you have to be careful when basing logic around the “presence” of fields in a message. By default, for all field types except Message (i.e. all scalar types), you cannot distinguish between a field that was set to its default value and a field that was not assigned at all. You can detect the presence of a Message field because it will be set to null if not assigned (as per above).

protobuf provides an optional modifier if you do need to detect the presence of scalar types. In this case, protobuf will send the field even if it is set to its default value. How the encoder works out that the field was set is language specific. On the receiving/decoding side, protobuf provides a language-specific API to check whether the field was set, as well as its value.
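
As a rough illustration of what this looks like in C with nanopb: marking a scalar field as optional typically results in a has_<field> flag being added to the generated struct, which you can check after decoding. The sketch below assumes a hypothetical Settings message; the exact generated names depend on your nanopb version and options.

#include <stdbool.h>
#include <stdint.h>

// Hypothetical struct that nanopb might generate for:
//   message Settings { optional uint32 timeout_ms = 1; }
// With proto3 "optional", nanopb typically adds a has_<field> flag so that
// presence can be distinguished from the default value of 0.
typedef struct _Settings {
    bool has_timeout_ms;
    uint32_t timeout_ms;
} Settings;

void apply_settings(const Settings *msg) {
    if (msg->has_timeout_ms) {
        // Field was explicitly set by the sender (even if set to 0).
        // set_timeout(msg->timeout_ms);
    } else {
        // Field was absent; fall back to a sensible default.
        // set_timeout(DEFAULT_TIMEOUT_MS);
    }
}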

The Message Type

There are a few ways to solve the problem of not knowing the message type.

Higher-level languages can solve this problem by using an RPC client. gRPC would be the obvious choice if using protobuf, as it's built on top of protobuf, with the RPC definitions defined alongside the message definitions in the .proto file. This is not a great solution for low-level embedded systems though, as the gRPC server and client are quite heavy. There are embedded-oriented RPC libraries like eRPC, but they are not as well supported as protobuf is in higher-level languages.

The enum Method

One solution is to use an enum to define the message types and then send that along with your message (e.g. in a header before the message; we'll cover the header in more detail later). Here is an example of this in a .proto file:

enum MessageType
{
    // proto3 requires the first enum value to be 0, and enum value names must
    // not clash with the message names below (they share the same scope).
    MESSAGE_TYPE_UNKNOWN = 0;
    MESSAGE_TYPE_HELLO = 1;
    MESSAGE_TYPE_GOODBYE = 2;
}
message HelloMessage
{
    string greeting = 1;
}
message GoodbyeMessage
{
    string greeting = 1;
}

The oneof Method

A different approach is to use protobuf's oneof functionality and create a wrapper message which contains oneof all the possible messages that can be sent. oneof allows you to specify a field which can be one of a number of different types (think of it like a tagged union). Here is an example of this in a .proto file:

message WrapperMsg
{
    oneof innerMsg {
        HelloMsg helloMsg = 1;
        GoodbyeMsg goodbyeMsg = 2;
    }
}
message HelloMsg
{
    string greeting = 1;
}
message GoodbyeMsg
{
    string greeting = 1;
}

Behind the scenes, protobuf encodes an ID for each possible message in the oneof field, and exposes a way in each supported language to determine which type was sent. In a way, this is very similar to the enum method above, except that protobuf generates the enum and includes it in the message, rather than you adding it to a header.
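
With nanopb, for example, the generated WrapperMsg struct typically contains a which_innerMsg discriminator alongside a union of the inner messages, so dispatching on a decoded message looks something like the sketch below (treat the exact field and _tag constant names as indicative and check your own generated header):

#include "wrapper.pb.h"  // Hypothetical header generated by nanopb from the .proto above

// Dispatch on a decoded WrapperMsg. nanopb stores which member of the oneof
// was set in the which_innerMsg discriminator, with one _tag constant per option.
void handle_message(const WrapperMsg *msg) {
    switch (msg->which_innerMsg) {
        case WrapperMsg_helloMsg_tag:
            // handle_hello(&msg->innerMsg.helloMsg);
            break;
        case WrapperMsg_goodbyeMsg_tag:
            // handle_goodbye(&msg->innerMsg.goodbyeMsg);
            break;
        default:
            // Unknown or unset message type.
            break;
    }
}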

One further benefit from the wrapped message approach is that you can add additional generic fields to the wrapper message which will then be added to all messages. For example, this could be useful for holding things such as a timestamp or CRC checksum.

Escaping and Framing

protobuf does not provide a way to determine when one message ends and the next one starts. Again, this is a problem if you want to send messages in a stream-like fashion. The solution to this is to further process the protobuf-encoded message by adding a specific end-of-packet delimiter. You then have to make sure to escape all occurrences of the delimiter in the message. This is called “escaping and framing”.

For example, you could pick the end-of-packet delimiter to be 0xFE (it’s a good idea to avoid commonly occurring bytes like 0x00 or 0xFF so that you don’t have to escape as many bytes). We will also pick a start-of-packet delimiter. This is not mandatory, but allows us to throw away bytes easily for example if we connect to the other device half-way through a packet. Let’s use 0xFD. We also need to pick an escape character, let’s choose 0xFC.

Then during encoding:

  • Every time we come across the end-of-packet delimiter 0xFE, replace it with 0xFC 0x00.
  • Every time we come across the start-of-packet delimiter 0xFD, replace it with 0xFC 0x01.
  • Every time we come across the escape character 0xFC, replace it with 0xFC 0x02 (this is escaping the escape character).

Now we have a message which is guaranteed to have no 0xFE or 0xFD bytes in it, so we can safely prefix and suffix the message with these bytes as unique start-of-packet and end-of-packet identifiers.

For example, let’s say we have the following protobuf encoded message:

0x08 0xFE 0x01 0xFC 0xAA

This would be escaped to:

0x08 0xFC 0x00 0x01 0xFC 0x02 0xAA

And then framed (the leading 0xFD is the start-of-packet delimiter, the trailing 0xFE is the end-of-packet delimiter):

0xFD 0x08 0xFC 0x00 0x01 0xFC 0x02 0xAA 0xFE

This is what the sender sends across the communication channel, e.g. UART, SPI, etc.
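
Here is a minimal C sketch of the escaping and framing step described above (the delimiter values match the example; buffer handling is kept deliberately simple):

#include <stddef.h>
#include <stdint.h>

#define SOP 0xFD  // Start-of-packet delimiter
#define EOP 0xFE  // End-of-packet delimiter
#define ESC 0xFC  // Escape character

// Escape and frame a protobuf-encoded payload into `out`.
// Returns the number of bytes written, or 0 if `out` is too small.
// Worst case output size is 2 * in_len + 2.
size_t frame_packet(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_len) {
    size_t o = 0;
    if (out_len < 1) return 0;
    out[o++] = SOP;
    for (size_t i = 0; i < in_len; i++) {
        uint8_t b = in[i];
        if (b == EOP || b == SOP || b == ESC) {
            if (o + 2 > out_len) return 0;
            out[o++] = ESC;
            // EOP -> 0x00, SOP -> 0x01, ESC -> 0x02
            out[o++] = (b == EOP) ? 0x00 : (b == SOP) ? 0x01 : 0x02;
        } else {
            if (o + 1 > out_len) return 0;
            out[o++] = b;
        }
    }
    if (o + 1 > out_len) return 0;
    out[o++] = EOP;
    return o;
}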

What does the receiver do? The receiver throws away bytes until it receives a start-of-packet delimiter. Then it buffers received bytes until it receives an end-of-packet delimiter. Then it performs the reverse of the escaping process to get the original protobuf encoded message. This can then be decoded with protobuf if you used the oneof method above.
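
On the receive side, a small state machine is enough to de-frame and un-escape the stream one byte at a time. Here is a rough sketch using the same delimiter values (real code would also want to report buffer overflows and malformed escape sequences):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SOP 0xFD
#define EOP 0xFE
#define ESC 0xFC
#define MAX_PACKET_LEN 256

typedef struct {
    uint8_t buf[MAX_PACKET_LEN];
    size_t len;
    bool in_packet;
    bool escaping;
} Deframer;

// Feed one received byte in. Returns true when a complete, un-escaped packet is
// available in d->buf (d->len bytes long), ready to pass to the protobuf decoder.
bool deframer_feed(Deframer *d, uint8_t byte) {
    if (byte == SOP) {
        // Start (or restart) a packet; discard anything buffered so far.
        d->in_packet = true;
        d->escaping = false;
        d->len = 0;
        return false;
    }
    if (!d->in_packet) {
        return false;  // Throw away bytes until we see a SOP
    }
    if (byte == EOP) {
        d->in_packet = false;
        return true;   // Complete packet buffered
    }
    if (d->escaping) {
        d->escaping = false;
        uint8_t original = (byte == 0x00) ? EOP : (byte == 0x01) ? SOP : ESC;
        if (d->len < MAX_PACKET_LEN) d->buf[d->len++] = original;
        return false;
    }
    if (byte == ESC) {
        d->escaping = true;
        return false;
    }
    if (d->len < MAX_PACKET_LEN) d->buf[d->len++] = byte;
    return false;
}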

If you decided to use the enum method to determine the message type, you would need to add the message type to the start of the encoded protobuf message before the escaping and framing process. You would then unpack this during decoding and call the correct decode function based on the message type. Another thing you may want to do is add a CRC checksum. This would be added to the header similarly to the message type, and ideally be calculated over the protobuf-encoded message and the message type enum (if used).
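
As a sketch of what that could look like, the snippet below builds a small header containing the message type and a CRC-16 calculated over the message type byte and the protobuf payload (CRC-16-CCITT is used here, but the choice of CRC and header layout is just an example, not anything mandated by protobuf). The resulting buffer would then be escaped and framed as shown above:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Simple bitwise CRC-16-CCITT (polynomial 0x1021), chainable via the crc argument.
static uint16_t crc16_ccitt(uint16_t crc, const uint8_t *data, size_t len) {
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

// Build [msg_type][crc_hi][crc_lo][protobuf payload] into `out`.
// Returns the number of bytes written, or 0 if `out` is too small.
size_t build_packet(uint8_t msg_type, const uint8_t *payload, size_t payload_len,
                    uint8_t *out, size_t out_len) {
    if (out_len < payload_len + 3) return 0;
    uint16_t crc = crc16_ccitt(0xFFFF, &msg_type, 1);   // CRC over the message type...
    crc = crc16_ccitt(crc, payload, payload_len);       // ...and the protobuf payload
    out[0] = msg_type;
    out[1] = (uint8_t)(crc >> 8);
    out[2] = (uint8_t)(crc & 0xFF);
    memcpy(&out[3], payload, payload_len);
    return payload_len + 3;  // Ready to be escaped and framed as shown above
}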

nanopb

nanopb allows you to specify extra information in the .proto file to help it generate structs when compiling for C. This is useful for fields such as string, repeated and bytes, which are all variable length. If you don’t specify anything, nanopb will expect you to use callbacks instead to handle the data as a stream. This is generally more cumbersome to work with than having fixed size members of message structs, so I strongly recommend specifying the extra options unless you have good reason not to.

First you have to add the following import to your .proto file:

import "nanopb.proto";

max_size for bytes

You can use max_size to specify the maximum size of a bytes field:

message Image {
    bytes data = 1 [(nanopb).max_size = 256];
}

nanopb will generate a struct containing a 256-byte array (plus a size field) for the data field:

typedef PB_BYTES_ARRAY_T(256) Image_data_t;
typedef struct _Image {
    Image_data_t data;
} Image;
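
When filling in this struct before encoding, remember that PB_BYTES_ARRAY_T gives the field both a size member and a bytes array, and you must set both. A rough usage sketch (the function and header names are just for illustration):

#include <stdint.h>
#include <string.h>
#include "image.pb.h"  // Hypothetical nanopb-generated header for the Image message

// Fill an Image message with src_len bytes of data (src_len must be <= 256,
// the max_size given in the .proto file).
Image make_image(const uint8_t *src, size_t src_len) {
    Image msg = Image_init_zero;         // Zero initializer generated by nanopb
    msg.data.size = (pb_size_t)src_len;  // Number of valid bytes in .bytes
    memcpy(msg.data.bytes, src, src_len);
    return msg;
}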

max_count for repeated

You can use max_count to specify the maximum number of elements in a repeated field:

/**
 * Represents a single x,y cartesian point.
 */
message Point {
    uint32 x = 1; // x coordinate, in the range [0, 1].
    uint32 y = 2; // y coordinate, in the range [0, 1].
}
message PointsArray {
    repeated Point points = 1 [(nanopb).max_count = 10];
}

nanopb will generate a struct with 10 elements for the points field, along with a points_count field to keep track of how many elements are in the array:

typedef struct _Point {
    uint32_t x; /* x coordinate, in the range [0, 1]. */
    uint32_t y; /* y coordinate, in the range [0, 1]. */
} Point;
typedef struct _PointsArray {
    pb_size_t points_count;
    Point points[10];
} PointsArray;
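
When populating the array, you fill in up to max_count elements and set points_count to the number of valid entries. A rough usage sketch (the function and header names are just for illustration):

#include <stddef.h>
#include <stdint.h>
#include "points.pb.h"  // Hypothetical nanopb-generated header for PointsArray

// Build a PointsArray with n points (n must be <= 10, the max_count from the .proto file).
PointsArray make_points(size_t n) {
    PointsArray arr = PointsArray_init_zero;  // Zero initializer generated by nanopb
    for (size_t i = 0; i < n; i++) {
        arr.points[i].x = (uint32_t)i;  // Placeholder coordinate values
        arr.points[i].y = (uint32_t)i;
    }
    arr.points_count = (pb_size_t)n;  // Only the first points_count entries are encoded
    return arr;
}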

max_length for string

You can use max_length to set the maximum length of a string field:

message HelloMsg {
    string text = 1 [(nanopb).max_length = 40]; // The message text.
}

nanopb will generate a struct with a char array of length 41 for the text field (+1 to allow for the null terminator):

typedef struct _HelloMsg {
    char text[41];
} HelloMsg;

No length is needed because the string length is determined by the null character.

These additional options don't prevent you from compiling the .proto file for Python etc. with the official compiler, but you do have to make sure nanopb.proto is importable, even if these options are going to be ignored for that target language.

Installing nanopb also provides a version of protoc (the official protobuf compiler) that you can use.

Generating Python Code

The Python code generated by the official protobuf compiler is awful. It embeds a giant string into a Python file which generates code at runtime. This means there are no nice message classes that you can inspect or that can be leveraged by IntelliSense in an IDE.

Luckily, there are alternatives. danielgtaylor/python-betterproto is a popular one with 1.4k stars as of May 2024.¹ It generates Python message classes (data classes, to be specific) from your .proto files. It also supports gRPC (not very useful for embedded).

Alternatives

bitproto is a protobuf-like serialization format which allows you to specify variable widths on the bit level, eliminating some of protobuf’s shortcomings when it comes to specifying small data types. It uses similar looking files to protobuf to describe messages.


NordicSemiconductor/zcbor is a C (it’s also C++ compatible) CBOR library that uses CDDL to essentially provide a “schema” for defining the messages. C/C++ code is generated from this schema. As of May 2024, it has 100 stars on GitHub.

More Resources

See the protobuf page for general information on protobuf.

References

Footnotes

  1. Daniel G. Taylor. python-betterproto [repo]. Retrieved 2024-05-02, from https://github.com/danielgtaylor/python-betterproto.