SERIALIZATION FORMATS

A Comparison Of Serialization Formats

Overview

When you want to save, send or receive data from a piece of software, there are many different serialization formats to choose from. What is the best choice for your use case? This page aims to answer this question by comparing some of the most popular serialization formats.

The following serialization formats will be reviewed:

  • CSV
  • JSON
  • Protobuf
  • TOML
  • XML
  • YAML

There is no one-size-fits-all serialization format, as the best format for the job depends on things such as the type/amount of data that is being serialized and the software that will be reading it.

The examples in the following sections show how different formats store the same data. I chose a simple, repeatable data structure that is supported by all of the reviewed serialization formats (two Pokemon objects, each of which has id, name, age and address fields). Note that many of these serialization formats can also store irregular, arbitrarily structured data (in fact, all of them can except CSV).

In each review section, a score of 1-3 is highlighted red, 4-6 orange, and 7-10 green.

CSV

CSV is well suited to storing large amounts of tabulated data in a human-readable format. It is not well suited to storing objects or hash-table-like data structures (unlike every other serialization format that is reviewed here).

CSV is not very well standardized. RFC 4180 was an attempt to standardize the format; however, the name “CSV” may also refer to files which are delimited by non-comma characters such as spaces, tabs or semicolons. In fact, it used to be called Delimiter Separated Values (DSV), although unfortunately CSV seems to be the more prevalent term these days.

The CSV format allows an optional header line as the first line in the file. If present, it contains the field names for each value in a record. This header line is very useful for labelling the data and should almost always be present.

The CSV format is well supported, with CSV libraries available for almost every popular programming language.

For a human-readable format, CSV is quite concise (see the File Size section for more info). However, it can be difficult to work out which column is which, especially when there are a large number of rows (there is only one header row, right at the top of the file), a large number of columns (there is no requirement for the columns to be equally spaced, so you end up counting commas from the left), and/or empty fields (i.e. ,,).

Example

id, name, age, address
4, Charmander, 12.34, Fire St
25, Pikachu, 56.78, Electric St
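
The value of the header line is easiest to see when the file is read programmatically. The following is a minimal sketch using Python's built-in csv module. The names CSV_TEXT and read_pokemon are illustrative only, and the example data is held in a string rather than a file. Note that skipinitialspace=True is needed here because the example puts a space after each comma, and that every value comes back as a string.

```python
import csv
import io

# The CSV example from above, held in a string for illustration.
CSV_TEXT = """id, name, age, address
4, Charmander, 12.34, Fire St
25, Pikachu, 56.78, Electric St
"""

def read_pokemon(text):
    """Parse CSV text into a list of dicts keyed by the header line."""
    # skipinitialspace=True drops the space that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]

rows = read_pokemon(CSV_TEXT)
for row in rows:
    # CSV carries no type information, so numbers are converted by hand.
    print(int(row["id"]), row["name"], float(row["age"]), row["address"])
```

Without the header line, the reader would have to be told the field names separately, which is one reason the header should almost always be present.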

Review

| Property | Value | Comment |
|---|---|---|
| Brevity | 9/10 | With only commas separating values, CSV is very concise. |
| Human Readability | 5/10 | CSV is readable, although it is easy to get lost with a large amount of data. |
| Language Support | 9/10 | CSV is widely supported and can be read in almost every major language. |
| Data Structure Support | 3/10 | CSV only supports tabular/array-like data. It does not support dictionary/map-like data, nor relational data. |
| Speed | 8/10 | CSV is very fast to serialize/deserialize. |
| Standardization | 3/10 | CSV is not well standardized. |

JSON

JSON is a ubiquitous human-readable data serialization format that is supported by almost every popular programming language. Data structures closely represent common objects in many languages, e.g. a Python dict can be represented by a JSON object, and a Python list by a JSON array. Note there are caveats to this!

Unfortunately, the JSON syntax does not support comments! The best you can do is add a __comment__ name/value pair to JSON objects, which is a poor solution. The name in a JSON object’s name/value pairs always has to be a string. JSON also has no native date type, so dates are usually stored as strings or numbers.
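
Two of these caveats can be demonstrated with Python's built-in json module. This is a small sketch only, and the "caught" field is a hypothetical name that is not part of the Pokemon data used elsewhere on this page.

```python
import json
from datetime import date

# Caveat 1: object names must be strings, so non-string dict keys are
# converted when serializing and do not survive a round trip.
original = {1: "Charmander", 2: "Pikachu"}
round_tripped = json.loads(json.dumps(original))
print(round_tripped)

# Caveat 2: there is no date type, so the json module refuses to
# serialize a date object.
try:
    json.dumps({"caught": date(2020, 1, 31)})
    date_error = None
except TypeError as exc:
    date_error = exc
print(date_error)

# A common workaround (a convention, not part of JSON itself) is to store
# the date as an ISO 8601 string and parse it back by hand.
encoded = json.dumps({"caught": date(2020, 1, 31).isoformat()})
decoded = date.fromisoformat(json.loads(encoded)["caught"])
print(decoded)
```

The workaround relies on both the writer and the reader agreeing on the string format, since nothing in the JSON data itself marks the value as a date.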

Example

[
    {
        "id": 4,
        "name": "Charmander",
        "age": 12.34,
        "address": "Fire Street"
    },
    {
        "id": 25,
        "name": "Pikachu",
        "age": 56.78,
        "address": "Electric Street"
    }
]

Review

| Property | Value | Comment |
|---|---|---|
| Brevity | 8/10 | JSON has concise syntax, although not as concise as YAML and TOML in most situations. |
| Human Readability | 5/10 | JSON is human-readable. It loses some marks because it does not support comments. |
| Language Support | 9/10 | |
| Data Type Support | 6/10 | JSON supports array and map (object) structures. It supports many different data types including strings, numbers, booleans, null, etc., but not dates. |
| Speed | 7/10 | JSON is usually fast to serialize/deserialize. |
| Standardization | 9/10 | JSON is formally standardized (ECMA-404 and RFC 8259). https://www.json.org/ |

Protocol Buffers (Protobuf)

Protobuf is a binary serialization protocol developed by Google. Since it serializes to binary, it is not human-readable (although you can still pick out strings when viewing the file as ASCII text, as in the example below).

Example

#^R^Charmander^Z^Fire Street%<A2>@
^A^R^Pikachu^Z^Electric Street%<F1>'OA
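
The strings stay readable because Protobuf stores each field as a small tag followed by the raw value. The sketch below hand-encodes one record in the Protobuf wire format to show where the bytes come from. It is for illustration only: the field numbers and types (id = 1 as a varint, name = 2 and address = 3 as strings, age = 4 as a 32-bit float) are assumptions chosen to be consistent with the control characters visible in the dump above, not the schema actually used. Real code would use classes generated by protoc from a .proto file rather than a hand-rolled encoder like this.

```python
import struct

WIRE_VARINT = 0   # integers
WIRE_LEN = 2      # length-delimited values such as strings
WIRE_FIXED32 = 5  # 32-bit values such as float

def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_tag(field_number, wire_type):
    """A tag is the field number and wire type packed into one varint."""
    return encode_varint((field_number << 3) | wire_type)

def encode_pokemon(id_, name, address, age):
    """Encode one record using the assumed field numbers 1 to 4."""
    name_b = name.encode("utf-8")
    address_b = address.encode("utf-8")
    return b"".join([
        encode_tag(1, WIRE_VARINT), encode_varint(id_),
        encode_tag(2, WIRE_LEN), encode_varint(len(name_b)), name_b,
        encode_tag(3, WIRE_LEN), encode_varint(len(address_b)), address_b,
        encode_tag(4, WIRE_FIXED32), struct.pack("<f", age),
    ])

message = encode_pokemon(4, "Charmander", "Fire Street", 12.34)
print(len(message), message)
```

Each field costs only one or two bytes of overhead on top of its value, and field names never appear in the output, which is why the format is so compact.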

Review

| Property | Value | Comment |
|---|---|---|
| Brevity | 9/10 | Protobuf, being a binary format, is very concise. Would get 10/10 if it implemented some type of encoding scheme. |
| Human Readability | 1/10 | Protobuf is not designed to be human-readable. |
| Language Support | 7/10 | Protobuf supports C, C++, C#, Dart, Go, Java, Objective-C, PHP, Python and Ruby. |
| Data Type Support | 8/10 | Protobuf allows you to define data structures in .proto files. Protobuf supports many basic primitive types, which can be combined into classes, which can then be combined into other classes. |
| Speed | 9/10 | Protobuf is very fast, especially in C++ (relative to other serialization formats). |
| Standardization | 9/10 | Protobuf is standardized by Google. |

TOML

TOML (Tom’s Obvious, Minimal Language) is a newer (relative to the others in this review) human-readable serialization format. It is quite similar to YAML in that it is aimed at configuration files, but it strives to be a simpler format (YAML can get quite complex, and this can be seen in the much slower YAML parse times).

TOML has syntax highlighters for Atom, Visual Studio, Visual Studio Code and other IDEs.

TOML suffers from verbose syntax when it comes to expressing an array of objects (or, in TOML speak, an array of tables). This can be seen in the example below, where each pokemon object in the array is delimited with [[pokemon]].

Example

[[pokemon]]
id = 4
name = "Charmander"
address = "Fire Street"
age = 12.34

[[pokemon]]
id = 25
name = "Pikachu"
address = "Electric Street"
age = 56.78

Review

| Property | Value | Comment |
|---|---|---|
| Brevity | 7/10 | TOML is quite concise, except for when it comes to arrays of tables. |
| Human Readability | 9/10 | One of TOML's primary goals was to be very easy to understand. |
| Language Support | 7/10 | |
| Data Type Support | 9/10 | TOML does not support references like YAML does (probably because TOML aims to be simple). |
| Speed | 6/10 | TOML is on the slower end of the spectrum, but is faster than YAML. |
| Standardization | 9/10 | TOML is well standardized. |

XML

XML is a human-readable serialization protocol. A well-known XML-like format is HTML, which is used to define the structure of web pages.

One disadvantage of XML is its verbosity. Its descriptive end tags, which require you to re-type the name of the element that is being closed, add to the byte count of XML data.

The specification for XML can be found at https://www.w3.org/TR/xml/.
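
As a rough illustration of how XML data is consumed, the sketch below parses the two records used throughout this page with Python's built-in ElementTree module. The names XML_TEXT and read_people are illustrative only. As with CSV, element text is always a string, so numeric fields are converted by hand.

```python
import xml.etree.ElementTree as ET

# The two records used on this page, held in a string for illustration.
XML_TEXT = """<people>
    <person>
        <id>4</id>
        <name>Charmander</name>
        <age>12.34</age>
        <address>Fire Street</address>
    </person>
    <person>
        <id>25</id>
        <name>Pikachu</name>
        <age>56.78</age>
        <address>Electric Street</address>
    </person>
</people>"""

def read_people(text):
    """Parse the XML text into a list of dicts with typed values."""
    root = ET.fromstring(text)
    people = []
    for person in root.findall("person"):
        people.append({
            "id": int(person.findtext("id")),
            "name": person.findtext("name"),
            "age": float(person.findtext("age")),
            "address": person.findtext("address"),
        })
    return people

people = read_people(XML_TEXT)
for person in people:
    print(person)
```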

Example

<people>
    <person>
        <id>4</id>
        <name>Charmander</name>
        <age>12.34</age>
        <address>Fire Street</address>
    </person>
    <person>
        <id>25</id>
        <name>Pikachu</name>
        <age>56.78</age>
        <address>Electric Street</address>
    </person>
</people>

Review

| Property | Value | Comment |
|---|---|---|
| Brevity | 3/10 | XML is not known for being short and sweet. |
| Human Readability | 5/10 | Human-readable, although you can get lost among all the tags in front of your eyes. |
| Language Support | 9/10 | Supported in all major languages, usually with built-in libraries. |
| Data Type Support | 9/10 | XML is very flexible as each element can have attributes and arbitrary child elements. |
| Speed | 9/10 | See the Speed Comparison (Benchmarking) section for more info. |
| Standardization | 9/10 | |

YAML

YAML (YAML Ain’t Markup Language) is a human-readable serialization format that is commonly used for configuration files.

The YAML specification is much larger than the JSON specification. YAML allows for relational data (references) using anchors (&) and aliases (*). YAML gets some bonus style points since the YAML homepage is even displayed in YAML (https://yaml.org/).

YAML (as of version 1.2) is designed to be a superset of JSON, which means you can usually parse JSON with a YAML parser (the YAML parser will probably take longer though, so don’t use this trick with large amounts of JSON data!).

Example

- { id: 4, name: Charmander, age: 12.34, address: "Fire Street" }
- { id: 25, name: Pikachu, age: 56.78, address: "Electric Street" }

Review

| Property | Value | Comment |
|---|---|---|
| Brevity | 9/10 | Values can default to strings, allowing you to omit quote marks. It has terser syntax than TOML for arrays of objects (in TOML you have to precede each element with [[array_name]]). |
| Human Readability | 7/10 | Basic YAML is really easy to read, however YAML's complexity can confuse a reader when using its advanced features. |
| Language Support | 6/10 | |
| Data Type Support | 10/10 | YAML even supports references (relational data)! |
| Speed | 3/10 | YAML showed the slowest serialization/deserialization run times of any format I tested, in both C++ and Python (see the Speed Comparison section for more info). |
| Standardization | 9/10 | YAML is well standardized. |

Speed Comparison (Benchmarking)

The following libraries were used for the speed comparison tests:

| Format | Python | C++ |
|---|---|---|
| CSV | csv (built-in) | fast-cpp-csv-parser (https://github.com/ben-strasser/fast-cpp-csv-parser) |
| JSON | json (built-in) | json (https://github.com/nlohmann/json) |
| Protobuf | protobuf (https://github.com/protocolbuffers/protobuf) | protobuf (https://github.com/protocolbuffers/protobuf) |
| TOML | toml (https://github.com/uiri/toml) | cpptoml (https://github.com/skystrife/cpptoml) |
| YAML | PyYAML (https://pyyaml.org/) | yaml-cpp (https://github.com/jbeder/yaml-cpp) |
| XML | ElementTree (built-in) | tinyxml2 (https://github.com/leethomason/tinyxml2) |

Python v3.7 was used for all Python tests. C++17 with the GCC compiler was used for all C++ tests. The tests ran on Debian inside a virtual machine. The purpose of the tests was to show the relative performance of the different serialization formats, which should not be affected by running inside a virtual machine.

To be representative of how the serialized data might be used, all write tests were passed the same input data, either a vector (for the C++ tests) or a list (for the Python tests) of Person objects. Each Person contains an ID (an integer starting from 0), a name (random string of 5 ASCII chars), an address (random string of 30 ASCII chars) and an age (float). Each test was required to serialize the data to the required format (using the libraries mentioned above) and then write the serialized data to disk. All read tests performed the opposite task: reading a data file, deserializing it and creating a vector/list of Person objects.

Three iterations of each test were performed, and the smallest run time of the three was selected as the most representative. Larger run times are typically the result of the OS performing extraneous tasks.
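
The procedure described above can be sketched for a single format as follows. This is not the benchmark code used for the results on this page, just a minimal Python illustration for JSON under stated assumptions: the function names are made up, people are plain dicts rather than Person objects, and ages are drawn from an arbitrary range of 0 to 100.

```python
import json
import os
import random
import string
import tempfile
import time

def make_people(count):
    """Build test records shaped like the Person objects described above."""
    def rand_str(length):
        return "".join(random.choices(string.ascii_letters, k=length))
    return [
        {"id": i, "name": rand_str(5), "address": rand_str(30),
         "age": random.uniform(0, 100)}
        for i in range(count)
    ]

def time_write(people, path, iterations=3):
    """Serialize to JSON and write to disk; return the fastest run (s)."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        with open(path, "w") as f:
            json.dump(people, f)
        timings.append(time.perf_counter() - start)
    return min(timings)

def time_read(path, iterations=3):
    """Read the file and deserialize it; return the fastest run (s)."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        with open(path) as f:
            json.load(f)
        timings.append(time.perf_counter() - start)
    return min(timings)

people = make_people(10_000)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.json")
    write_s = time_write(people, path)
    read_s = time_read(path)
print(f"write: {write_s:.3f} s, read: {read_s:.3f} s")
```

Taking the minimum rather than the mean follows the reasoning above: background activity can only make a run slower, so the fastest run is the one least disturbed by it.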

| Format | C++ Deserialization (s) | C++ Serialization (s) | Python Deserialization (s) | Python Serialization (s) |
|---|---|---|---|---|
| csv | 0.030 | 0.022 | 0.027 | 0.034 |
| json | 0.16 | 0.13 | 0.023 | 0.16 |
| protobuf | 0.015 | 0.025 | 0.26 | 0.38 |
| toml | 0.23 | 0.22 | 1.08 | 0.23 |
| xml | 0.12 | 0.16 | 0.063 | 0.25 |
| yaml | 0.48 | 0.55 | 6.84 | 3.84 |

[Figure: C++ conversion times for 10k objects in popular serialization formats.]

[Figure: Python conversion times for 10k objects in popular serialization formats.]

It is also interesting to see how the serialization formats respond to a change in the data size. If the size of the data doubles, does it take twice as long to read/write (linear response), or does it behave differently (e.g. quadratic, log(n), …)? This is called the complexity of the serialization algorithm. To test this, I increased the people array from 10,000 to 100,000 (increased by a factor of 10). This was the result…

| Format | C++ Deserialization (s) | C++ Serialization (s) | Python Deserialization (s) | Python Serialization (s) |
|---|---|---|---|---|
| csv | 0.26 | 0.18 | 0.22 | 0.38 |
| json | 1.53 | 1.50 | 0.20 | 1.59 |
| protobuf | 0.13 | 0.24 | 2.62 | 3.61 |
| toml | 2.23 | 2.19 | 9.86 | 2.08 |
| xml | 0.85 | 1.74 | 0.78 | 2.58 |
| yaml | 4.96 | 5.70 | 69.67 | 36.87 |

[Figure: C++ conversion times for 100k objects in popular serialization formats.]

[Figure: Python conversion times for 100k objects in popular serialization formats.]
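
One way to read the 10k and 100k results together is to divide each 100k timing by the matching 10k timing. With 10x the data, a ratio close to 10 suggests a roughly linear response, while a ratio near 100 would suggest a quadratic one. The sketch below does this arithmetic using the numbers copied from the two results tables; the variable and function names are illustrative.

```python
# Timings in seconds, copied from the 10k and 100k results tables.
# Column order: C++ deserialize, C++ serialize,
#               Python deserialize, Python serialize.
TIMES_10K = {
    "csv": [0.030, 0.022, 0.027, 0.034],
    "json": [0.16, 0.13, 0.023, 0.16],
    "protobuf": [0.015, 0.025, 0.26, 0.38],
    "toml": [0.23, 0.22, 1.08, 0.23],
    "xml": [0.12, 0.16, 0.063, 0.25],
    "yaml": [0.48, 0.55, 6.84, 3.84],
}
TIMES_100K = {
    "csv": [0.26, 0.18, 0.22, 0.38],
    "json": [1.53, 1.50, 0.20, 1.59],
    "protobuf": [0.13, 0.24, 2.62, 3.61],
    "toml": [2.23, 2.19, 9.86, 2.08],
    "xml": [0.85, 1.74, 0.78, 2.58],
    "yaml": [4.96, 5.70, 69.67, 36.87],
}

def scaling_ratios(small, large):
    """Return the 100k/10k timing ratio for every format and column."""
    return {
        fmt: [round(big / sml, 1) for sml, big in zip(small[fmt], large[fmt])]
        for fmt in small
    }

ratios = scaling_ratios(TIMES_10K, TIMES_100K)
for fmt, values in ratios.items():
    print(fmt, values)
```

For these measurements every ratio falls between roughly 7 and 12.5, which is consistent with an approximately linear response for all of the formats and libraries tested. With only two data sizes and three runs each, this is an indication rather than a proof.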

The code that performed these tests can be found at https://github.com/gbmhunter/BlogAssets/tree/master/Programming/serialization-formats.

File Size Comparison

When serializing large amounts of data, another important aspect is the verbosity of the format. To compare the verbosity of the different formats, we can pass each format the same data, dump the data to disk, and compare the file sizes.

| Format | File Size (MiB, 10k records) | File Size (MiB, 100k records) |
|---|---|---|
| csv | 0.41 | 4.2 |
| json | 0.81 | 8.2 |
| protobuf | 0.38 | 3.9 |
| toml | 0.94 | 9.5 |
| xml | 1.50 | 15 |
| yaml | 0.80 | 8.1 |

As expected, the file sizes grow linearly with the number of records stored (10x amount of data = 10x the file size).

[Figure: Comparative file sizes for popular serialization formats.]

Being the only binary, non-human-readable format that was compared, it is no surprise that protobuf is the most concise format. Close behind protobuf was CSV. Because CSV does not support irregular, non-flat data structures, it only requires a value delimiter (e.g. ,) and an end-of-line character (e.g. \n).
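
The same kind of comparison can be tried on a small scale using only Python's built-in libraries. The sketch below serializes the two example records to CSV, JSON and XML and compares the encoded sizes in bytes. It is an illustration of the method rather than the code behind the table above, the helper names are made up, and the exact byte counts depend on formatting choices such as indentation and line endings.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

RECORDS = [
    {"id": 4, "name": "Charmander", "age": 12.34, "address": "Fire Street"},
    {"id": 25, "name": "Pikachu", "age": 56.78, "address": "Electric Street"},
]

def to_csv(records):
    """One header line, then one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def to_json(records):
    """Compact JSON: field names are repeated for every record."""
    return json.dumps(records)

def to_xml(records):
    """XML: field names are repeated twice (open and close tag)."""
    root = ET.Element("people")
    for record in records:
        person = ET.SubElement(root, "person")
        for key, value in record.items():
            ET.SubElement(person, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

sizes = {
    "csv": len(to_csv(RECORDS).encode("utf-8")),
    "json": len(to_json(RECORDS).encode("utf-8")),
    "xml": len(to_xml(RECORDS).encode("utf-8")),
}
for fmt, size in sizes.items():
    print(fmt, size, "bytes")
```

The ordering matches the table: CSV names each field once in the header, JSON repeats the names in every record, and XML repeats them in both the opening and closing tags.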

Other Formats That Weren’t Considered

  • BSON. A binary format popularized by MongoDB that is based on JSON.
  • MessagePack. This looks similar to protobuf (uses binary encoding). Has libraries for a wide variety of languages.
