<2023-08-25 Fri>tech

Protobuf Infodump

I am getting more into the habit of infodumping, so here is another one in this series, this time on Protobuf.

The official Protobuf library does not have a C version, but third-party implementations are easy to find.

I also found this interesting C++ library, https://github.com/mapbox/protozero, which claims to achieve high performance by operating at a lower level than other libraries. Its tutorial is particularly interesting: it seems that you have to manually examine tags and cast the data into the right form.
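To make the "manually examine tags" part concrete, here is a minimal Python sketch of what happens at that level. This is not protozero's API; the names read_varint and read_tag are made up for illustration. It only relies on the protobuf wire format: every field starts with a varint key whose low 3 bits are the wire type and whose remaining bits are the field number.

```python
def read_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data, least significant group first.
        result |= (byte & 0x7F) << shift
        # A clear high bit marks the last byte of the varint.
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_tag(buf: bytes, pos: int = 0) -> tuple[int, int, int]:
    """Split a field key into (field_number, wire_type, next_pos)."""
    key, pos = read_varint(buf, pos)
    return key >> 3, key & 0x07, pos
```

For example, the byte 0x0a decodes to field number 1 with wire type 2 (length-delimited), which is how a string or submessage in field 1 begins. A library working at this level leaves it to the caller to decide how to interpret the bytes that follow.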

Also look at the Nim version, https://github.com/PMunch/protobuf-nim, which takes an in-language approach and doesn't need an external compiler.

While playing with the Nim version, I had trouble decoding Protobuf messages from an external source. Trying to debug that led me to protoscope.

This tool is pretty useful. My workflow became:

  1. Try to deserialize the message from my Nim program. This didn't work.
  2. Dump a hex representation of the protobuf message binary from Nim into a file.
  3. Run xxd -r -ps <file with hex data> | protoscope to see Protoscope's understanding of the message.
  4. If the above step fails (i.e., the output does not match what you expect), you know that the message itself is malformed, i.e., not a valid protobuf message.
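If protoscope is not at hand, a rough version of the check in the last two steps can be sketched in Python. This is a much weaker test than protoscope: it only walks the top-level fields and reports whether they fit inside the buffer. The function names are hypothetical, and group wire types (3 and 4) are simply treated as unsupported.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def scan_fields(data: bytes) -> list[tuple[int, int, int]]:
    """Walk the top-level fields of a message.

    Returns (field_number, wire_type, payload_size) triples and raises
    ValueError if the bytes cannot be a well-formed message.
    """
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if number == 0:
            raise ValueError("field number 0 is not valid")
        if wire_type == 0:    # VARINT
            start = pos
            _, pos = read_varint(data, pos)
            size = pos - start
        elif wire_type == 1:  # I64 (fixed64, sfixed64, double)
            size = 8
            pos += size
        elif wire_type == 2:  # LEN (string, bytes, submessage, packed)
            size, pos = read_varint(data, pos)
            pos += size
        elif wire_type == 5:  # I32 (fixed32, sfixed32, float)
            size = 4
            pos += size
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if pos > len(data):
            raise ValueError("field runs past the end of the message")
        fields.append((number, wire_type, size))
    return fields


def fields_from_hex(hex_text: str) -> list[tuple[int, int, int]]:
    """Parse a plain hex dump (whitespace ignored) and scan its fields."""
    return scan_fields(bytes.fromhex("".join(hex_text.split())))
```

Calling fields_from_hex on the contents of the hex file from step 2 either lists the top-level fields or raises ValueError, which is a quick hint about whether the problem is in the message or in the decoder.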

In the good scenario, you would get the skeleton structure from protoscope, something like the following:

1: {
  1: {
    1: {"__name__"}
    2: {"sample_metric"}
  }
  1: {
    1: {"key"}
    2: {"value1"}
  }
  2: {
    1: 289.0  # 0x4072100000000000i64
    2: 1692785281000
  }
}

What is this?

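  1. There is one top-level field with tag 1. This is a submessage.
  2. Inside it are 2 submessages of tag 1 and one of tag 2.
  3. The inner submessages are straightforward: each tag 1 submessage holds a pair of strings, and the tag 2 submessage holds a floating point value and an integer.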

(In case this looks vaguely familiar, this is the protobuf structure used in Prometheus Remote Writes.)
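As a cross-check of the skeleton above, here is a small Python sketch that builds the same structure by hand from the wire format. The helper names are made up for illustration, and it assumes the usual remote write layout: labels are string pairs in field 1, and the sample in field 2 holds a double value and a varint timestamp.

```python
import struct


def varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def key(number: int, wire_type: int) -> bytes:
    """Field key: field number in the high bits, wire type in the low 3."""
    return varint(number << 3 | wire_type)


def len_field(number: int, payload: bytes) -> bytes:
    """Length-delimited field (wire type 2): key, length, payload."""
    return key(number, 2) + varint(len(payload)) + payload


def label(name: str, value: str) -> bytes:
    return len_field(1, name.encode()) + len_field(2, value.encode())


def sample(value: float, timestamp_ms: int) -> bytes:
    # Field 1 is a little-endian double (wire type 1), field 2 a varint.
    return key(1, 1) + struct.pack("<d", value) + key(2, 0) + varint(timestamp_ms)


timeseries = (
    len_field(1, label("__name__", "sample_metric"))
    + len_field(1, label("key", "value1"))
    + len_field(2, sample(289.0, 1692785281000))
)
message = len_field(1, timeseries)

# The integer protoscope prints next to 289.0 is the double's raw bit pattern.
bits = int.from_bytes(struct.pack("<d", 289.0), "little")
print(hex(bits))
print(message.hex())
```

The hex string printed at the end can be fed through xxd -r -p | protoscope to compare against the skeleton.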

Finally, protoscope can regenerate messages from this skeleton format.

  1. Save the above output to a text file.
  2. Make changes.
  3. Run protoscope -s <filename> | xxd -p > output.txt to get the hex dump of the new protobuf message.

One more thing. I had 2 such hexdump files which I wanted to compare. Normal diff doesn't really work here, since it compares whole lines and a hexdump is just a few very long lines. git-diff to the rescue: you don't have to be in a git repo to use it.

git diff --no-index --word-diff=color --word-diff-regex=. ver1.txt ver2.txt

This shows a character-level difference between the 2 text files (in this case containing data in hex format) with the nice green/red highlighting.
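If git is not available, a similar character-level comparison can be sketched with Python's difflib. This is not what git does internally, just a rough stand-in; char_diff is a made-up name.

```python
import difflib


def char_diff(old: str, new: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_text, new_text) for each differing run."""
    # autojunk=False keeps difflib from ignoring very frequent characters,
    # which matters for long hex strings with a tiny alphabet.
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (op, old[i1:i2], new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    for op, before, after in char_diff("0a03616263", "0a03616463"):
        print(op, repr(before), "->", repr(after))
```

Reading the two hex files into strings and passing them to char_diff lists only the runs of characters that changed, without the colors.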

While on this topic, I also found the following slightly unrelated things:

  1. https://github.com/dbcode/protobuf-nginx: protobuf encoding/decoding for Nginx modules using Nginx specific data structures.
  2. This MICRO'21 paper from Google about Protobuf deserialization in Hardware: https://dl.acm.org/doi/pdf/10.1145/3466752.3480051