Over the past few months, I have started to learn Zig. As part of another side project that I am working on, I discovered that I needed to create a bunch of data structures to hold in-flight data. Now, there are a few ways I could have gone about doing this!

  1. Probably the easiest... Just write the code! The problem is, I am lazy and I didn't want to just write a bunch of data structures for the hell of it. Also, what do I really learn if I just write code?
  2. Lean on the go generate pattern and define my data structures in some kind of markup language and then write Zig code to generate my data structures as part of my build process.

Number 2 sounds like the winner to me. Write a bunch of code to generate code that I don't want to write. I AM GAME!


So, why did I NEED to write a YAML parser??

The Zig stdlib has a JSON parser built right in. So, why not use JSON then? As I stated earlier, I am a bit lazy and JSON has far too many characters to type. Would JSON have been easier?

Probably!

The truth is, I have done a lot of text parsing in my day, but I have never actually looked into what it takes to parse a structured markup file like YAML or JSON. I get the basic concepts of structured markup, but I have always relied on pre-built libs to perform the actual deserializing and data structure hydration. I have never dug into the real nuts and bolts of what is happening under the covers.

In the spirit of learning, which is why I took up Zig to begin with, I decided to make a HARD right turn away from easy and dive into parsing YAML from scratch-ish.

Below you will find a high-level guide to YAML parsing that fits my specific use case. There will be some holes, as this entire endeavour was meant to build an MVP.

The Caveats

I have done a fair amount of data parsing and tokenizing in my career, and there is only so much pain I am willing to bear on this "side" project. I don't need to go so low level that I am accumulating and parsing bytes, because I have 10,000 hours of that under my belt already.

The reason I chose Zig to begin with is its interoperability with C. C is where ALL of the fun stuff really happens. You would not have Golang, C++, PHP, Linux, or anything else without C.

I chose to use libyaml, which is built and maintained by yaml.org, to do the data parsing. This may get swapped to libfyaml in the future for various spec related reasons that are well outside of the scope of this article.

Why not Rust?

Because...

How does YAML parsing work?

There is a whole spec you can read, but who has time for that!

The nuts and bolts of YAML parsing are these events:

const YamlEvents = enum(c_int) {
    YAML_NO_EVENT,
    YAML_STREAM_START_EVENT,
    YAML_STREAM_END_EVENT,
    YAML_DOCUMENT_START_EVENT,
    YAML_DOCUMENT_END_EVENT,
    YAML_ALIAS_EVENT,
    YAML_SCALAR_EVENT,
    YAML_SEQUENCE_START_EVENT,
    YAML_SEQUENCE_END_EVENT,
    YAML_MAPPING_START_EVENT,
    YAML_MAPPING_END_EVENT,
};

Each of these is an event emitted by the official C library as it tokenizes YAML for parsing. We will walk through each of the events to get a better understanding of what is happening under the covers.

YAML_NO_EVENT

This is probably fired when the byte stream fed into the YAML library is empty.

I don't really know what this does, so let's move on :)

YAML_STREAM_START_EVENT and YAML_STREAM_END_EVENT

These events denote the start and end of a byte stream. When you look at some text in a text editor or on a webpage, it is easy to forget that bits and bytes are actually representing the data under the covers. The parser reads the data byte by byte, and it needs to emit an event when it reads the first byte and when it finishes the last byte.

YAML_DOCUMENT_START_EVENT and YAML_DOCUMENT_END_EVENT

These events denote the start and end of a document. A YAML byte stream should start with --- as the first 3 bytes. When those characters are seen, the parser will emit a YAML_DOCUMENT_START_EVENT. If the --- characters are seen again in the same byte stream, a YAML_DOCUMENT_END_EVENT will be emitted, followed immediately by another YAML_DOCUMENT_START_EVENT.

Below is a YAML document annotated with the events commented inline:

#YAML_DOCUMENT_START_EVENT
---
key: value
key2:
  key3: value2
#YAML_DOCUMENT_END_EVENT
#YAML_DOCUMENT_START_EVENT
---
key4: value3
key5:
  - "some value 1"
  - "some value 2"
#YAML_DOCUMENT_END_EVENT

Essentially, these events break up multiple "files" or "documents" that all can be combined into a single byte stream.

YAML_ALIAS_EVENT

I did not implement this event in my parser, so I am not going to offer any explanation of the specific event here.

A YAML alias essentially lets you look up an Anchor somewhere else in your YAML file to reduce data duplication. I may update my parser in the future to support this, at which point I will update this article to reflect that change.

YAML_SCALAR_EVENT

A scalar is a single simple data value that represents a single piece of information. In the context of this parser, and more specifically Zig land, this is a [*]u8 value. What is a [*]u8 value?? Why is there a pointer inside of the array?

This value comes from C land. It is common in C to pass a pointer to a buffer along with a length to functions. The function will read the length and subsequently iterate through the buffer using the length as a backstop. This needs to be translated from C to Zig so that Zig knows how to deal with a many-item pointer to char.

Zig does not have the idea of strings built in; all characters are represented by unsigned integers. In this case it is an 8-bit unsigned integer, or a u8.

This means that the scalar value is an array of 8-bit unsigned integers. This gets much more complex when you are dealing with a UTF-16 character set, but this MVP implementation is only accounting for UTF-8 because IDK how to make my keyboard type anything else.
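To make that concrete, here is a small self-contained sketch (the variable names are mine, not from the parser) of how a C-style many-item pointer plus a length becomes a normal Zig slice:

```zig
const std = @import("std");

pub fn main() void {
    // Pretend this buffer and length came from C land, e.g. a
    // scalar value and its length handed to us by the C library.
    var raw = [_]u8{ 'h', 'e', 'l', 'l', 'o' };
    const c_ptr: [*]u8 = &raw; // many-item pointer: no length attached
    const len: usize = 5;      // the length C passes alongside the pointer

    // Slicing the many-item pointer reattaches the length, giving a
    // normal bounds-aware Zig slice we can print and compare safely.
    const value: []u8 = c_ptr[0..len];
    std.debug.print("{s}\n", .{value});
}
```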

When a scalar event shows up (again, ignoring ALIAS events for now), it means that you are getting either a key or a value. As the developer, it is up to you to write the logic that knows which one you are getting. I did this with a simple enum flip-flop.

const YamlCurrentlyParsing = enum {
    KEY,
    VALUE,
};

// flipidy flop
var parsing_state: YamlCurrentlyParsing = .KEY;

Depending on the structure of your file, you may get multiple scalar events in a row. In that case, inside of the scalar event, I will first read a key and then flip parsing_state to a value. Once I read a value, I will flip parsing_state back to key.

There are times when you will read a key and then immediately get a sequence or mapping event. The flip-flop would have already changed to VALUE, but for those specific events I move the flip-flop back to KEY, knowing that I am reading an array or object.

Here is a simple stripped down example below:

while (true) {
    switch (@as(YamlEvents, @enumFromInt(event.type))) {
        // removed extra events for readability
        .YAML_SCALAR_EVENT => {
            if (parsing_state == .KEY) {
                parsing_state = .VALUE;
                continue;
            }

            if (parsing_state == .VALUE) {
                parsing_state = .KEY;
                continue;
            }
        },

        // A sequence is essentially an array
        .YAML_SEQUENCE_START_EVENT => {
            parsing_state = .KEY;
        },
        .YAML_SEQUENCE_END_EVENT => {
            parsing_state = .KEY;
        },

        // A mapping is essentially an object
        .YAML_MAPPING_START_EVENT => {
            parsing_state = .KEY;
        },
        .YAML_MAPPING_END_EVENT => {
            parsing_state = .KEY;
        },

        // Zig switches must be exhaustive
        else => {},
    }
}

YAML_SEQUENCE_START_EVENT, YAML_SEQUENCE_END_EVENT, YAML_MAPPING_START_EVENT and YAML_MAPPING_END_EVENT

YAML can have arrays, and the YAML_SEQUENCE events represent the start and end of an array.

YAML can also have nested objects, and the YAML_MAPPING events represent the start and end of objects.

I have added the events into the example from earlier to represent when they are being emitted:

#YAML_DOCUMENT_START_EVENT
---
key: value
key2:
#YAML_MAPPING_START_EVENT
  key3: value2
#YAML_MAPPING_END_EVENT
#YAML_DOCUMENT_END_EVENT
#YAML_DOCUMENT_START_EVENT
---
key4: value3
key5:
#YAML_SEQUENCE_START_EVENT
  - some value 1
  - some value 2
#YAML_SEQUENCE_END_EVENT
#YAML_DOCUMENT_END_EVENT

These events really represent depth in a document. On a START event, you are moving one element deeper into the nested data structure. Inversely, an END event represents moving one element up in the data structure.

As the consumer of libyaml, it is up to you to bake those semantics into your implementation.
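One minimal way to bake those semantics in is a depth counter that moves with the START and END events. This is just a sketch (in the real implementation, the parent pointers in the YamlNode tree described below end up encoding this same depth):

```zig
// Sketch: depth goes up on START events and down on END events.
var depth: usize = 0;

switch (@as(YamlEvents, @enumFromInt(event.type))) {
    .YAML_MAPPING_START_EVENT, .YAML_SEQUENCE_START_EVENT => {
        depth += 1; // one element deeper into the nested structure
    },
    .YAML_MAPPING_END_EVENT, .YAML_SEQUENCE_END_EVENT => {
        depth -= 1; // one element back up
    },
    else => {},
}
```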

YAML Parsing Steps

1. Read Bytes

There are a variety of different ways that you can get your hands on some bytes. They are free all over the internet!

For YAML parsing, you need your bytes to look and smell like YAML. For this project, I just wrote a YAML file, but you could also get one by setting up an HTTP server and receiving a POST.

However you choose to do it, you need a YAML document of some kind.

For my purposes, I am using the const bytes = @embedFile("./file.yaml"); builtin in Zig. This tells the compiler to read the file and store its contents statically inside of the application. This is perfectly fine for the go generate pattern that I am after.

2. Iterate Over Events

This is where things start to get interesting.

There are really 2 approaches to the Iterate Over Events and Hydrate Data Structure steps:

  1. Create an Intermediate Data Structure: For my purposes, this is the way I went. I created a structure called YamlNode which represents both the current node being worked on and any of its child nodes where depth is concerned. The primary drawback to this approach is the additional memory allocations and memory management. The primary upside is that it makes debugging SUPER simple. I would never use this approach on a high performance system because of the overhead, but for my MVP, this is perfect.
  2. Hydrate In-Place: As you are stepping through the events, you can couple that with the structure you are trying to hydrate. The structure should exactly represent the YAML you are parsing, so stepping through should be fairly trivial. This also avoids all of the additional memory allocations that I talked about in the other approach. The primary downside is simply this: What do you do with type or structure mismatches? Do you keep on trucking and return what you can with logged errors, or do you panic your application/parser?

The intermediate data structure

I chose approach 1 for the MVP because I am more interested in flexibility, visibility, and debug-ability than I am in raw performance.

Stripping away all of the methods, a YamlNode looks like this:

const YamlNode = struct {
    parent: ?*YamlNode,              // Tracks the parent node if one exists
    node_type: YamlNodeType,         // Sets the data type of the node: STRING, INT, FLOAT, BOOL, etc.
    key: ?[]u8,                      // YAML is a series of K/V pairs, and this tracks the key
    value: ?[]u8,                    // The raw parsed value in bytes
    nodes: std.ArrayList(*YamlNode), // An array of all child nodes
};

The only YamlNode without a parent is the root node. That node will get a node_type of .DOCUMENT.

Otherwise, this structure allows me to iterate up and down the stack creating and attaching nodes together for an in-memory representation of the raw YAML bytes.

The only special case is arrays. I have a special node_type called .ARRAY_ELEMENT which allows me to collect all of the key/value pairs of one element together inside of a generic container. Without the .ARRAY_ELEMENT container, arrays would appear in this data structure as objects with a flattened key/value set, making hydration impossible.
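To make that concrete, here is a representative pseudocode layout (in the same loose style as the struct layout shown later) of how the key5 array from the earlier example could land in this structure. The exact shape and the non-.ARRAY_ELEMENT node_type names here are illustrative, not lifted from the parser:

```
var nodes = YamlNode{
    .node_type = .DOCUMENT,
    .nodes = []YamlNode{
        {
            .key = "key5",
            .node_type = .OBJECT, // the sequence container itself
            .nodes = []YamlNode{
                {
                    .node_type = .ARRAY_ELEMENT, // generic per-element container
                    .value = "some value 1",
                },
                {
                    .node_type = .ARRAY_ELEMENT,
                    .value = "some value 2",
                },
            },
        },
    },
};
```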

Generics

One other thing you may notice when looking at the code is the HEAVY use of generics. Generics allow me to write one set of code which is then compiled for every type that gets passed into the generic code.

pub fn YamlParser(comptime T: type) type {}

Generics use a special Zig keyword: comptime. This means that the compiler must know the type or types being passed into the function at compile time (not runtime). Since Zig is a strongly typed language, this is generally pretty easy to accomplish. The only real footgun here is deriving a type based off runtime values. If you are trying to use an if statement on user input to determine a type to pass into a generic function, that is purely a no-go without a bunch of additional reflection. The types must be known while being compiled. This is a bit different from Go, because Go is a garbage collected language: the allocations and memory layout can be figured out on the fly, so typing doesn't need to be concretely known. In Zig, you manage your own memory, so all comptime types must be known at... well... comptime :)

Imagine this example:

pub fn YamlParser(comptime T: type) type {}
pub fn main() void {
    const parser1 = YamlParser(CONCRETE_TYPE_1);
    const parser2 = YamlParser(CONCRETE_TYPE_2);
}

Using generics at comptime essentially transforms the above into this when compiling:

const parser1 = struct {
    pub fn parse() CONCRETE_TYPE_1 {...}
};
const parser2 = struct {
    pub fn parse() CONCRETE_TYPE_2 {...}
};

The logic inside of the parser code stays exactly the same, but the types are transformed from generic types to concrete types in a one-to-many relationship (one function with many types). From there, the compiler can run all of its safety checks and compilation steps on the now concrete implementation.

You can find out a whole lot more about generics here. I am not going to pretend to be an expert when real experts exist!

The result

The result of all of this is a function that will loop over all events and create an internal tree of YamlNode structures for use in hydration. Nesting is set up inside of the YamlNodes to help define the depth of the document, which is useful in the hydration step below.

Using our YAML example from before, here is a representative example of what the YamlNode structure looks like leaving the Iterate Over Events stage:

---
key: value
key2:
  key3: value2

The above YAML is transformed into the struct layout below (some fields are omitted for brevity's sake):

var nodes = YamlNode{
    .node_type = .DOCUMENT,
    .nodes = []YamlNode{
        {
            .key = "key",
            .value = "value",
            .node_type = .STRING,
        },
        {
            .key = "key2",
            .node_type = .OBJECT,
            .nodes = []YamlNode{
                {
                    .key = "key3",
                    .value = "value2",
                    .node_type = .STRING,
                },
            },
        },
    },
};

3. Hydrate Data Structure

Finally, the fun part. We have some YAML sitting in an internal tree structure that can be traversed to hydrate a known data structure for use in our application. I am going to stick with the YAML example we have been using thus far with a struct below that represents the data structure we are trying to hydrate:

const Data = struct {
    key: []u8 = undefined,
    key2: SubData = undefined,
};

const SubData = struct {
    key3: []u8 = undefined,
};

There are a few interesting facets of a nested data structure like this:

  1. It is nested, so we will need to know not only what the containing data structure looks like, but also what all of that structure's children look like.
  2. Zig does not have strings, just arrays of unsigned integers (8-bit unsigned ints in this case; see my previous note about UTF-16). Since they are of unknown size at compile time, we will be forced into doing some memory allocations.
  3. Writing a specialized function to hydrate this structure means that we would need to write a specialized function for every data structure, which violates my core principle of being as lazy as possible.

Enter Zig Generics again!

Let's deal with the allocations first

A programming language like Go is a garbage collected language. That means that, under the covers, Go is performing memory allocations on your behalf. Once your data structures or variables lose scope and reference in the application, Go will automagically clean your memory up for you.

Yes, I realize it is not really magic. There is a TON of documentation to find online about how this magic really works.

Zig is NOT a garbage collected language, meaning you are responsible for all of your allocations and deallocations. Because we have "strings" of unknown size that will require allocations, which a user of this library will need to know about and subsequently deallocate, hydration will require one more wrapping generic that incorporates a deinit function to deinitialize any memory created as part of the hydration process.

pub fn DeserializedYaml(comptime T: type) type {
    return struct {
        arena: std.heap.ArenaAllocator,
        value: T,
        messages: std.ArrayList([]u8),

        const Self = @This();

        pub fn deinit(self: *Self) void {
            self.arena.deinit();
        }
    };
}

Above, we have another generic that will take in any type (in this case the Data struct type), with an attribute to track the ArenaAllocator, the value which is our hydrated type, and messages which tell us about any errors encountered during the hydration process.
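As a usage sketch, assuming some parse entry point (the parser and parse names here are hypothetical, not the library's actual API), the caller ends up with something like this, where a single defer cleans up everything the arena allocated during hydration:

```zig
// Hypothetical usage sketch built around the DeserializedYaml(T) wrapper.
var result: DeserializedYaml(Data) = try parser.parse();
defer result.deinit(); // frees every hydration allocation in one call

const data: Data = result.value; // the hydrated struct
_ = data;

// Any problems encountered during hydration land in messages.
for (result.messages.items) |msg| {
    std.debug.print("hydration message: {s}\n", .{msg});
}
```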

Next, Let's deal with the generic function

Writing a generic hydrator heavily relies on Zig's built in comptime reflection system.

Code reflection is a programming concept that involves a program inspecting, analyzing, and sometimes modifying its own structure and behavior. It is commonly used to achieve dynamic behavior in applications; in Zig, this inspection happens at comptime rather than at runtime.

In this case we are using reflection for code introspection to examine the code's own structure, methods, and properties.

fn WalkDeserializeToStruct(comptime T: type) type {
    return struct {
        node: *YamlNode,
        ...

        const Self = @This();

        pub fn walk(self: *Self) !T {
            switch (@typeInfo(T)) {
                .array => |_| {
                    ...
                },
                .bool => {
                    ...
                },
                .float => {
                    ...
                },
                .int => {
                    ...
                },
                .pointer => |p| {
                    switch (p.size) {
                        .One => {
                            ...
                        },
                        .Slice => {
                            ...
                        },
                        else => {
                            unreachable;
                        },
                    }
                },
                .@"struct" => |s| {
                    ...
                },
                else => {
                    unreachable;
                },
            }
        }
    };
}

This function above takes in a type, in this case Data, and will recursively walk down both the YamlNodes and Zig type to hydrate a data structure.

The major hurdle to jump over when doing this kind of hydration is comptime. Normally, if I were to go about filling in a structure like this, I would take in the YamlNode, evaluate the key, run a for loop using reflection to look for a matching struct field, then recursively re-call this function with that information to allocate and set the field value.

Because this all needs to be evaluated in comptime, the code has a few interesting cases of what appear to be extremely inefficient for loops to accomplish this. We will look at the primary .@"struct" case:

fn WalkDeserializeToStruct(comptime T: type) type {
    return struct {
        nodes: []*YamlNode,
        node: *YamlNode,
        ...

        const Self = @This();

        pub fn walk(self: *Self) !T {
            switch (@typeInfo(T)) {
                ...
                .@"struct" => |s| {
                    const r: *T = try self.alloc.create(T);
                    r.* = T{};
                    for (self.nodes) |current_node| {
                        self.node = current_node;

                        inline for (s.fields) |field| {
                            if (std.mem.eql(u8, field.name, self.node.key)) {
                                var next = try walk_and_deserialize_to_struct(self.alloc, self.node, field.type);
                                @field(r.*, field.name) = try next.walk();
                            }
                        }
                    }

                    return r.*;
                },
            }
        }
    };
}

I will walk through what the code is doing here step by step:

  1. The first thing you see is .@"struct" => |s|. This is saying: if the generic type we got is a struct, then inside of the scope of the braces, capture its type information to the var s for future use.

  2. Next is a constant assignment: const r: *T = try self.alloc.create(T);. This creates a return variable called r, which is a pointer to T, where T was passed in as a comptime parameter to this anonymous struct.

  3. Next, you see r.* = T{}. This dereferences the pointer that r is carrying and assigns T{} to it, which is how a memory layout gets initialized in Zig.

  4. Next, we loop over all of the nodes (YamlNode) to start the hydration process.

  5. Next you can see an inline for statement. In C you have macros, which essentially let you write code templates that the compiler then expands into your code. inline is Zig's way of accomplishing something similar: it tells the compiler to unroll the loop, duplicating the body for each field on the struct, more or less as a series of if statements. The for statement ends in |field|, which again is a capture that makes the current field accessible inside the braces.

  6. Next, there is this odd if statement: if (std.mem.eql(u8, field.name, self.node.key)) {. Because Zig does not have the concept of strings, we are asking Zig to compare two different arrays of bytes and tell us if they are equal. If they are, we move on to the next step.

    • This is where things start to look really inefficient. Loops inside of loops inside of loops are a programming 101 no-no when you want code that performs well. But remember the keyword inline from before: it is essentially unrolling the loop, so it really isn't a loop. From there, Zig is intelligent enough to evaluate the hot paths of code that it really needs and drop anything it doesn't from the final binary.
  7. Finally, we get to the payoff. We set up a var called next that utilizes a helper function walk_and_deserialize_to_struct, which just helps set up the next recursive WalkDeserializeToStruct(comptime T: type) type. From there we call @field(r.*, field.name) = try next.walk();, which takes the result of the recursion and assigns it to the return variable's field.

    • But hold on there buddy, what is @field? @field is a built-in Zig reflection function that accesses a specific field of a struct. Once you have the field, you can either read its value (if you know or reflect the type) or assign data to it.
    • Why not just use self.node.key instead of field.name?? This is where comptime comes into play. The compiler can understand what field.name means because field names are internal concepts known to the compiler. self.node.key comes from userland at runtime, so the compiler has no idea what that var holds, and it cannot be used where a comptime-known value is required. This is the reason for the inline for and the seemingly inefficient loops.
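A tiny self-contained example of @field with a comptime-known field name (the Point struct here is just for illustration):

```zig
const std = @import("std");

const Point = struct { x: i32 = 0, y: i32 = 0 };

pub fn main() void {
    var p = Point{};

    // The field name is a comptime-known string literal, so this compiles.
    @field(p, "x") = 42;
    std.debug.print("x = {d}\n", .{p.x});

    // Something like @field(p, some_runtime_string) would NOT compile:
    // the compiler must know the field name at comptime.
}
```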

The Result

This recursive approach to hydrating a data structure means that there is now a generic function that will take in any data type, reflect its way through the fields, and try to match them up with the YamlNode tree structure to produce a hydrated set of data that can be used by the application.

Neat Right!

There is still a ton of depth that I want to write about here, since I am only scratching the surface. Feel free to check back later for updates!

Conclusion

Aside from my professional career of writing way too much code way too fast, this is my first time throwing some code up online after learning:

  1. A new programming language
  2. How YAML really works
  3. What to do when there is no garbage collector to save me
  4. Real reflection, aside from just using a simple instanceof in other languages

Normally when doing this professionally, I would open a PR and ask for feedback. In this instance, I am not able to do that, so I threw the code up on Reddit for review instead.

Zig YAML parser, posted by u/wageof in r/Zig

The response was good and one kind soul was even willing to provide a high level review for me.

Even though this was a side project, I did learn a lot from it. I think everyone should take time out and go down the road less traveled to see how things are really working under the covers.