Over the past few months, I have started to learn Zig. As part of another side project that I am working on, I discovered that I needed to create a bunch of data structures to hold in-flight data. Now, there are a few ways I could have gone about doing this!
- Probably the easiest... Just write the code! The problem is, I am lazy and I didnt want to just write a bunch of data structures for the hell of it. Also, what do I really learn if I just write code?
- Lean on the
go generate
pattern and define my data structures in some kind of markup language and then write Zig code to generate my data structures as part of my build process.
Number 2 sounds like the winner to me. Write a bunch of code to generate code that I don't want to write. I AM GAME!
Related Articles
- Convert Text File To YAML With Ansible
- What Are Kubernetes Jobs And Cron Jobs?
- Convert Text File To JSON With Ansible
So, Why Did I NEED to Write a YAML Parser??
Zig's stdlib
has a JSON parser built right in. So, why not use JSON then? As I stated earlier, I am a bit lazy and JSON has far too many characters to type. Would JSON have been easier?
Probably!
The truth is, I have done a lot of text parsing in my day. I have never actually looked into what it takes to parse a structured markup file like YAML or JSON. I get the basic concepts of structured markup, but I have always relied on pre-built libs to perform the actually deserializing with data structure hydration. I have never dug into the real nuts and bolts of what is happening under the covers.
In the spirit of learning, which is why I took up Zig to begin with, I decided to make a HARD right turn away from easy and dive into parsing YAML from scratch-ish.
Below you will find a high level guide on YAML parsing that fit my specific use cases. There will be some holes as this entire endeavour was meant to build an MVP, which fit my specific use case.
The Caveats
I have done a fair amount of data parsing and tokenizing in my career, and there is only so much pain I am willing to bear on this "side" project. I don't need to go so low level that I am accumulating and parsing bytes because I have 10,000 hours of that under my belt already.
The reason I chose Zig to begin with is its interoperability with C. C is where ALL of the fun stuff really happens. You would not have Golang, C++, PHP, Linux, or anything else without C.
I chose to use libyaml, which is built and maintained by yaml.org, to do the data parsing. This may get swapped to libfyaml in the future for various spec related reasons that are well outside of the scope of this article.
Why Not Rust?
Because...
How Does YAML Parsing Work?
There is a whole spec you can read, but who has time for that!
The nuts and bolts of YAML parsing are these events:
const YamlEvents = enum(c_int) {
YAML_NO_EVENT,
YAML_STREAM_START_EVENT,
YAML_STREAM_END_EVENT,
YAML_DOCUMENT_START_EVENT,
YAML_DOCUMENT_END_EVENT,
YAML_ALIAS_EVENT,
YAML_SCALAR_EVENT,
YAML_SEQUENCE_START_EVENT,
YAML_SEQUENCE_END_EVENT,
YAML_MAPPING_START_EVENT,
YAML_MAPPING_END_EVENT,
};
Each of these is an event that is emitted by the official C library that tokenizes YAML for parsing. We will walk through each of the events to get a better understanding of what is happening under the covers.
YAML_NO_EVENT
This is probably fired when the byte stream fed into the YAML library is empty.
I don't really know what this does, so let's move on :)
YAML_STREAM_START_EVENT and YAML_STREAM_END_EVENT
These events denote the start and end of a byte stream. Normally, when you look at some text in a text editor or a webpage, we may forget that bits and bytes are actually representing the data under the covers. The parser reads the data byte by byte and it needs to emit an event when it reads the first byte and when it finishes the last byte.
YAML_DOCUMENT_START_EVENT and YAML_DOCUMENT_END_EVENT
These events denote the start and end of a document. A YAML byte stream should start with ---
as the first three bytes. When those characters are seen, the parser will emit a YAML_DOCUMENT_START_EVENT
. If the ---
characters are seen again the same byte stream, a YAML_DOCUMENT_END_EVENT
will be emitted followed immediately by another YAML_DOCUMENT_START_EVENT
.
Below is a YAML document annotated with the events commented inline:
#YAML_DOCUMENT_START_EVENT
---
key: value
key2:
key3: value2
#YAML_DOCUMENT_END_EVENT
#YAML_DOCUMENT_START_EVENT
---
key4: value3
key5:
- "some value 1"
- "some value 2"
#YAML_DOCUMENT_END_EVENT
Essentially, these events break up multiple "files" or "documents" that all can be combined into a single byte stream.
YAML_ALIAS_EVENT
I did not implement this event in my parser so I am not going to offer any explaination of the specific event here.
A YAML alias esentially lets you lookup an Anchor somewhere else in your YAML file to reduce data duplication. I may update my parser in the future to support this, at which point I will update this article to reflect that change.
YAML_SCALAR_EVENT
A scalar is a single simple data value that represents a single piece of information. In the context of this parser, and more specifically Zig land, this is a [*]u8
value. What is a [*]u8
value?? Why is there a pointer inside of the array?
This value comes from C land. It is common in C to pass both an array of pointers along with a length to functions. The function will read the length and subsequently iterate through the pointers using the length as a backstop. This needs to be translated from C to Zig so that Zig knows how to deal with a multi-pointer char.
Zig does not have the idea of strings built in and all characters are represented by an unsigned integer. In this case it is an 8bit unsigned integer or a u8
.
This means that the scalar value is an array of 8bit return values. This gets much more complex when you are dealing with a UTF16 character set, but this MVP implementation is only accounting for UTF8 because IDK how to make my keyboard type anything else.
When a scalar event shows up, again ignoring YAML_ALIAS
events for now, it means that you are either getting a key or a value. As the developer, it is up to you to write the logic that knows when you are getting a key or value. I did this with a simple ENUM flip flop.
const YamlCurrentlyParsing = enum {
KEY,
VALUE,
};
// flipidy flop
var parsing_state: YamlCurrentlyParsing = .KEY;
Depending on the structure of your file, you may get multiple YAML_SCALAR
events in a row. In that case, inside of the scalar event, I will first read a key and then flip parsing_state
to a value. Once I read a value, I will flip parsing_state
back to key.
There are times when you will read a key and then immediately get a sequence or mapping event. The flip flop would have already changed to value, but for those specific events I move the flip flop back to key knowing that I am reading an array or object.
Here is a simple stripped down example below:
while(true) {
switch (@as(YamlEvents, @enumFromInt(event.type))) {
// removed extra events for readability
.YAML_SCALAR_EVENT => {
if (parsing_state == .KEY) {
parsing_state = .VALUE;
continue;
}
if (parsing_state == .VALUE) {
parsing_state = .KEY;
continue;
}
},
// A sequence is esentially an array
.YAML_SEQUENCE_START_EVENT => {
parsing_state = .KEY;
},
.YAML_SEQUENCE_END_EVENT => {
parsing_state = .KEY;
},
// A mapping is esentially an object
.YAML_MAPPING_START_EVENT => {
parsing_state = .KEY;
},
.YAML_MAPPING_END_EVENT => {
parsing_state = .KEY;
},
};
}
YAML_SEQUENCE_START_EVENT, YAML_SEQUENCE_END_EVENT, YAML_MAPPING_START_EVENT and YAML_MAPPING_END_EVENT
Yaml can have arrays, and YAML_SEQUENCE
events represent the start and end of an array.
Yaml can also have nested objects, and YAML_MAPPING
events represent the start and end of objects.
I have added the events into the example from earlier to represent when they are being emitted:
#YAML_DOCUMENT_START_EVENT
---
key: value
key2:
#YAML_MAPPING_START_EVENT
key3: value2
#YAML_MAPPING_END_EVENT
#YAML_DOCUMENT_END_EVENT
#YAML_DOCUMENT_START_EVENT
---
key4: value3
key5:
#YAML_SEQUENCE_START_EVENT
- some value 1
- some value 2
#YAML_SEQUENCE_END_EVENT
#YAML_DOCUMENT_END_EVENT
These events really represent depth in a document. On a YAML_MAPPING_START
event, you are really moving one element deeper in the nested data structure. Inversely, YAML_MAPPING_END
represents moving one element up in the data structure.
As the consumer of libyaml, it is up to you to bake those semantics into your implementation.
YAML Parsing Steps
1. Read Bytes
There are a variety of different ways that you can get your hands on some bytes. They are free all over the internet!
For YAML parsing, you need your bytes to look and smell like YAML. For this project, I just wrote a YAML file. You can get them by setting up an HTTP server and receiving a post.
However you choose to do it, you need a YAML document of some kind.
For my purposes, I am using the const bytes = @embedFile("./file.yaml");
builtin function in Zig. This tells the complier to read the file and store them statically inside of the application. This is perfectly fine for the go generate
pattern that I am after.
2. Iterate over Events
This is where things start to get interesting.
There are really two approaches to the steps of Iterate Over Events
and Hydrate Data Structure
:
- Create an Intermediate Data Structure: For my purposes, this is the way I went. I created a structure called
YamlNode
, which represents both the node currently being worked on and any of its child nodes when depth is concerned. The primary drawback to this approach is having additional memory allocations and memory management. The primary upside is that it makes debugging SUPER simple. I would never use this approach on a high-performance system because of the overhead, but for my MVP, this is perfect. - Hydrate In Place: As you are stepping through the events, you can couple that with the structure you are trying to hydrate. The structure should represent the YAML you are parsing exactly, so stepping through should be fairly trivial. This also avoids all of the additional memory allocations that I talked about in the other approach. The primary downside is simply this: What do you do with type or structure mismatches? Do you keep on trucking and return what you can with logged errors or do you panic your application/parser?
The Intermediate Data Structure
I chose the first approach for the MVP because I am more interested in flexibility, visibility, and debuggability than I am in raw performance.
Stripping away all of the methods, a YamlNode looks like this:
const YamlNode = struct {
parent: ?*YamlNode, // Tracks the parent node if one exists
node_type: YamlNodeType, // Sets the data type of the node, STRING, INT, FLOAT, BOOL, ect.
key: ?[]u8, // Yaml is a series of K/V, and this tracks the key
value: ?[]u8, // The raw parsed value in bytes
nodes: std.ArrayList(*YamlNode), // An array of all child nodes
}
The only YamlNode
without a parent is the root node. That node will get a node_type
of .DOCUMENT
.
Otherwise, this structure allows me to iterate up and down the stack creating and attaching nodes together for an in-memory representation of the raw YAML bytes.
The only special case relates to arrays. I have a special node_type called .ARRAY_ELEMENT
, which allows me to collect all of the key/value pairs together inside of a generic container. Without the .ARRAY_ELEMENT
container, arrays would appear in this data structure as objects with a flattened key/value set making hydration impossible.
Generics
One other thing you may notice when looking at the code is the HEAVY use of generics. Generics allow me to write one set of code, which then is compiled for any type that gets passed into the generic code.
pub fn YamlParser(comptime T: type) type {}
Generics use a special Zig keyword of comptime
. This means that the compiler must know the type or types being passed into the function at compile time (not runtime). Being a strongly-typed language, this is generally pretty easy to accomplish. The only real footgun here is deriving a type based from runtime values. If you are trying to perform an if
statement based off user input to determine a type to pass into a generic function, that is purely a no-go without a bunch of additional reflection. The types must be known types while being compiled. This is a bit different than Go because Go uses garbage collection to manage memory allocation. With garbage collection, allocations and memory layout can be figured out on the fly, so typing doesn't need to be known concretely. In Zig, you manage your own memory so all comptime
types must be known at... well... comptime :)
Imagine this example:
pub fn YamlParser(comptime T: type) type {}
fn main() void {
var parser1 = YamlParser(CONCRETE_TYPE_1);
var parser2 = YamlParser(CONCRETE_TYPE_2);
}
Using generics at comptime essentially transforms the above into this when compiling:
var parser1 = struct {
parse() CONCRETE_TYPE_1 {...}
}
var parser2 = struct {
parse() CONCRETE_TYPE_2 {...}
}
The logic inside of the parser code stays exactly the same, but the types are transformed from generic types to concrete types in a one-to-many relationship (one function with many types). From there, the compiler can run all of the safety checks and compilation steps on the now concrete implementation.
You can find out a whole lot more about generics here. I am not going to pretend to be an expert when real experts exist!
The Result
The result of all of this is a function that will loop over all events and create an internal tree structure of YamlNode
structures for use in hydration. Nesting is setup inside of the YamlNode
s to help define the depth of the document, which is useful in the hydration step below.
Using our YAML example from before, here is a representative example of what the YamlNode
structure looks like leaving the Iterate Over Events
stage:
---
key: value
key2:
key3: value2
The above YAML is transformed into the struct layout below: (some fields have been omitted for brevity's sake)
var nodes = YamlNode{
.node_type = .DOCUMENT,
.nodes = []YamlNode{
{
.key = "key",
.value = "value"
.node_type = .STRING
},
{
.key = "key2",
.node_type = .OBJECT
.nodes = []YamlNode{
{
.key = "key3",
.value = "value2",
.node_type = .STRING
}
}
}
}
}
3. Hydrate Data Structure
Finally, the fun part. We have some YAML sitting in an internal tree structure that can be traversed to hydrate a known data structure for use in our application. I am going to stick with the YAML example we have been using thus far with a struct below that represents the data structure we are trying to hydrate:
const Data = struct {
key: []u8 = undefined,
key2: SubData = undefined
}
const SubData = struct {
key3: []u8 = undefined,
}
There are a few interesting facets of a nested data structure like this:
- It is nested, so we will need to know not only what the containing data structure looks like, but also what all of that structure's children look like.
- Zig does not have strings; just arrays of unsigned integers (in this case 8 bit unsigned ints -- see my previous note about UTF16). Since their sizes are unknown at compile time, we will be forced to perform some memory allocations.
- Writing a specialized function to hydrate this structure means that we would need to write a specialized function for every data structure, which violates my core principal of being as lazy as possible.
Enter Zig Generics again!
Let's Deal with the Allocations First
A programming language like Go is going to be a garbage-collected language. That means that under the covers, Go is performing memory allocations on your behalf. Once your data structures or variables lose scope and reference in the application, Go will automagically clean your memory up for you.
Yes, I realize it is not really magic. There is a TON of documentation to find online about how this magic really works.
Zig is NOT a garbage-collected language, meaning you are responsible for all of your allocations and deallocations. Because we have "strings" of unknown size that will require allocations that a user of this library will need to know about and subsequently deallocate, hydration will require one more wrapping generic that will incorporate a deinit function to deinitialize any memory created as part of the hydration process.
pub fn DeserializedYaml(comptime T: type) type {
return struct {
arena: std.heap.ArenaAllocator,
value: T,
messages: std.ArrayList([]u8),
const Self = @This();
pub fn deinit(self: *Self) void {
self.arena.deinit();
}
};
}
Above, we have another generic that will take in any type, in this case it is the Data
struct type, with an attribute to track the ArenaAllocator
, the value (which is our type), and messages (which tell us about any errors encountered during the hydration process).
Next, Let's Deal with the Generic Function
Writing a generic hydrator heavily relies on Zig's built in comptime reflection system.
Code reflection is a programming concept that involves a program inspecting, analyzing, and sometimes modifying its own structure and behavior during runtime. It is commonly used to achieve dynamic behavior in applications.
In this case we are using reflection for code introspection to examine the code's own structure, methods, and properties.
fn WalkDeserializeToStruct(comptime T: type) type {
return struct {
node: *YamlNode,
...
const Self = @This();
pub fn walk(self: *Self) !T {
switch (@typeInfo(T)) {
.array => |_| {
...
},
.bool => {
...
},
.float => {
...
},
.int => {
...
},
.pointer => |p| {
switch (p.size) {
.One => {
...
},
.Slice => {
...
},
else => |e| {
unreachable;
},
}
},
.@"struct" => |s| {
...
},
else => |_| {
unreachable;
},
}
}
};
}
This function above takes in a type, in this case Data
, and will recursively walk down both the YamlNode
s and Zig type to hydrate a data structure.
The major hurdle to jump over when doing this kind of hydration is comptime
. Normally, if I were to go about filling in a structure like this, I would take in the YamlNode
, evaluate the key, run a for loop using reflection to look for a matching structure key, then recursively re-call this function with that information to allocate and set the field value.
Because this all needs to be evaluated in comptime, the code has a few interesting cases of what appear to be extreemly inefficient for loops to accomplish this. We will look at the primary .@"struct"
case:
fn WalkDeserializeToStruct(comptime T: type) type {
return struct {
nodes: []*YamlNode,
node: *YamlNode,
...
const Self = @This();
pub fn walk(self: *Self) !T {
switch (@typeInfo(T)) {
...
.@"struct" => |s| {
const r: *T = try self.alloc.create(T);
r.* = T{};
for (self.nodes) |current_node| {
self.node = current_node;
inline for (s.fields) |field| {
if (std.mem.eql(u8, field.name, self.node.key)) {
var next = try walk_and_deserialize_to_struct(self.alloc, self.node, field.type);
@field(r.*, field.name) = try next.walk();
}
}
}
},
}
}
};
}
I will walk through what the code is doing here step by step:
-
The first things you see is
.@"struct" => |s|
. This is saying: if the generic type we got in is astruct
, then inside of the scope of the parentheses, capture it to the variables
for future use. -
Next is a constant assignment
const r: *T = try self.alloc.create(T);
. This is creating a return variable calledr
, which is a type ofT
whereT
was passed in as acomptime
var to this anonymous struct. -
Next, you see
r.* = T{}
. This is dereferencing the internal pointer thatr
is carrying and assigning it toT{}
, which is how a memory layout gets initialized in Zig. -
Next, we are going to loop over all of the nodes (
YamlNode
) to start the hydration process. -
Next, you can see an
inline for
statement. In C you will have macros, which essentially let you write code templates that the compiler will then perform replacements in your code with your macros.inline
is Zig's way of accomplishing the same thing. It is saying: for each field on the struct, duplicate the code in the compiler for each field more or less as a series of if statements. The for statement ends in|field|
, which again is a capture group for everything inside of the next set of parans to be able to access the field. -
Next, there is this odd if statement
if (std.mem.eql(u8, field.name, self.node.key)) {
. Because Zig does not have the concept of strings, we are asking Zig to compare two different arrays of bytes and tell us if they are equal. If they are, we move onto the next step.- This is where things start to look really inefficient. Loops inside of loops inside of loops are a programming 101 no-no when you want code that performs well. Remember the keyword
inline
from before: it is essentially unrolling the loop so it really isn't a loop. From there, Zig is intelligent enough to evaluate the hot paths of code that it really needs and ignore anything it doesn't from the final binary.
- This is where things start to look really inefficient. Loops inside of loops inside of loops are a programming 101 no-no when you want code that performs well. Remember the keyword
-
Finally, we get to the payoff. We set up a var called
next
that utilizes a helper functionwalk_and_deserialize_to_struct
, which just helps set up the next recursiveWalkDeserializeToStruct(comptime T: type) type
. From there, we call@field(r.*, field.name) = try next.walk();
, which will take the result of the recursion and assign it to the return variables field.- But hold on there buddy: what is
@field
?@field
is a built in Zig refleciton function that gets a specific field from a struct. Once you have the field, you can either use the value (if you know or reflect the type) or set data to the value. - Why not just use
self.node.key
instead offield.name
?? This is wherecomptime
comes into play. The compiler can understand whatfield.name
means because those are all internal concepts to the compiler.self.node.key
came from consuming code and the compiler has no idea what is going on with that var; thus, it violatescomptime
. This is the reason for theinline for
and the seemingly-inefficient loops.
- But hold on there buddy: what is
The Result
This recursive approach to hydrating a data structure means that there is now an anonymous function that will accept any data type, reflect its way through the fields, and try to match that up with the YamlNode
tree structure to produce a hydrated set of data that can be used by the application.
Neat, Right!?
I am only scratching the surface; there is still a ton of depth that I want to write about here. Feel fee to check back later for updates!
Conclusion
Aside from my professional career of writing way too much code way too fast, this is my first time throwing some code up online after learning:
- A new programming language
- How YAML really works
- What do when there is no garbage collector to save me
- Real reflection aside from just using simple
.instanceOf
in other languages.
Normally, when doing this professionally, I would open a PR and ask for feedback. In this instance, I am not able to do that, and I threw the code up on Reddit for review instaed.
Zig YAML parser
byu/wageof inZig
The response was good and one kind soul was even willing to provide a high level review for me.
Even though this was a side project. I did learn a lot from it. I think everyone should take time out and go down the road less traveled to see how things are really working under the covers.