Shambles Software
How to write a JSON-like parser with templates

I’m writing this on the off-chance that you despise yourself and want to become riddled with templates, but also parse some JSON-like data at the same time.

wat why??

Well, there are a number of reasons you might want to do this. Here’s a breakdown of my motivations:

  • I needed to import data structures of various kinds into my latest project, and I am not a fan of writing different parsers or layers on top of loosely typed parsers. I wanted something strong and powerful like a muscleman.
  • A parser that exists entirely inside a type system is pleasurably extensible. Any additional features get blasted back to all existing parsers.
  • This kind of stuff tickles my fancy.

That’s all the motivation I needed to get my parse on. I also updated my compiler/build to use C++11 because this sort of thing is deeply unpleasant without variadic templates.

Let’s tokenize!

The first step to parsing is to tokenize. There are plenty of existing tokenizers out there, but that’s irrelevant because there are also plenty of existing DSLs out there and all this has been done before better. So don’t think about that and write a tokenizer. It’s incredibly easy! Here’s what I did, roughly.

class Tokenizer
{
  // private member variables

public:
  Tokenizer(const std::string& text);
  ~Tokenizer();

  void skipWhitespace(); // moves to next token

  bool accept(const std::string& token);
  void expect(const std::string& token); // exception if token not found

  template<typename T>
  T extractToken()
  {
    // interpret the next token as type T or exception
  }
};

So far, so good. The tokenizer has a read point over the text, which is given in the constructor. ‘skipWhitespace’ moves the read point to the next interesting thing after all the spaces, tabs, returns, comments, etc… ‘accept’ checks if the next token matches a string and if it does, it moves the read point past it and returns true. ‘expect’ calls ‘accept’ then potentially throws a hissy fit.

'extractToken' interprets the tokens as types. I simply made it pull out the token as a string (by calling extractToken<std::string>) and then used std::istringstream to pull a type out. That's a reasonably safe way of handling rubbish input, and it's always possible to make it strong and aggressive later.

For tokens, I simply used string matching, but it’s entirely plausible to stuff enums in here for different token types and add some tools or supporting classes for matching tokens.
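To make the description above a bit more concrete, here is a minimal sketch of what such a tokenizer could look like. It is not the original implementation: the member names, the "//" comment syntax, the set of delimiter characters and the use of std::runtime_error are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

class Tokenizer
{
  std::string text_;
  std::size_t pos_; // the read point

  // Hypothetical helper: characters that end a value token in this sketch.
  static bool isDelimiter(char c)
  {
    return std::isspace(static_cast<unsigned char>(c))
        || std::string(",:{}[]").find(c) != std::string::npos;
  }

public:
  explicit Tokenizer(const std::string& text) : text_(text), pos_(0) {}

  // Moves the read point past whitespace and (assumed syntax) "//" line comments.
  void skipWhitespace()
  {
    for (;;)
    {
      while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_])))
        ++pos_;
      if (text_.compare(pos_, 2, "//") != 0)
        return;
      pos_ = text_.find('\n', pos_);
      if (pos_ == std::string::npos)
        pos_ = text_.size();
    }
  }

  // Consumes 'token' and returns true if it comes next; otherwise only whitespace is skipped.
  bool accept(const std::string& token)
  {
    skipWhitespace();
    if (text_.compare(pos_, token.size(), token) != 0)
      return false;
    pos_ += token.size();
    return true;
  }

  // Same as accept, but a missing token is an error.
  void expect(const std::string& token)
  {
    if (!accept(token))
      throw std::runtime_error("expected '" + token + "'");
  }

  // Reads characters up to the next delimiter and lets std::istringstream convert them.
  template<typename T>
  T extractToken()
  {
    skipWhitespace();
    const std::size_t start = pos_;
    while (pos_ < text_.size() && !isDelimiter(text_[pos_]))
      ++pos_;
    std::istringstream stream(text_.substr(start, pos_ - start));
    T value;
    if (!(stream >> value))
      throw std::runtime_error("could not interpret token");
    return value;
  }
};

void tokenizerDemo()
{
  Tokenizer t("// two numbers\n{ 42 , 5.5 }");
  t.expect("{");
  const int i = t.extractToken<int>();
  assert(i == 42);
  const bool closedEarly = t.accept("}"); // a comma comes first, so this is false
  assert(!closedEarly);
  t.expect(",");
  const double d = t.extractToken<double>();
  assert(d == 5.5);
  t.expect("}");
}
```

This version is lenient in the same way the text describes: a token like "5.3" read as an int quietly becomes 5. Tightening that up is one of the things that can be made strong and aggressive later.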

Wrapping the tokenizer up

It’s great we have a tokenizer (woo) but don’t start honking your own horn yet. We’ll need to create a type that uses the tokenizer. Why? Because this is exactly the sort of thing you do with templates.

template<typename TYPE>
struct Element
{
  typedef TYPE target_type;

  static void parse(TYPE& target, Tokenizer& t)
  {
    target = t.extractToken<TYPE>();
  }
};

Ooookay, that seems kind of pointless, but it’s not. We can now declare types that are parsers for most built-in types. YES. So Element<int> is a parser which sets an int to the contents of some text which has an int in it, uh… Look, it’ll all come together when this stuff is combined into bigger parts. Which is next.

Data structures

Parsing a data structure is a little less straightforward. You have a struct type, and it has members, each of which has its own type. That means each member has its own parser, whether that is an Element<T>, a parser for another structure or something else. I decided to use two types for this: Structure, to describe a struct in terms of its members, and Member, to encapsulate one member of a struct. Let’s take a look at Member first.

template<typename CLASS,
         typename TYPE,
         TYPE CLASS::*MEMBER,
         typename PARSER = Element<TYPE>>
struct Member
{
  static void parseMember(CLASS& target, Tokenizer& t)
  {
    TYPE& type = target.*MEMBER;
    PARSER::parse(type, t);
  }
};

You can use a pointer to member variable as a template parameter because it is a compile-time constant; at its heart it is just an offset. It’s typed in two ways, though: by the class and by the type of the member variable. That means the Member template parameters are quite bulky, but don’t worry, we can macro the pants off it later.
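If pointers to members as template parameters are unfamiliar, this small standalone sketch shows the mechanism on its own, without any parsing. FieldAccess and the demo function are hypothetical names used only for this illustration.

```cpp
#include <cassert>

struct Example
{
  int i;
  double d;
};

// MEMBER is a compile-time constant naming one field of CLASS,
// so each instantiation is hard-wired to a single member.
template<typename CLASS, typename TYPE, TYPE CLASS::*MEMBER>
struct FieldAccess
{
  static TYPE& get(CLASS& target)
  {
    return target.*MEMBER; // same expression Member uses to reach its field
  }
};

typedef FieldAccess<Example, int, &Example::i> ExampleI;
typedef FieldAccess<Example, double, &Example::d> ExampleD;

void fieldAccessDemo()
{
  Example e = {0, 0.0};
  ExampleI::get(e) = 7;
  ExampleD::get(e) = 1.5;
  assert(e.i == 7);
  assert(e.d == 1.5);
}
```

Note that the class and the member type have to be spelled out before the pointer itself can be given, which is exactly the bulk the macros later on are meant to hide.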

I have not included a typedef for the CLASS parameter, and the parse function is called ‘parseMember’. This doesn’t fit the same pattern as Element. That’s because this template can’t be at the top level of a parser, it has to belong to a Structure, so I’ve deliberately phrased it a little differently.

Note that by default it assumes the parser is going to be an Element<T> type.

And let’s see what a structure looks like…

template<typename TARGET,
         typename... MEMBERS>
class Structure
{
public:
  typedef TARGET target_type;

  static void parse(TARGET& target, Tokenizer& t)
  {
    // ... implementation ...
  }
};

Okay, I left out most of the implementation because I’m busting out the variadic templates now and it might be a good time to recap how this stuff will be used. Here’s an example of creating a parser for a struct:

struct Example
{
  int i;
  double d;
};

typedef Structure<Example,
                  Member<Example, int, &Example::i>,
                  Member<Example, double, &Example::d>
                 > ExampleParser;

That typedefs a little parser type that will be able to interpret something like “{2, 5.3}” and plop it right into the struct. And of course, ExampleParser can be used as a fourth parameter to Member so structs containing Example are eligible for parsers too. The power! It’s growing!

Let’s get our hands a bit dirtier. How might the Structure template work internally? Thanks to variadic templates, it’s actually quite a pleasing experience.

template<typename TARGET,
         typename... MEMBERS>
class Structure
{
  template<typename MEMBER>
  static void parseMember(TARGET& target, Tokenizer& t)
  {
    MEMBER::parseMember(target, t);
  }

  template<typename M1, typename M2, typename... OTHER_MEMBERS>
  static void parseMember(TARGET& target, Tokenizer& t)
  {
    parseMember<M1>(target, t);
    t.expect(",");
    parseMember<M2, OTHER_MEMBERS...>(target, t);
  }

public:
  typedef TARGET target_type;

  static void parse(TARGET& target, Tokenizer& t)
  {
    t.expect("{");
    parseMember<MEMBERS...>(target, t);
    t.expect("}");
  }
};

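That’s deceptively simple. The key part is that a parameter pack is allowed to be empty: when only two members are left, OTHER_MEMBERS is an empty list and magically disappears, so the final call lands on the single-member overload and the recursion ends. (A Structure with no members at all will not compile with this code, because neither overload accepts an empty list.) The Structure expects ‘{’ and ‘}’ around the members, and expects a comma between any two members. Fantastic.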
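Here is everything so far bolted together into one compilable sketch, including a struct nested inside another struct. The Tokenizer is a cut-down stand-in written for this example (its internals are assumptions, not the original), and Outer and structureDemo are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Cut-down stand-in for the article's Tokenizer (no comments, no quoted strings).
class Tokenizer
{
  std::string text_;
  std::size_t pos_; // the read point
public:
  explicit Tokenizer(const std::string& text) : text_(text), pos_(0) {}
  void skipWhitespace()
  {
    while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
  }
  bool accept(const std::string& token)
  {
    skipWhitespace();
    if (text_.compare(pos_, token.size(), token) != 0) return false;
    pos_ += token.size();
    return true;
  }
  void expect(const std::string& token)
  {
    if (!accept(token)) throw std::runtime_error("expected '" + token + "'");
  }
  template<typename T>
  T extractToken()
  {
    skipWhitespace();
    const std::size_t end = text_.find_first_of(" \t\r\n,:{}[]", pos_);
    std::istringstream stream(text_.substr(pos_, end - pos_));
    pos_ = (end == std::string::npos) ? text_.size() : end;
    T value;
    if (!(stream >> value)) throw std::runtime_error("could not interpret token");
    return value;
  }
};

template<typename TYPE>
struct Element
{
  typedef TYPE target_type;
  static void parse(TYPE& target, Tokenizer& t) { target = t.extractToken<TYPE>(); }
};

template<typename CLASS, typename TYPE, TYPE CLASS::*MEMBER, typename PARSER = Element<TYPE>>
struct Member
{
  static void parseMember(CLASS& target, Tokenizer& t) { PARSER::parse(target.*MEMBER, t); }
};

template<typename TARGET, typename... MEMBERS>
class Structure
{
  // One member left: the recursion stops here.
  template<typename MEMBER>
  static void parseMember(TARGET& target, Tokenizer& t) { MEMBER::parseMember(target, t); }

  // Two or more members left: parse one, expect a comma, recurse on the rest.
  template<typename M1, typename M2, typename... OTHER_MEMBERS>
  static void parseMember(TARGET& target, Tokenizer& t)
  {
    parseMember<M1>(target, t);
    t.expect(",");
    parseMember<M2, OTHER_MEMBERS...>(target, t);
  }
public:
  typedef TARGET target_type;
  static void parse(TARGET& target, Tokenizer& t)
  {
    t.expect("{");
    parseMember<MEMBERS...>(target, t);
    t.expect("}");
  }
};

struct Example { int i; double d; };
struct Outer { int id; Example inner; }; // hypothetical struct containing an Example

typedef Structure<Example, Member<Example, int, &Example::i>,
                  Member<Example, double, &Example::d>> ExampleParser;
// ExampleParser goes in as the fourth Member parameter to handle the nested struct.
typedef Structure<Outer, Member<Outer, int, &Outer::id>,
                  Member<Outer, Example, &Outer::inner, ExampleParser>> OuterParser;

void structureDemo()
{
  Outer o = Outer();
  Tokenizer t("{7, {2, 5.5}}");
  OuterParser::parse(o, t);
  assert(o.id == 7 && o.inner.i == 2 && o.inner.d == 5.5);
}
```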

A little bit more

The next step would be arrays. They take a form like “[6, 0, 3, 1, 7, 6, 9]”, with an unknown number of elements of matching type. They’ll need to go into a std::vector or a std::deque or something, so the Array parser will have to be container agnostic. Each container is a different type, but so are our parsers so that kind of works okay. Usage of ‘accept’ and ‘expect’ from the tokenizer makes this task fairly sane.

template<typename TARGET,
         typename PARSER = Element<typename TARGET::value_type>>
struct Array
{
  typedef TARGET target_type;

  static void parse(TARGET& target, Tokenizer& t)
  {
    t.expect("[");
    if (t.accept("]"))
      return; // empty array

    do
    {
      typename TARGET::value_type value;
      PARSER::parse(value, t);
      target.push_back(value);
    } while (t.accept(","));

    t.expect("]");
  }
};

No variadic magic this time: we don’t know how many elements there will be at compile time, so we have to tokenize the input and find out. The container has to support push_back and have the value_type typedef, which the standard sequence containers std::vector, std::deque and std::list all do.
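As a quick check of the container-agnostic claim, this sketch runs the same Array template over a std::vector, a std::deque and a nested vector of vectors. The Tokenizer is again a cut-down stand-in with assumed internals, and arrayDemo is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <deque>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Cut-down stand-in for the article's Tokenizer (no comments, no quoted strings).
class Tokenizer
{
  std::string text_;
  std::size_t pos_; // the read point
public:
  explicit Tokenizer(const std::string& text) : text_(text), pos_(0) {}
  void skipWhitespace()
  {
    while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
  }
  bool accept(const std::string& token)
  {
    skipWhitespace();
    if (text_.compare(pos_, token.size(), token) != 0) return false;
    pos_ += token.size();
    return true;
  }
  void expect(const std::string& token)
  {
    if (!accept(token)) throw std::runtime_error("expected '" + token + "'");
  }
  template<typename T>
  T extractToken()
  {
    skipWhitespace();
    const std::size_t end = text_.find_first_of(" \t\r\n,:{}[]", pos_);
    std::istringstream stream(text_.substr(pos_, end - pos_));
    pos_ = (end == std::string::npos) ? text_.size() : end;
    T value;
    if (!(stream >> value)) throw std::runtime_error("could not interpret token");
    return value;
  }
};

template<typename TYPE>
struct Element
{
  typedef TYPE target_type;
  static void parse(TYPE& target, Tokenizer& t) { target = t.extractToken<TYPE>(); }
};

template<typename TARGET, typename PARSER = Element<typename TARGET::value_type>>
struct Array
{
  typedef TARGET target_type;
  static void parse(TARGET& target, Tokenizer& t)
  {
    t.expect("[");
    if (t.accept("]"))
      return; // empty array
    do
    {
      typename TARGET::value_type value;
      PARSER::parse(value, t);
      target.push_back(value);
    } while (t.accept(","));
    t.expect("]");
  }
};

void arrayDemo()
{
  std::vector<int> ints;
  Tokenizer t1("[6, 0, 3, 1, 7, 6, 9]");
  Array<std::vector<int>>::parse(ints, t1);
  assert(ints.size() == 7 && ints.front() == 6 && ints.back() == 9);

  std::deque<double> doubles;
  Tokenizer t2("[ 0.5, 1.5 ]");
  Array<std::deque<double>>::parse(doubles, t2);
  assert(doubles.size() == 2 && doubles[1] == 1.5);

  // An Array parser as the element parser gives arrays of arrays.
  std::vector<std::vector<int>> grid;
  Tokenizer t3("[[1, 2], [], [3]]");
  Array<std::vector<std::vector<int>>, Array<std::vector<int>>>::parse(grid, t3);
  assert(grid.size() == 3 && grid[0].size() == 2 && grid[1].empty() && grid[2][0] == 3);
}
```

One thing worth knowing: parse appends to whatever is already in the container, so clear it first if you are reusing one.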

That’s enough of this. I made a few more different parser components for myself, but I’m not going to go over them. I made a Nullable component that would accept “null” for a pointer and make it zero instead of parsing a target. I also made an associative container type which uses JSON-style ‘key:value’ syntax. It’s very similar to the Array type.

I’ll leave those as exercises for the reader (meaning: this is already long enough).
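For anyone attempting the exercise, here is one way the associative container might go. This is a guess at the shape, not the original code: the name Map, the ‘{key: value, ...}’ syntax, the unquoted keys and the use of operator[] are all assumptions, and the Tokenizer is a cut-down stand-in.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Cut-down stand-in for the article's Tokenizer (no comments, no quoted strings).
class Tokenizer
{
  std::string text_;
  std::size_t pos_; // the read point
public:
  explicit Tokenizer(const std::string& text) : text_(text), pos_(0) {}
  void skipWhitespace()
  {
    while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
  }
  bool accept(const std::string& token)
  {
    skipWhitespace();
    if (text_.compare(pos_, token.size(), token) != 0) return false;
    pos_ += token.size();
    return true;
  }
  void expect(const std::string& token)
  {
    if (!accept(token)) throw std::runtime_error("expected '" + token + "'");
  }
  template<typename T>
  T extractToken()
  {
    skipWhitespace();
    const std::size_t end = text_.find_first_of(" \t\r\n,:{}[]", pos_);
    std::istringstream stream(text_.substr(pos_, end - pos_));
    pos_ = (end == std::string::npos) ? text_.size() : end;
    T value;
    if (!(stream >> value)) throw std::runtime_error("could not interpret token");
    return value;
  }
};

template<typename TYPE>
struct Element
{
  typedef TYPE target_type;
  static void parse(TYPE& target, Tokenizer& t) { target = t.extractToken<TYPE>(); }
};

// Hypothetical associative parser; the structure mirrors Array.
template<typename TARGET,
         typename KEY_PARSER = Element<typename TARGET::key_type>,
         typename VALUE_PARSER = Element<typename TARGET::mapped_type>>
struct Map
{
  typedef TARGET target_type;
  static void parse(TARGET& target, Tokenizer& t)
  {
    t.expect("{");
    if (t.accept("}"))
      return; // empty map
    do
    {
      typename TARGET::key_type key;
      typename TARGET::mapped_type value;
      KEY_PARSER::parse(key, t);
      t.expect(":");
      VALUE_PARSER::parse(value, t);
      target[key] = value; // a repeated key overwrites the earlier value
    } while (t.accept(","));
    t.expect("}");
  }
};

void mapDemo()
{
  std::map<std::string, int> size;
  Tokenizer t("{ width: 640, height: 480 }");
  Map<std::map<std::string, int>>::parse(size, t);
  assert(size.size() == 2 && size["width"] == 640 && size["height"] == 480);
}
```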

This stuff is unusable

Yes, the shame of it. As anyone could have guessed, building a parser out of nested types makes a massively unwieldy mess of types. How can this be cleaned up? And how do we even use this in a sensible way? I’ll answer the second question first.

void resourceToText(std::string& text, const std::string& resource);

template<typename PARSER>
bool parse(typename PARSER::target_type& target, const std::string& resource)
{
  std::string text;
  resourceToText(text, resource);

  Tokenizer t(text);

  try
  {
    PARSER::parse(target, t);
  }

  catch (const std::exception& e) // stand-in: swap in your tokenizer's exception type
  {
    // Output the exception details showing where the parsing failed
    return false;
  }

  return true;
}

That’s quite simple. It just pulls the target_type out of the parser and combines it with whatever resource loading you do, and the exceptions the tokenizer can throw get dealt with in one sensible place. Now to parse a structure, just call parse<ParseType>(myType, "data/myType.blah"); and you’re done.
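A runnable sketch of that wrapper follows. The resource loader here is a hypothetical stub that serves canned text instead of reading files, the resource names are made up, the Tokenizer is a cut-down stand-in, and catching std::exception is an assumption about how the tokenizer reports errors.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <exception>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Cut-down stand-in for the article's Tokenizer (no comments, no quoted strings).
class Tokenizer
{
  std::string text_;
  std::size_t pos_; // the read point
public:
  explicit Tokenizer(const std::string& text) : text_(text), pos_(0) {}
  void skipWhitespace()
  {
    while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
  }
  bool accept(const std::string& token)
  {
    skipWhitespace();
    if (text_.compare(pos_, token.size(), token) != 0) return false;
    pos_ += token.size();
    return true;
  }
  void expect(const std::string& token)
  {
    if (!accept(token)) throw std::runtime_error("expected '" + token + "'");
  }
  template<typename T>
  T extractToken()
  {
    skipWhitespace();
    const std::size_t end = text_.find_first_of(" \t\r\n,:{}[]", pos_);
    std::istringstream stream(text_.substr(pos_, end - pos_));
    pos_ = (end == std::string::npos) ? text_.size() : end;
    T value;
    if (!(stream >> value)) throw std::runtime_error("could not interpret token");
    return value;
  }
};

template<typename TYPE>
struct Element
{
  typedef TYPE target_type;
  static void parse(TYPE& target, Tokenizer& t) { target = t.extractToken<TYPE>(); }
};

// Hypothetical loader: a real one would read a file; this one serves canned text.
void resourceToText(std::string& text, const std::string& resource)
{
  if (resource == "data/answer.blah") text = "42";
  else if (resource == "data/broken.blah") text = "forty-two";
  else text.clear();
}

template<typename PARSER>
bool parse(typename PARSER::target_type& target, const std::string& resource)
{
  std::string text;
  resourceToText(text, resource);

  Tokenizer t(text);
  try
  {
    PARSER::parse(target, t);
  }
  catch (const std::exception& e) // assumed base class for tokenizer errors
  {
    std::cerr << resource << ": " << e.what() << '\n';
    return false;
  }
  return true;
}

void parseDemo()
{
  int answer = 0;
  const bool ok = parse<Element<int>>(answer, "data/answer.blah");
  assert(ok && answer == 42);
  const bool bad = parse<Element<int>>(answer, "data/broken.blah");
  assert(!bad && answer == 42); // in this sketch a failed Element parse leaves the target alone
}
```

The caller only ever names the parser type; the target type falls out of PARSER::target_type, so passing the wrong kind of object is a compile error rather than a runtime surprise.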

The only remaining problem is how to typedef the parser types without becoming hopelessly lost in the depths of template hell. Well, I saved the worst for last: Jam everything full of macros and pretend nobody saw.

Okay, so there’s actually a pretty clever trick you can do with pretend functions (declared but never defined) and C++11’s decltype operator to decompose a pointer to member into its respective class and type. Let’s have a quick peek at that.

template<typename CLASS, typename TYPE>
CLASS extractClass(TYPE CLASS::*);

template<typename CLASS, typename TYPE>
TYPE extractType(TYPE CLASS::*);

#define MEMBER(x) Member<decltype(extractClass(x)), decltype(extractType(x)), x>
#define MEMBER2(x, y) Member<decltype(extractClass(x)), decltype(extractType(x)), x, y>
#define ARRAY(x) Member<decltype(extractClass(x)), decltype(extractType(x)), x, Array<decltype(extractType(x))>>
#define ARRAY2(x, y) Member<decltype(extractClass(x)), decltype(extractType(x)), x, Array<decltype(extractType(x)), y>>

Yikes. But this means instead of doing Member<Example, int, &Example::i> we can just do MEMBER(&Example::i). And instead of Member<Example, ExampleChild, &Example::child, ChildParser>, a simple MEMBER2(&Example::child, ChildParser) will suffice. The Array versions work similarly. I’m sure that was perfectly clear.
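If it wasn’t perfectly clear, this standalone sketch lets the compiler confirm it. The Member here is a body-less stand-in (just enough to compare types, with the optional parser parameter left out), and the static_asserts are additions for illustration.

```cpp
#include <type_traits>

// Declared but never defined: they only ever appear inside decltype.
template<typename CLASS, typename TYPE>
CLASS extractClass(TYPE CLASS::*);

template<typename CLASS, typename TYPE>
TYPE extractType(TYPE CLASS::*);

// Body-less stand-in for the article's Member, enough to compare types.
template<typename CLASS, typename TYPE, TYPE CLASS::*MEMBER>
struct Member {};

#define MEMBER(x) Member<decltype(extractClass(x)), decltype(extractType(x)), x>

struct Example
{
  int i;
  double d;
};

// Template argument deduction splits the pointer to member into its two halves.
static_assert(std::is_same<decltype(extractClass(&Example::d)), Example>::value,
              "the class is recovered");
static_assert(std::is_same<decltype(extractType(&Example::d)), double>::value,
              "the member type is recovered");

// The macro spells out exactly the type we would otherwise write by hand.
static_assert(std::is_same<MEMBER(&Example::i), Member<Example, int, &Example::i>>::value,
              "macro and longhand agree");
```

Because decltype never evaluates its operand, the two functions need no definitions and nothing ends up in the binary.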

That makes our typedefs a lot cleaner. Here’s a real (fake) example of one I used.

struct ParticleFrame
{
  int frame;
  int delay;
};

typedef Structure<ParticleFrame,
                  MEMBER(&ParticleFrame::frame),
                  MEMBER(&ParticleFrame::delay)
                 > ParticleFrameParser;

struct Particle
{
  int delayMultiplier;
  std::vector<ParticleFrame> frames;
};

typedef Structure<Particle,
                  MEMBER(&Particle::delayMultiplier),
                  ARRAY2(&Particle::frames, ParticleFrameParser)
                 > ParticleParser;

// later on..
  Particle particle;
  if (parse<ParticleParser>(particle, assetFilename))
    d->particles.push_back(particle);

I think that’s pretty sweet, actually.