Parsing JSON

When we last left our hero we had built enough of a recursive descent parser to be able to parse Lisp's S-expressions. In this post I want to show you what I've added to the framework since then: some useful building blocks, including the regex leaf parser, and a demonstration of how easy it is to create a JSON parser with the framework.

As before, you can follow (or participate in) the work on GitHub: https://github.com/ollie-williams/SwiftParsec.

New building blocks

regex

Do you remember how we parsed identifiers in the last post?

func idChar(c:Character) -> Bool {
  switch c {
    case "(", ")", " ":
      return false
    default:
      return true
  }
}
let identifier = many1chars(satisfy(idChar)) ~> skip

Yuck. That is a lot of text to say something very simple.

Regular expressions (or regexes) are a great way to model a wide variety of syntactic patterns. Not only that, but regular expression matchers are a built-in component of most computing environments, including Swift. While it might not be a great idea (or even possible) to build a complex parser out of just regular expressions, they can be a very handy way of building one up. Accordingly, I have added the Regex leaf parser:

class Regex : Parser {
  typealias Target = String
  let pattern:String
  init(pattern:String) {
    self.pattern = pattern
  }
  func parse(stream:CharStream) -> Target? {
    if let match = stream.startsWithRegex(pattern) {
      stream.advance(countElements(match))
      return match
    }
    return nil
  }
}

func regex(pattern:String) -> Regex {
  return Regex(pattern:pattern)
}

This attempts to match the characters at the start of the input stream against the regular expression in pattern and returns the matching string if successful. This gives us a powerful leaf parser that can succinctly replace a lot of patterns that previously required multiple combinators, including identifier:

let identifier = regex("[^\\(\\), ]+")

I would encourage the use of regex wherever it is a natural fit, because it pushes much of the complexity onto the platform's regex matcher (which is, presumably, more mature, stable, and efficient than my infant code). Regular expression syntax is well known and well documented, and the ease with which complex patterns can be built up is another reason to prefer it when possible. For instance, the identifier parser above is only as simple as it is because it was too much hassle to do something smarter. But with regex we can go a lot further with ease. A C-like identifier syntax, for example, is parsed by:

let clike_identifier = regex("[_a-zA-Z][_a-zA-Z0-9]*")

This would have been messy to write using just Satisfy, Many, and FollowedBy.
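Incidentally, the anchored-match behaviour of the Regex leaf parser is easy to reproduce outside the framework with Foundation alone. Here's a standalone sketch in modern Swift (matchPrefix is my name for illustration, not part of the framework):

```swift
import Foundation

// Match a regex against the *start* of the input, the way the Regex
// leaf parser does at the head of the stream. The .anchored option
// pins the match to the beginning of the string.
func matchPrefix(_ pattern: String, in input: String) -> String? {
    guard let r = input.range(of: pattern,
                              options: [.regularExpression, .anchored])
    else { return nil }
    return String(input[r])
}
```

So matchPrefix("[_a-zA-Z][_a-zA-Z0-9]*", in: "foo_bar(x)") returns "foo_bar", while it returns nil for "9lives", because the anchored match must succeed at the very first character.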

sepby

I added the sepby combinator which, like a lot of this, is also stolen from FParsec. sepby is used to parse a sequence of items of type T into an array [T], where there is a known pattern separating the items. There are three cases to consider when parsing lists:

  1. An empty list, which results in an empty array.
  2. A singleton, which contains no separators, and yields an array of length 1.
  3. A list of n items, which are separated by n − 1 instances of the separator pattern.

Although we could parse this with existing operators, it would be a big, clumsy expression for such a common construction. With sepby, parsing a comma-separated list of words is straightforward:

let skip = regex("\\s*")
let word = regex("\\w+") ~> skip
let comma = const(",") ~> skip
let list = sepby(word, comma)
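To make the three cases concrete, here is a standalone sketch of the sepby logic over a deliberately minimal parser representation: a function that consumes a prefix of the input and returns a value plus the remainder, or nil on failure. (This is a toy stand-in for illustration, not the framework's Parser protocol.)

```swift
// A toy parser type: consume a prefix, return (value, rest) or nil.
typealias P<T> = (Substring) -> (T, Substring)?

// Matches a literal string.
func lit(_ s: String) -> P<String> {
    return { input in
        guard input.hasPrefix(s) else { return nil }
        return (s, input.dropFirst(s.count))
    }
}

func sepby<T, S>(_ item: @escaping P<T>, _ sep: @escaping P<S>) -> P<[T]> {
    return { input in
        // Case 1: empty list — the first item fails, yield an empty array.
        guard let (first, afterFirst) = item(input) else { return ([], input) }
        // Case 2: a singleton, if no separator follows.
        var results = [first]
        var rest = afterFirst
        // Case 3: n items separated by n - 1 separators. If a separator
        // matches but no item follows, it is left unconsumed.
        while let (_, afterSep) = sep(rest),
              let (next, afterItem) = item(afterSep) {
            results.append(next)
            rest = afterItem
        }
        return (results, rest)
    }
}
```

With this, sepby(lit("a"), lit(","))("a,a,a") yields (["a", "a", "a"], ""), and an empty input yields an empty array rather than a failure.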

FloatParser

We already saw the Integer parser in the previous post. I've also added the FloatParser:

struct FloatParser : Parser {
  typealias Target = Double
  private let strict:Bool
  static func stringToFloat(str:String) -> Double {
    return (str as NSString).doubleValue
  }
  static let impl = pipe(
        regex("[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"),
        FloatParser.stringToFloat)
  init(strict:Bool) {
    self.strict = strict
  }
  func parse(stream:CharStream) -> Target? {
    if !strict {
      return FloatParser.impl.parse(stream)
    }
    let start = stream.pos
    if let ip = Integer().parse(stream) {
      let intend = stream.pos
      stream.pos = start
      if let fp = FloatParser.impl.parse(stream) {
        if stream.pos == intend {
          stream.pos = start
          return nil
        }
        return fp
      }
    }
    return FloatParser.impl.parse(stream)
  }
}

While this is mostly straightforward, I've added the option of strict parsing, which only treats a sequence of characters as a floating-point number if it cannot also be interpreted as an integer.
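The distinction is easiest to see over a whole token. Here's a standalone sketch of the strict rule (strictFloat is my name for illustration; it uses Swift's built-in conversions and ignores the stream-position bookkeeping that the real parser does):

```swift
// Strict rule: a token only counts as a float if it is *not* also a
// valid integer — i.e. the float parse must consume more than an
// integer parse would.
func strictFloat(_ s: String) -> Double? {
    if Int(s) != nil { return nil }   // "42" is a valid integer, so reject it
    return Double(s)                  // "42.5", "1e3", "-0.5" still parse
}
```

So strictFloat("42") is nil, while strictFloat("42.5") is 42.5 — exactly the behaviour you want when integer and float literals are distinct tokens in a grammar.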

JSON parser

JavaScript object notation (JSON) is a popular format for serializing data1. With as much of the parsing framework as we now have, a JSON parser is remarkably easy to implement. First, some definitions:

typealias JSObject = [String:JSValue]
func makeObj(values:[(String,JSValue)]) -> JSObject {
  var record = JSObject()
  for (name,val) in values {
    record[name] = val
  }
  return record
}

We're going to parse JSON into a dictionary mapping string keys to values. In turn, JSValue is a discriminated union with some helper methods:

enum JSValue {
  case string(String)
  case number(Double)
  case object(JSObject)
  case array([JSValue])
  case bool(Bool)
  case null

  static func make(s:String) -> JSValue {return string(s)}
  static func make(n:Double) -> JSValue {return number(n)}
  static func make(b:Bool) -> JSValue {return bool(b)}
  static func make(o:JSObject) -> JSValue {return object(o)}
  static func makeary(a:[JSValue]) -> JSValue {return array(a)}
}

Notice that JSValue is self-referential, and that JSObject and JSValue are mutually recursive. It should therefore come as no surprise that the implementation of JSParser makes use of LateBound to deal with these loopy constructs:

struct JSParser : Parser {
  static let skip = regex("\\s*")
  static let dquote = const("\"")
  static let ocurly = const("{") ~> skip
  static let ccurly = const("}") ~> skip
  static let obrack = const("[") ~> skip
  static let cbrack = const("]") ~> skip
  static let comma = const(",") ~> skip
  static let colon = const(":") ~> skip

  static let string = dquote >~ regex("[^\"]*") ~> dquote ~> skip
  static let stringval = string |> JSValue.make
  static let number = FloatParser(strict:false) |> JSValue.make
  static let object = LateBound<JSObject>()
  static let objval = object |> JSValue.make
  static let array = LateBound<JSValue>()
  static let bool = (const(true) | const(false)) |> JSValue.make
  static let null = token(const("null"), JSValue.null)
  static let value = (null | objval | array | bool | stringval | number) ~> skip

  static let pair = string ~>~ (colon >~ value)
  static let objimpl = ocurly >~ sepby(pair, comma) ~> ccurly |> makeObj
  static let arrayimpl = obrack >~ sepby(value, comma) ~> cbrack |> JSValue.makeary

  static func parse(stream:CharStream) -> JSObject? {
    // Close the loops on the recursive parts
    object.inner = objimpl.parse
    array.inner = arrayimpl.parse
    return object.parse(stream)
  }

  typealias Target = JSObject
  func parse(stream:CharStream) -> Target? {
    return JSParser.parse(stream)
  }
}

and that's it!
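If LateBound looks mysterious, the idea can be sketched in a few lines: it's a parser whose implementation lives in a mutable slot that gets filled in after the mutually-recursive definitions exist. Here's a standalone toy version (my own simplified type, not the framework's, using the same minimal function-style parsers as before) tying the knot for a trivially recursive grammar of nested parentheses:

```swift
// Toy version of the LateBound idea: a parser whose implementation is
// a mutable slot, filled in once the recursive definitions it
// participates in actually exist.
final class Late<T> {
    var inner: ((Substring) -> (T, Substring)?)?
    func parse(_ input: Substring) -> (T, Substring)? {
        return inner!(input)   // crashes if the loop was never closed
    }
}

// depth ::= "(" depth ")" | ε  — the rule refers to itself, so we
// declare the parser first and supply its body afterwards.
let parens = Late<Int>()
parens.inner = { input in
    guard input.first == "(" else { return (0, input) }
    guard let (d, rest) = parens.parse(input.dropFirst()),
          rest.first == ")" else { return nil }
    return (d + 1, rest.dropFirst())
}
```

Here parens.parse("((()))") yields a nesting depth of 3. JSParser closes its two loops the same way: by assigning objimpl.parse and arrayimpl.parse into the LateBound slots before parsing begins.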

Next

Hopefully this JSON example has demonstrated some of the power of this approach. However, there are two big gaps in the framework's capabilities right now. The first is the ability to parse infix notation. Although it is possible to parse infix with a vanilla recursive descent parser, it's painful. If this parsing framework is to be a serious contender for implementing a real programming language, we'll need to be able to handle infix (and postfix) notation, as well as the prefix-heavy style that RD is so good at. The second omission is that there is no error handling. If there's an error in an input stream, we simply return nil. That isn't good enough, especially if we aspire to being the front end of a compiler. How might we add useful error handling?

If I can reach the keyboard over my distended abdomen in the next week, I'll get on to addressing these issues.


  1. It does use a lot of double quotes, which is a pain, but at least it's not XML.