A parser framework for Swift

I am writing a simple parser framework in Swift; mostly as a learning exercise. Specifically, I am writing a recursive descent parser. Having never been a fan of the traditional lex/yacc parser generators I first came across recursive descent in the Boost Spirit C++ library. Since then, the framework I have used most is FParsec for F#1. What I'm doing here borrows heavily from FParsec.

This post is simply going to describe the basic features and key classes of this framework, working up to a demonstration showing how we can build a simple parser for S-expressions. You can find the project's code on GitHub: https://github.com/ollie-williams/SwiftParsec . You'll see that the code in this post is simplified in order to focus on the main points and not get sidetracked by details.

At the heart of the framework is the Parser protocol

protocol Parser {
  typealias Target
  func parse(CharStream) -> Target?
}

An implementation of Parser is a unit which translates characters into an output of type Target. The CharStream class, as its name suggests, supplies a stream of unicode characters representing the input we want to parse. To save space, I'll omit details of CharStream for now. Its behavior should be pretty obvious from context. The return type of Parser.parse is an Option<Target>, used to indicate success or failure.

Basic parsers

Let's demonstrate an implementation of Parser with something simple: a constant string parser

class Constant : Parser {
  let value: String

  init(value:String) {
    self.value = value
  }

  typealias Target = String

  func parse(stream: CharStream) -> Target? {
    if stream.startsWith(value) {
      stream.advance(countElements(value))
      return value
    }
    return nil
  }
}

Constant.parse checks whether the leading characters in the stream match a specific string value. If so, it pulls those characters off the stream via the advance method and returns Some(value) indicating success. If the head of the stream doesn't match, parsing fails, the stream is left unaltered, and nil is returned to indicate this failure to the caller.

Constant is a parser which directly converts characters into a value. I refer to such parsers as basic parsers, as these are the building blocks from which more complex parsers are constructed. Another example of a basic parser is Integer:

class Integer : Parser {
  typealias Target = Int

  func parse(stream:CharStream) -> Int? {
    let regex = "[+-]?[0-9]+"
    if let match = stream.startsWithRegex(regex) {
      stream.advance(countElements(match))
      return match.toInt()
    }
    return nil
  }
}

Parser combinators

If basic parsers are the building blocks, combinators are the glue used to build them into something grander. One of the most important of these is FollowedBy

class FollowedBy<T1 : Parser, T2 : Parser> : Parser {
  let first : T1
  let second: T2

  init(first:T1, second:T2) {
    self.first = first
    self.second = second
  }

  typealias R1 = T1.Target
  typealias R2 = T2.Target
  typealias Target = (R1,R2)

  func parse(stream: CharStream) -> Target? {
    let old = stream.position
    if let a = first.parse(stream) {
      if let b = second.parse(stream) {
        return (a,b)
      }
    }
    stream.position = old
    return nil
  }
}

FollowedBy connects two parsers "in series"; to parse the whole we must successfully parse both the parts sequentially.

FParsec makes liberal use of the fact that, in F#, one can define arbitrary operators. This makes building up quite large parsers succinct and straightforward. Swift also has this facility which means I can crib even more goodness from FParsec. In the case of FollowedBy we define the infix ~>~ operator

infix operator ~>~ {associativity left precedence 140}
func ~>~ <T1: Parser, T2: Parser>(first: T1, second: T2) -> FollowedBy<T1,T2> {
  return FollowedBy(first: first, second: second)
}

Using this we can now build a parser for the string "Hello World" like this:

let h = Constant(value: "Hello")
let s = Constant(value: " ")
let w = Constant(value: "World")
let hwparser = h ~>~ s ~>~ w

func parse<T:Parser>(parser:T, string:String) -> T.Target? {
  var stream = CharStream(str: string)
  return parser.parse(stream)  
}

println(parse(hwparser, "Hello World"))
println(parse(hwparser, "Goodbye cruel world"))

which produces the output:

Optional(((Hello,  ), World))
nil

Because we declared ~>~ to be left associative the type hwparser.Target is a tuple of tuples: ((String,String), String), as this output reveals. Frequently, when connecting parsers in series, we don't want to keep both sides. In these cases there are also the operators ~> and >~ which only keep the output of the left and right hand sides, respectively.

Another useful combinator is Alternate, which connects two parsers in parallel. As with FollowedBy an alternate parser can be created via the | operator. If we define

let p1 = ...
let p2 = ...
let alt = p1 | p2

alt attempts to parse a stream by first applying p1 to it. If p1.parse returns nil p2 is attempted. For example, the program

let h = Constant(value: "Hello")
let w = Constant(value: "World")
let alt = h | w

println(parse(alt, "Hello"))
println(parse(alt, "World"))
println(parse(alt, "Goodbye"))

produces the output:

Optional("Hello")
Optional("World")
nil

Another heavily used combinator is Pipe, which translates the output of a parser via an arbitrary function. This too has an infix operator

class Pipe<T:Parser, V> : Parser {
  typealias Target = V
  typealias R = T.Target

  let parser: T
  let fn    : R -> V

  init(inner: T, fn: R -> V) {
    self.parser = inner
    self.fn = fn
  }

  func parse(stream: CharStream) -> Target? {
    if let value = parser.parse(stream) {
      return fn(value)
    }
    return nil
  }
}

infix operator |> {associativity left precedence 100}
func |> <T: Parser, V>(inner: T, fn: T.Target -> V) -> Pipe<T,V> {
  return pipe(inner, fn)
}

Repetition

Another useful way to build up parsers is via repetition. The Many operation is used to represent one or more instances of the same pattern:

class Many<T:Parser> : Parser {
  typealias Target = [T.Target]

  let body: T
  let emptyOk: Bool

  init(body:T, emptyOk:Bool) {
    self.body = body
    self.emptyOk = emptyOk
  }

  func parse(stream:CharStream) -> Target? {
    var result = Target()
    while let r = body.parse(stream) {
      result.append(r)
    }
    if !emptyOk && result.count == 0 {
      return nil
    }
    return result
  }
}

func many<T:Parser>(body:T) -> Many<T> {
  return Many(body:body, emptyOk:true)
}

func many1<T:Parser>(body:T) -> Many<T> {
  return Many(body:body, emptyOk:false)
}

Recursion

I said this is a recursive descent parser. So where's the recursion? That's the least obvious part of the puzzle. To demonstrate it, let's try to write a less trivial parser—one to parse Lisp's S-expressions. Here is an example of an S-expression:

(f (add a (g b)) a (g c))

To begin with, I'll define the target type for our parser:

class Expr {
  let symbol  : String
  let children: [Expr]

  init(symbol:String, children:[Expr]) {
    self.symbol = symbol
    self.children = children
  }

  class func makeFn(symbol:String, children:[Expr]) -> Expr {
    return Expr(symbol:symbol, children:children)
  }

  class func makeLeaf(symbol:String) -> Expr {
    return Expr(symbol:symbol, children:[])
  }
}

As you can see, Expr is a recursive type, since the children member is itself an array of Exprs. There are two types of expression. Leaves, have no children and just appear as a stand-alone identifier. In our example, a, b, and c are leaves. A function call consists of a function symbol and zero or more child expressions. (add a (g b)) is a function call, with function symbol add and the children are a and (g b).

Let's start building up a parser for Expr by building a parser for identifiers:

let skip = manychars(constchar(" "))
func idChar(c:Character) -> Bool {
  switch c {
    case "(", ")", " ":
      return false
    default:
      return true
  }
}
let identifier = many1chars(satisfy(idChar)) ~> skip

I have made use of the helper functions manychars and many1chars, as well as satisfy. For sake of space, I won't describe these here, and hope that their usage and behavior are clear from context. skip is a very primitive skipper which simply jumps over spaces. A full skipper would need to account for other types of whitespace. identifier then treats any sequence of one or more characters, which are not either a space or a parenthesis, as an identifier. The identifier also does some housekeeping for us by skipping any whitespace that follows the string we want.

Since leaf expressions are just identifiers, these are easily parsed with a pipe to perform the translation:

let leaf = identifier |> Expr.MakeLeaf

Parsing a function call is where it gets interesting, since there's a recursion. To manage this we can use the LateBound parser:

class LateBound<T>: Parser {
  typealias Target = T
  typealias ParseFunc  = CharStream -> T?

  var inner: ParseFunc?

  init() {}

  func parse(stream:CharStream) -> T? {
    if let impl = inner {
      return impl(stream)
    }
    fatalError("No inner implementation was provided for late-bound parser.")
    return nil
  }
}

LateBound enables us to define "loopy" parsers because it decouples the creation of an object satisfying Parser from the actual function implementing the parsing operation. Armed with LateBound we can complete our S-expression parser:

var expr = LateBound<Expr>()
let oparen = const("(") ~> skip
let cparen = const(")") ~> skip
let fnCall = oparen >~ pipe2(identifier, many(expr), Expr.MakeFn) ~> cparen
let choice = fnCall | leaf
expr.inner = choice.parse

To test it, let's parse an S-expression, and translate it to C-style function calls:

func cStyle(expr:Expr) -> String {
  if expr.children.count == 0 {
    return expr.symbol
  }
  var args = cStyle(expr.children[0])
  for i in 1..<expr.children.count {
    args = args + ", " + cStyle(expr.children[i])
  }
  return "\(expr.symbol)(\(args))"
}

let sexpr = "(f (add a (g b)) a (g c))"
let result = parse(expr, sexpr)
println("\(sexpr) = \(cStyle(result!))")

This produces:

(f (add a (g b)) a (g c)) = f(add(a, g(b)), a, g(c))

as we expected.


  1. FParsec is itself a port of the Parsec package for Haskell.