User:Duplode/Arrow parser rewrite

Can it be done better?

Swierstra and Duponcheel (1996) noticed that, in some use cases, a smarter parser could immediately fail upon seeing the very first character. For example, in the nums parser above, the choice of first letter parsers was limited to either the letter 'o' for "one" or the letter 't' for both "two" and "three". This smarter parser would also be able to garbage collect input sooner because it could look ahead to see if any other parsers might be able to consume the input, and drop input that could not be consumed. This new parser is a lot like the monadic parsers with the major difference that it exports static information. It's like a monad, but it also tells you what it can parse.

There's one major problem. This doesn't fit into the Monad interface. Monadic composition works with (a -> m b) functions, and functions alone. There's no way to attach static information. You have only one choice, throw in some input, and see if it passes or fails.

Back when this issue first arose, the monadic interface was being touted as a completely general purpose tool in the functional programming community, so finding that there was some particularly useful code that just couldn't fit into that interface was something of a setback. This is where arrows come in. John Hughes's Generalising monads to arrows proposed the arrows abstraction as a new, more flexible tool.

Static and dynamic parsers

Let us examine Swierstra and Duponcheel's parser in greater detail, from the perspective of arrows as presented by Hughes. The parser has two components: a fast, static parser which tells us if the input is worth trying to parse; and a slow, dynamic parser which does the actual parsing work.

import Control.Arrow
import qualified Control.Category as Cat
import Data.List (union)

data Parser s a b = P (StaticParser s) (DynamicParser s a b)
data StaticParser s = SP Bool [s]
newtype DynamicParser s a b = DP ((a, [s]) -> (b, [s]))

The static parser consists of a flag, which tells us if the parser can accept the empty input, and a list of possible starting characters. For example, the static parser for a single character would be as follows:

spCharA :: Char -> StaticParser Char
spCharA c = SP False [c]

It does not accept the empty string (False) and the list of possible starting characters consists only of c.

The dynamic parser needs a little more dissecting. What we see is a function that goes from (a, [s]) to (b, [s]). It is useful to think in terms of sequencing two parsers: each parser consumes the result of the previous parser (a), along with the remaining bits of input stream ([s]), it does something with a to produce its own result b, consumes a bit of string and returns that. So, as an example of this in action, consider a dynamic parser (Int, String) -> (Int, String), where the Int represents a count of the characters parsed so far. The table below shows what would happen if we sequence a few of them together and set them loose on the string "cake" :

	result	remaining
before	0	cake
after first parser	1	ake
after second parser	2	ke
after third parser	3	e

So the point here is that a dynamic parser has two jobs : it does something to the output of the previous parser (informally, a -> b), and it consumes a bit of the input string, (informally, [s] -> [s]), hence the type DP ((a,[s]) -> (b,[s])). Now, in the case of a dynamic parser for a single character (type (Char, String) -> (Char, String)), the first job is trivial. We ignore the output of the previous parser, return the character we have parsed and consume one character off the stream:

dpCharA :: Char -> DynamicParser Char Char Char
dpCharA c = DP (\(_,x:xs) -> (x,xs))

This might lead you to ask a few questions. For instance, what's the point of accepting the output of the previous parser if we're just going to ignore it? And shouldn't the dynamic parser be making sure that the current character off the stream matches the character to be parsed by testing x == c? The answer to the second question is no − and in fact, this is part of the point: the work is not necessary because the check would already have been performed by the static parser. Naturally, things are only so simple because we are testing just one character. If we were writing a parser for several characters in sequence we would need dynamic parsers that actually tested the second and further characters; and if we wanted to build an output string by chaining several parsers of characters then we would need the output of previous parsers.

Time to put both parsers together. Here is our S+D style parser for a single character:

charA :: Char -> Parser Char Char Char
charA c = P (SP False [c]) (DP (\(_,x:xs) -> (x,xs)))

Bringing the arrow combinators in

With the preliminary bit of exposition done, we are now going to implement the Arrow class for Parser s, and by doing so, give you a glimpse of what makes arrows useful. So let's get started:

-- We explain the Eq s constraint below.
instance Eq s => Arrow (Parser s) where

arr should convert an arbitrary function into a parsing arrow. In this case, we have to use "parse" in a very loose sense: the resulting arrow accepts the empty string, and only the empty string (its set of first characters is []). Its sole job is to take the output of the previous parsing arrow and do something with it. That being so, it does not consume any input.

    arr f = P (SP True []) (DP (\(b,s) -> (f b,s)))

Likewise, the first combinator is relatively straightforward. Given a parser, we want to produce a new parser that accepts a pair of inputs (b,d). The first component of the input b, is what we actually want to parse. The second part passes through untouched:

    first (P sp (DP p)) = P sp (DP (\((b,d),s) ->
        let (c, s') = p (b,s)
        in  ((c,d),s')))

We also have to supply the Category instance. id is entirely obvious, as id = arr id must hold:

instance Eq s => Cat.Category (Parser s) where
    id = P (SP True []) (DP (\(b,s) -> (b,s)))
    -- Or simply: id = P (SP True []) (DP id)

On the other hand, the implementation of (.) requires a little more thought. We want to take two parsers, and return a combined parser incorporating the static and dynamic parsers of both arguments:

    -- The Eq s constraint is needed for using union here.
    (P (SP empty1 start1) (DP p2)) .
      (P (SP empty2 start2) (DP p1)) =
        P (SP (empty1 && empty2)
              (if not empty1 then start1 else start1 `union` start2))
          (DP (p2.p1))

Combining the dynamic parsers is easy enough; we just do function composition. Putting the static parsers together requires a little bit of thought. First of all, the combined parser can only accept the empty string if both parsers do. Fair enough, now how about the starting symbols? Well, the parsers are supposed to be in a sequence, so the starting symbols of the second parser shouldn't really matter. If life were simple, the starting symbols of the combined parser would only be start1. Alas, life is not simple, because parsers could very well accept the empty input. If the first parser accepts the empty input, then we have to account for this possibility by accepting the starting symbols from both the first and the second parsers

So what do arrows buy us?

If you look back at our Parser type and blank out the static parser section, you might notice that this looks a lot like the arrow instances for functions.

arr f = \(b, s) -> (f b, s)
first p = \((b, d), s) ->
            let (c, s') = p (b, s)
            in  ((c, d), s'))

id = id
p2 . p1 = p2 . p1

There's the odd s variable out for the ride, which makes the definitions look a little strange, but the outline of e.g. the simple first functions is there. Actually, what you see here is roughly the arrow instance for the State monad/Kleisli morphism (let f :: b -> c, p :: b -> State s c and (.) actually be (<=<) = flip (>=>)).

That's fine, but we could have easily done that with bind in classic monadic style, with first becoming just an odd helper function that could be easily written with a bit of pattern matching. But remember, our Parser type is not just the dynamic parser − it also contains the static parser.

arr f = SP True []
first sp = sp
(SP empty1 start1) >>> (SP empty2 start2) = (SP (empty1 && empty2)
        (if not empty1 then start1 else start1 `union` start2))

This is not at all a function, it's just pushing around some data types, and it cannot be expressed in a monadic way. But the Arrow interface can deal with just as well. And when we combine the two types, we get a two-for-one deal: the static parser data structure goes along for the ride along with the dynamic parser. The Arrow interface lets us transparently compose and manipulate the two parsers, static and dynamic, as a unit, which we can then run as a traditional, unified function.