I haven’t yet started repeating myself — though there’s every chance you’ll hear it here twice — but iteration is the sort of thing one can find just one use after another for. I mean, if you’ve seen it once, you’ve seen it a thousand times: iteration delivers repeatedly. How many times have you iterated to good effect? I say again: is iteration great or what?

Ok, got that out of my system. :-) I am working on lambda encodings right now, and with Church-encoded data, every piece of data is its own iterator. So the encoding tends to make one think of algorithms in terms of iteration. We have a function f, and a starting point a, and we wish to apply f to a in a nested fashion, n times: $f^0(a) = a$ and $f^{n+1}(a) = f(f^n(a))$. To multiply numbers N and M, for example, we can iterator the function “add M” on starting point 0, N times. And other natural algorithms have iterative formulations.

What about division? Usually in total type theories (where it is required that uniform termination of every function — that is, termination on all inputs — has to be confirmed statically by some termination-checking algorithm or technique), natural-number division is implemented using well-founded recursion. The idea is that to divide x by y, we basically want to see how many times we can subtract y from x until x becomes smaller than y (at which point it is the remainder of the division). So one wants to make a recursive call to division on x – y, and since that quantity is not the predecessor of x (or y), the usual structural decrease demanded by the termination checker is not satisfied. So the usual simple schemes for observing termination statically cannot confirm that division is terminating. And indeed, if y were 0, there would be no structural decrease. So it is not a completely trivial matter. The solution one finds in Coq (Arith/Euclid.v in the standard library for Coq version 8.4) and Agda (Data/Nat/DivMod.agda in the standard library version 0.8) is to use well-founded recursion. This is a somewhat advanced method that uses a generalized inductive type to encode, effectively, all the legal terminating call sequences one could make using a given well-founded ordering. Then we can do structural recursion on an extra argument of this generalized inductive type.

Well-founded recursion is really quite cool, and it’s amazing to see the power of the type theory in the fact that well-founded recursion is derivable, not primitive, to the language. Every student of type theory should try walking through the definitions needed for well-founded recursion over, say, that natural number ordering <. But as elegant and impressive as it is, it's a pretty heavy hammer to have to get out. For starters, if you want to reason later about the function you defined by well-founded recursion, you are most likely going to have to use well-founded induction in that reasoning. So you find yourself continually setting up these somewhat complicated inductions to prove simple lemmas. A second issue is that at least in Agda, because there is no term erasure explicit in the language, if you write a function by well-founded recursion, you are going to be manipulating these values of the generalized inductive datatype at runtime. I reported earlier on this blog that in my experience this led to a major, major slowdown for running code extracted from Agda. So if you are just doing some formal development to prove a theorem, then well-founded recursion won't cause you serious problems in Agda. But if you want to extract and run code that uses well-founded recursion, you likely will see major performance issues.

In my standard library for Agda, the version of natural-number division defined by well-founded recursion is in nat-division.agda:

{- a div-result for dividend x and divisor d consists of the quotient q, remainder r, and a proof that q * d + r = x -}
div-result : ℕ → ℕ → Set
div-result x d = Σ ℕ (λ q → Σ ℕ (λ r → q * d + r ≡ x))

div-helper : ∀ (x : ℕ) → WfStructBool _<_ x → (y : ℕ) → y =ℕ 0 ≡ ff → div-result x y
div-helper x wfx 0 ()
div-helper x (WfStep fx) (suc y) _ with 𝔹-dec (x =ℕ 0)
... | inj₁ u = 0 , 0 , sym (=ℕ-to-≡ u)
... | inj₂ u with 𝔹-dec (x < (suc y))
... | inj₁ v = 0 , (x , refl)
... | inj₂ v with (div-helper (x ∸ (suc y)) (fx (∸< {x} u)) (suc y) refl)
... | q , r , p with <ff {x} v
... | p' with ∸eq-swap{x}{suc y}{q * (suc y) + r} p' p
... | p'' = (suc q) , (r , lem p'')
where lem : q * (suc y) + r + suc y ≡ x → suc (y + q * suc y + r) ≡ x
lem p''' rewrite
+suc (q * (suc y) + r) y
| +comm y (q * (suc y))
| +perm2 (q * (suc y)) r y = p'''

_÷_!_ : (x : ℕ) → (y : ℕ) → y =ℕ 0 ≡ ff → div-result x y
x ÷ y ! p = div-helper x (wf-< x) y p

This code returns a value of type div-result x y, which contains the quotient q, remainder r, and the proof that x equals y * q + r. It is not as simple as one would like, due to the use of well-founded recursion.

But we can avoid well-founded recursion for defining division, if we go back to our old friend iteration (“There he is again!” — sorry, I said I had that out of my system, but apparently not quite). Because we know that we will not possibly iterate subtraction of y from x more than x times, if y is not 0. So we can pass an extra argument in to division which is a counter, that we start out at x. Again we use the div-result type, but this time there is no need for well-founded recursion:

divh : (n : ℕ) → (x : ℕ) → (y : ℕ) → x ≤ n ≡ tt → y =ℕ 0 ≡ ff → div-result x y
divh 0 0 y p1 p2 = 0 , 0 , refl
divh 0 (suc x) y () p2
divh (suc n) x y p1 p2 with keep (x < y)
divh (suc n) x y p1 p2 | tt , pl = 0 , x , refl
divh (suc n) x y p1 p2 | ff , pl with divh n (x ∸ y) y (∸≤2 n x y p1 p2) p2
divh (suc n) x y p1 p2 | ff , pl | q , r , p = suc q , r , lem
where lem : y + q * y + r ≡ x
lem rewrite sym (+assoc y (q * y) r) | p | +comm y (x ∸ y) = ∸+2{x}{y} (<ff{x}{y} pl)

_÷_!_ : (x : ℕ) → (y : ℕ) → y =ℕ 0 ≡ ff → div-result x y
x ÷ y ! p = divh x x y (≤-refl x) p

You can find this in nat-division2.agda. The code is also a bit less cluttered with helper lemmas, although we still do need to require that x is less than or equal to n, in order to rule out the case that we run out of counter budget (n) before we are done dividing x.

This example shows that sometimes iteration is sufficient for defining functions like division whose natural definition is not structurally recursive. The moral of the story is that we should not forget about iteration. And that is a lesson worth repeating!

Well, I am embarrassed at how late I am in posting the solution to the puzzle I mentioned in my last post.  It has been a busy summer with taking care of our sweet new baby at home, and running StarExec development at work.  Anyhow, below is a graph with the minimum number of nodes which contains every legal possible combination of properties from termination (aka strong normalization), normalization (aka weak normalization), confluence (aka Church-Rosser), and local confluence (aka Weak Church-Rosser), and their negations.  This graph was found by Hans Zantema, whom I asked about this puzzle by email (he agreed to let me share his solution here).  Furthermore, he argues that 11 is the minimal number of nodes, as follows.  Out of the 16 possible combinations of properties SN, WN, CR, and WCR and their negations, we exclude immediately the combinations with SN and ~WN (since SN implies WN) and CR and ~WCR (since CR implies WCR).  So there are three legal possibilities for the values of CR and WCR, and three for the values of SN and WN.  These are independent, so there are 9 legal combinations of properties.  Now, Hans argues, since there is a node X which is SN and ~WCR, there must be two nodes which are SN and CR.  For since X is SN but not WCR, it has two children (which are still SN) which cannot be joined.  We may assume these children are themselves CR, otherwise we could repeat this observation and the graph would not be minimal.  Similarly, since there is a node which is ~WN and ~WCR, there must be two nodes which are ~WN and CR.  So there must be at least 11 nodes.  And the graph below has 11 nodes.  To test your knowledge, you can try to identify which combination of properties each node has!  Fun!

Suppose we have a graph (A,->) consisting of a set of objects A and a binary relation -> on A.  This is a simple case of an abstract reduction system, as defined in the Terese book (in the more general case, we have not just one relation ->, but an indexed set of relations).  In the theory of abstract reduction systems, an element x is confluent iff whenever there is a path from x to y and a path from x to z, then there exists some element q which is reachable from both y and z.  An element x is locally confluent iff whenever there is an edge (not an arbitrary path) from x to y and an edge from x to z, then there is some element q reachable from both y and z.  So confluence implies local confluence, but (rather famously) the reverse implication holds only for terminating systems.  An element is terminating iff there are no infinite paths from that element.  An element is normalizing iff there exists a path from that element to a normal form, which is an element that has no outgoing edges.  So terminating implies normalizing.

We have these four properties: confluence, local confluence, termination (sometimes also called strong normalization), and normalization (sometimes called weak normalization).  What is the smallest graph that is property diverse, in the sense that for every consistent combination of properties, the graph contains an element with that combination of properties?  (The consistency requirement for the set of properties for an element arises because confluence implies local confluence and termination implies normalization).

I will post the answer to this (with a nice picture) Monday…

It has been a very long time since I posted here — life has been busy, including a new baby at home.  But I really want to share about my recent experience tackling several performance problems in Agda.  Agda, as I hope you know already, is a very elegant dependently typed pure functional programming language.  It supports Unicode, so you can write → instead of -> for function space; and many other cool symbols.  It has user-defined mixfix notation, so you can define if_then_else_ (you write underscores in Agda to show where the arguments go) with the expected syntax.  It compiles to Haskell, although I get the impression that many people just use Agda as an advanced type checker, and do not bother compiling executables.  Agda has very good inference for implicit arguments, which can help make code shorter and more readable.  It also features definition of terminating functions by recursive equations.  So you can write beautiful definitions like the following for infix vector append:

_++𝕍_ : ∀ {ℓ} {A : Set ℓ}{n m : ℕ} → 𝕍 A n → 𝕍 A m → 𝕍 A (n + m)
[] ++𝕍 ys = ys
(x :: xs) ++𝕍 ys = x :: (xs ++𝕍 ys)

You might object to naming the function _++𝕍_ instead of _++_, but Agda does not support type classes or other approaches to operator overloading, and I prefer never to have to worry about symbols clashing from different included files.  This definition is from my standard library (not the Agda standard library; if prompted for username and password, enter “guest” for both).  There is a very nice emacs mode, too.

With all these great features, working well together, Agda provides just about the most elegant programming experience I have had in 21 years of coding.  I think it is a fantastic language, with much to emulate and learn from.  These accolades aside, my purpose in this post is to discuss some grotesque workarounds for performance problems inherent in the implementation and maybe even the language.  To be direct: Agda’s type checker has abysmal performance.  Suppose we create an Agda source file defining test to be a list containing 3000 copies of boolean true.  Agda takes 12.5 seconds to type check this file on my laptop.  If we give the same example to OCaml, it takes 0.2 seconds, in other words, heading towards two orders of magnitude slower.  Now, is Agda’s type checker doing fancier things than OCaml’s?  Undoubtedly.  But not on this example!  I am willing to accept some overhead in general for fancier type checking even on code that does not use fancy types.  And OCaml has been around for quite some time and is engineered by one of the best language implementors on the planet.  Fine.  So let Agda be 2 times slower.  Let it be 5 times slower.  But 60 times slower?  That is not good.

I ran into the performance issues with Agda’s type checker while tilting at the windmill I’ve been dueling off and on the past three years: parsing by rewriting.  Without going into detail here, let me just say I’ve been working, with very patient colleague Nao Hirokawa, to create a new approach to parsing based on term rewriting.  The idea as it stands now is that one runs an automaton approximating the language of your CFG over the input string, to create a formal artifact called a run.  Then one applies confluent run-rewriting rules to this run, where those rules are derived from the productions of the grammar and will rewrite every string in the language to a term built from the start symbol and then containing the parse tree as a subterm.  I love the approach, because it is inherently parallelizable (because the run-rewriting rules are confluent), and because we can resolve ambiguities in the grammar by adding rewrite rules.  The trickiest part is coming up with the right approximating automaton, and this is still not at all ready for prime time (despite the fact that I inflicted it on my undergraduate class this semester).

Anyhow, since I have been teaching undergrad PL this semester using Agda (more on this in a later post), and since Agda does not have a parser generator (perhaps because FPers seem to prefer parser combinators), I decided I would make my parsing-by-rewriting tool, called gratr, target Agda as a backend.  After a fair bit of hacking I had this working, only to discover that for even pretty small grammars, I was generating several thousand line Agda source files, which Agda absolutely could not handle.  Imagine my disappointment!  Beautiful (if not yet ready for big grammars) approach to parsing, targeting Agda, and the Agda type checker could not handle my generated parsers’ source files, except for the tiniest of grammars.  I was depressed, and determined to find a way around this problem to get working parsers from the medium-small grammars I wanted to use for class, as well as research.

Enter –use-string-encoding.  A little experimentation revealed that while Agda chokes checking even very simple terms when they get even moderately big, it will affirm that double-quoted strings indeed have type string in time seemingly independent of string size.  Oh the fiendery.  Let us encode all our parsing data structures — that means automaton and run-rewriting rules both — as strings, get them past the Agda type checker, and then decode at runtime to get our parser.  It is gross, it is nasty, but it might just work.  Of course, no one wants to decode their parsing data structures every time a string is parsed, but that was the price I decided I’d be willing to pay to get my parsers running in Agda.

I spent a month or so — while learning to care, with my wife, for a very fussy newborn — implementing this.  I finally had code in gratr to dump all the data structures as strings, and code in Agda to decode those strings and plug them into my existing parsing-by-rewriting infrastructure.  Agda could type check the file containing the strings in a second or two, even when the strings were huge (megabytes-long files, due to the unfortunately large automata my approach is currently producing).  The moment of truth arrives: let us actually compile the entire shebang to an executable (not just type check that one file).  Agda type-checking chokes.  I cannot believe what I am seeing.  What is happening?  I can type check the files containing the string-encoded data structures almost instantly, but type-checking the wrapper file defining the main entry point (which is just based off the way Haskell sets up code for compilation to executables) is running seemingly forever.  A little quick experimentation reveals: big strings encoding the data structures makes type checking that file super slow.  What gives!  Further head scratching leads me to suspect that for some reason, when Agda is instantiating a module somewhere in my setup, it is actually trying to normalize the fields of a record, where those fields are calling the decode functions on the string encodings.  This is the step that could take a second or two at runtime, with ghc-optimized executables, but will likely take forever with Agda’s compile-time interpreter.  How to make Agda stop doing this?

Here’s a hack for this:

postulate
runtime-identity : ∀{A : Set} → A → A
{-# COMPILED runtime-identity (\ _ x -> x ) #-}

The idea is that we will interpose the postulate called runtime-identity to block the redex of decoder applied to string encoding.  At compile time, Agda knows nothing about runtime-identity, and hence will not be able to reduce that redex.  But at runtime, runtime-identity will be compiled to the identity function in Haskell, and hence the redex will reduce.

Delightfully, this worked.  Compiling the emitted parsers with the string encodings is now very quick, maybe 10 seconds to get all the way through Agda to Haskell to an executable.  Awesome!  Now let’s just run the executable.  Heh heh.  There is no way that could not work, right?

Wrong, of course.  Running the executable on a tiny input string takes 7 minutes and then I kill it.  Oh my gosh, I am thinking.  I just spent 5 weeks of precious coding time (in and around other duties, especially new childcare duties, with my patient wife looking on and wondering…) to get this gross hack working, and now the runtime performance is unacceptable.  I almost despair.

But hark! Reason calls.  It cannot be taking a ghc-optimized executable that long to decode my string-encoded data structures.  After all, encoding the data structures to strings from my OCaml gratr implementation is instantaneous.  Sure, decoding could be a bit longer, but forever longer?  That can’t be right.  So how can we figure out what is going on?

Easy: profile the compiled code with ghc’s profiling features.  Agda just compiles down to (almost unreadable) Haskell, which is then compiled by ghc, so we can just profile the code with ghc.  I had never used ghc’s profiler, but it was very simple to invoke and the results were easily understandable.  Where is the time going?  Here is the scary line:

d168 MAlonzo.Code.QnatZ45Zthms 322 7928472 10.9 36.0 55.0 53.5

The last numbers are showing that over half the time of the executable is going into function d168 in nat-thms.agda.  A function in nat-thms.agda?  That contains a bunch of lemmas and theorems about natural-number operations.  I hardly expect my parser to be grunting away there.  What is d168?  Well, it is the Agda-emitted version of this lemma:

<-drop : ∀ {x y : ℕ} → (x < (suc y) ≡ tt) → x ≡ y ∨ x < y ≡ tt

This function looks to take linear time in the size of x, which could be the length of the emitted string encoding in this case.  Where on earth is this called from?  And why is its evaluation getting forced anyway in Haskell’s lazy evaluation model?  <-drop is getting called in

wf-< : ∀ (n : ℕ) → WfStructBool _<_ n

This is the proof that the _<_ ordering on natural numbers is well-founded.  The string-decoding functions have to use well-founded recursion for Agda to see they are terminating.  You recursively decode some part of the string, and then need to continue on the residual part of the string that has not been decoded yet, which is returned by your recursive call.  Agda cannot see that the residual you are recursing on is a subterm of the starting input string, so it cannot confirm the function is structurally terminating.  The solution is to use well-founded recursion.  And this is taking, as far as I can tell, time quadratic in the size of the input string to be decoded.  These strings are long, so a quadratic time operation (with lots of recursion and pattern matching) is going to kill us.

What is the solution?  Strip out the well-founded recursion and just disable Agda’s termination checker.  I do this, cross my fingers, compile, run, and … it works!  Hot diggety.

So those are the three performance problems we tackled here in Agda: slow type checking (just avoid the type checker altogether by encoding big data structures as strings and decoding at runtime), unwanted compile-time evaluation (interpose postulated runtime-identity to block the redexes), and super slow well-founded recursion (punt and disable the termination checker).  I am interested in any similar experiences readers may have had….

Our shipment from HP of 160 nodes for StarExec arrived this week.  Pictures below courtesy of Aran Cox:

Everything needs to be installed and configured, so it will be some time yet (4-6 weeks) until this is incorporated into StarExec.  At that point, we will finally throw open the doors to the public…