
Category Archives: Coding

Greetings, QA9 reader(s).  It has been a while since I posted, for which I offer the usual excuses.  I have a backlog of interesting technical things I’m hoping to write about sometime pretty soon: failure of subject reduction in lambda2 with eta-contraction, which I learned about maybe a year ago but did not record in a post; also intriguing things I am reading about deep inference, although there the problem is the stuff is, well, deep and I haven’t understood it well enough yet to have too much to say.  I also had high hopes of posting about the super fun summer I had here at U. Iowa working on Cedille with great new U. Iowa graduate students Chad Reynolds, Ananda Guneratne, and Richard Blair; great visiting Indiana doctoral student Matthew Heimerdinger; and inspiring visitors Stéphane Graham-Lengrand, Olivier Hermant, and Vincent St-Amour.  In a perfect world I would have some pictures to post of all this stimulating interaction, but here I am photo-less.

But what impels me to write today is getting bitten again by performance problems with well-founded recursion in Agda. Just for fun, I will tell you the story. We are working on this new interactive theorem prover called Cedille based on dependently typed lambda encodings. You can read about this in the project report we just revised to submit for the TFP 2016 post-proceedings. Thanks to the determination of Matthew (Heimerdinger), we have a highlighting mode — actually several — for Cedille in our emacs interface. One funny wrinkle was that we could not highlight comments, because our parser was just dropping them, and we did not want to store information about them in our abstract syntax trees (which would then become very cluttered). We hit upon a daring (for us, anyway) solution: parse input files a second time, with a different grammar that recognizes just whitespace and comments. It is nice to find the whitespace, too, in addition to the comments, since then we can set highlighting to the default for those spans of the text in the emacs mode. Otherwise, with the way our interaction between the emacs frontend and the Agda backend works (see that revised project report), typing whitespace will likely get you a non-default highlighting color from the span containing that whitespace.

Ok, so we (Richard) wrote the grammar for this, and we started producing comment and whitespace spans to send to the frontend, only to see some very sluggish performance on a couple of longer files. We misdiagnosed this as coming from doing lots of string appends (on what correspond to slow strings in Haskell, i.e., lists of characters), and I dutifully changed the code to print the spans to a Haskell Handle, rather than computing one big string for all the spans and then printing that. Distressingly, this optimization, which took maybe an hour or so to implement, including finding the right Haskell functions to hook up to, did not improve performance. Well shucks! Let’s get out our handy Haskell profiler and see what is going wrong. Now sadly, I am still on Agda 2.4.2.4, which lacks the nifty agda-ghc-names tool that is part of Agda 2.5.1. That tool translates the Haskell names generated by the Agda compiler back to names from the Agda source. But this tool was not needed to see the following:

COST CENTRE   MODULE                      %time  %alloc

d179.\.\.\    MAlonzo.Code.QnatZ45Zthms    35.9    18.8
d179.\        MAlonzo.Code.QnatZ45Zthms    11.6    48.8
d191.\        MAlonzo.Code.QnatZ45Zthms     9.6     7.5
d179          MAlonzo.Code.QnatZ45Zthms     5.7     0.0
d191          MAlonzo.Code.QnatZ45Zthms     5.2     0.0
d179.\.\      MAlonzo.Code.QnatZ45Zthms     4.7    11.3

Oh pondwater.  Over half the time is being spent in functions in nat-thms.agda in the IAL.  That file contains, as the name suggests, just theorems about the natural numbers — not code I expect to be running intensively during execution of the Cedille backend.  But I have seen this before, as I noted.  These are lemmas getting called again and again during a well-founded recursion, to show that certain less-than facts hold.  Which well-founded recursion was it?  It turns out it was for the function that converts a natural number to a string.  We are doing that operation many times, to print character positions in the input Cedille source file.  Just removing the well-founded recursion and marking the nat-to-string function as needing NO_TERMINATION_CHECK reduced the running time to less than half what it was.
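
Here is a minimal self-contained sketch of the shape of the fix (not the actual IAL or Cedille code; div10 and mod10 are just postulated stand-ins for division and remainder by ten). The digit loop recurses on (div10 n), which is not a structural subterm of n, so Agda either needs a well-founded-recursion proof, whose supporting lemmas then run at every step, or the pragma:

data ℕ : Set where
  zero : ℕ
  suc  : ℕ → ℕ

data 𝕃 (A : Set) : Set where
  []   : 𝕃 A
  _::_ : A → 𝕃 A → 𝕃 A

postulate div10 mod10 : ℕ → ℕ   -- stand-ins for division and remainder by ten

-- digits, least significant first (the real code also converts to characters);
-- the recursive call is on (div10 n), not on a subterm of n, so the termination
-- checker rejects this unless we switch it off (spelled TERMINATING in current Agda)
{-# NO_TERMINATION_CHECK #-}
digits : ℕ → 𝕃 ℕ
digits zero = []
digits n    = mod10 n :: digits (div10 n)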

The moral of this story is the same as the moral of the previous story: be wary of well-founded recursion in Agda for code you intend to run (as opposed to just reason about).

Have a blessed start of fall!

As I, like many others of good will and high aspiration, battle the POPL ’16 submission deadline, I feel compelled to write a quick post about how to get your Agda code through the type-checker faster.  For my submission fight, I have got about 3000 lines of Agda that is intended to be compiled and executed (imagine!), and Agda is getting bogged down.  Compilations are taking productivity-killing minutes instead of seconds, or not completing at all!  Argh.  Here are three things I did that seemed to help (YMMV, of course):

  1. Do not use anonymous lets in your code.  Always give a type for let-bound variables.  I had three nested anonymous lets in sequence, and they seemed to be slowing Agda down a lot.
  2. Instead of nested tuples, use a new datatype.  Agda seems to have more trouble with (a1 , a2 , a3 , a4) than with (c a1 a2 a3 a4), where c is a new constructor.
  3. Sadly, you may be forced to break up big mutually recursive functions.  I have got a doozy going, with maybe 12-15 mutually recursive functions.  Since they all have to go in one file (right?), we get a big Agda file, which seems to give Agda more trouble, generally, than a small Agda file.  The solution is to break up the mutually recursive functions.  To do this, you have to use {-# NO_TERMINATION_CHECK #-} right before the function declarations, and then you have to pass some of those functions explicitly to the other functions, to use for making the recursive calls.  So if f calls g and h, you put f into one file, and add two extra function arguments to f.  These are the functions to use when f wants to call g and when it wants to call h.  Then you put the code for g and h in another file which imports the file containing f.  Where g or h used to call f, they now call (f g h) instead, since f takes g and h explicitly as arguments.  There is a small sketch of this after the list.
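
Here is a small self-contained sketch of the shape of that transformation, with hypothetical functions is-even and is-odd standing in for the real mutually recursive functions (in the real setup they would live in two separate files):

data ℕ : Set where
  zero : ℕ
  suc  : ℕ → ℕ

data 𝔹 : Set where
  tt ff : 𝔹

-- "file 1": is-even takes the function it used to be mutually recursive
-- with as an explicit extra argument, and its termination check is disabled
{-# NO_TERMINATION_CHECK #-}
is-even : (ℕ → 𝔹) → ℕ → 𝔹
is-even odd zero    = tt
is-even odd (suc n) = odd n

-- "file 2", which imports "file 1": where is-odd used to call is-even, it
-- now calls (is-even is-odd); Agda cannot see that this terminates either,
-- so it also gets the pragma
{-# NO_TERMINATION_CHECK #-}
is-odd : ℕ → 𝔹
is-odd zero    = ff
is-odd (suc n) = is-even is-odd n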

Fun fun!

It has been a very long time since I posted here — life has been busy, including a new baby at home. But I really want to share my recent experience tackling several performance problems in Agda. Agda, as I hope you know already, is a very elegant dependently typed pure functional programming language. It supports Unicode, so you can write → instead of -> for function space, along with many other cool symbols. It has user-defined mixfix notation, so you can define if_then_else_ (you write underscores in Agda to show where the arguments go) with the expected syntax. It compiles to Haskell, although I get the impression that many people just use Agda as an advanced type checker, and do not bother compiling executables. Agda has very good inference for implicit arguments, which can help make code shorter and more readable. It also features definition of terminating functions by recursive equations. So you can write beautiful definitions like the following for infix vector append:

_++𝕍_ : ∀ {ℓ} {A : Set ℓ}{n m : ℕ} → 𝕍 A n → 𝕍 A m → 𝕍 A (n + m)
[] ++𝕍 ys = ys
(x :: xs) ++𝕍 ys = x :: (xs ++𝕍 ys)

You might object to naming the function _++𝕍_ instead of _++_, but Agda does not support type classes or other approaches to operator overloading, and I prefer never to have to worry about symbols clashing from different included files.  This definition is from my standard library (not the Agda standard library; if prompted for username and password, enter “guest” for both).  There is a very nice emacs mode, too.

With all these great features, working well together, Agda provides just about the most elegant programming experience I have had in 21 years of coding. I think it is a fantastic language, with much to emulate and learn from. These accolades aside, my purpose in this post is to discuss some grotesque workarounds for performance problems inherent in the implementation and maybe even the language. To be direct: Agda’s type checker has abysmal performance. Suppose we create an Agda source file defining test to be a list containing 3000 copies of boolean true. Agda takes 12.5 seconds to type check this file on my laptop. If we give the same example to OCaml, it takes 0.2 seconds; in other words, Agda is heading towards two orders of magnitude slower. Now, is Agda’s type checker doing fancier things than OCaml’s? Undoubtedly. But not on this example! I am willing to accept some overhead in general for fancier type checking, even on code that does not use fancy types. And OCaml has been around for quite some time and is engineered by one of the best language implementors on the planet. Fine. So let Agda be 2 times slower. Let it be 5 times slower. But 60 times slower? That is not good.
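
For concreteness, the benchmark file looks roughly like this (a sketch with its own tiny datatypes, not the exact file; the real file just repeats the literal out to 3000 elements):

data 𝔹 : Set where
  tt ff : 𝔹

data 𝕃 (A : Set) : Set where
  []   : 𝕃 A
  _::_ : A → 𝕃 A → 𝕃 A

test : 𝕃 𝔹
test = tt :: tt :: tt :: []   -- imagine 3000 copies of tt here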

I ran into the performance issues with Agda’s type checker while tilting at the windmill I’ve been dueling off and on the past three years: parsing by rewriting.  Without going into detail here, let me just say I’ve been working, with very patient colleague Nao Hirokawa, to create a new approach to parsing based on term rewriting.  The idea as it stands now is that one runs an automaton approximating the language of your CFG over the input string, to create a formal artifact called a run.  Then one applies confluent run-rewriting rules to this run, where those rules are derived from the productions of the grammar and will rewrite every string in the language to a term built from the start symbol and then containing the parse tree as a subterm.  I love the approach, because it is inherently parallelizable (because the run-rewriting rules are confluent), and because we can resolve ambiguities in the grammar by adding rewrite rules.  The trickiest part is coming up with the right approximating automaton, and this is still not at all ready for prime time (despite the fact that I inflicted it on my undergraduate class this semester).

Anyhow, since I have been teaching undergrad PL this semester using Agda (more on this in a later post), and since Agda does not have a parser generator (perhaps because FPers seem to prefer parser combinators), I decided I would make my parsing-by-rewriting tool, called gratr, target Agda as a backend.  After a fair bit of hacking I had this working, only to discover that for even pretty small grammars, I was generating several thousand line Agda source files, which Agda absolutely could not handle.  Imagine my disappointment!  Beautiful (if not yet ready for big grammars) approach to parsing, targeting Agda, and the Agda type checker could not handle my generated parsers’ source files, except for the tiniest of grammars.  I was depressed, and determined to find a way around this problem to get working parsers from the medium-small grammars I wanted to use for class, as well as research.

Enter --use-string-encoding. A little experimentation revealed that while Agda chokes checking even very simple terms when they get even moderately big, it will affirm that double-quoted strings indeed have type string in time seemingly independent of string size. Oh the fiendery. Let us encode all our parsing data structures — that means automaton and run-rewriting rules both — as strings, get them past the Agda type checker, and then decode at runtime to get our parser. It is gross, it is nasty, but it might just work. Of course, no one wants to decode their parsing data structures every time a string is parsed, but that was the price I decided I’d be willing to pay to get my parsers running in Agda.

I spent a month or so — while learning to care, with my wife, for a very fussy newborn — implementing this. I finally had code in gratr to dump all the data structures as strings, and code in Agda to decode those strings and plug them into my existing parsing-by-rewriting infrastructure. Agda could type check the file containing the strings in a second or two, even when the strings were huge (megabytes-long files, due to the unfortunately large automata my approach is currently producing). The moment of truth arrives: let us actually compile the entire shebang to an executable (not just type check that one file). Agda type-checking chokes. I cannot believe what I am seeing. What is happening? I can type check the files containing the string-encoded data structures almost instantly, but type-checking the wrapper file defining the main entry point (which is just based off the way Haskell sets up code for compilation to executables) is running seemingly forever. Some quick experimentation reveals: big strings encoding the data structures make type checking that file super slow. What gives! Further head scratching leads me to suspect that for some reason, when Agda is instantiating a module somewhere in my setup, it is actually trying to normalize the fields of a record, where those fields are calling the decode functions on the string encodings. This is the step that could take a second or two at runtime, with ghc-optimized executables, but will likely take forever with Agda’s compile-time interpreter. How to make Agda stop doing this?

Here’s a hack for this:

postulate
  runtime-identity : ∀{A : Set} → A → A
{-# COMPILED runtime-identity (\ _ x -> x) #-}

The idea is that we will interpose the postulate called runtime-identity to block the redex of the decoder applied to the string encoding. At compile time, Agda knows nothing about runtime-identity, and hence will not be able to reduce that redex. But at runtime, runtime-identity will be compiled to the identity function in Haskell, and hence the redex will reduce.
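
To make that concrete, here is a toy usage assuming the postulate above (expensive is just a stand-in for the decoder applied to a string encoding; this is not the gratr code):

data ℕ : Set where
  zero : ℕ
  suc  : ℕ → ℕ
{-# BUILTIN NATURAL ℕ #-}

expensive : ℕ → ℕ            -- stand-in for (decode string-encoding)
expensive zero    = zero
expensive (suc n) = suc (expensive n)

-- compile-time evaluation gets stuck at the postulate (that is the idea above:
-- Agda cannot reduce the blocked redex), while in the compiled program
-- runtime-identity becomes Haskell's (\ _ x -> x) and the call runs normally
blocked : ℕ
blocked = runtime-identity (expensive 1000000)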

Delightfully, this worked.  Compiling the emitted parsers with the string encodings is now very quick, maybe 10 seconds to get all the way through Agda to Haskell to an executable.  Awesome!  Now let’s just run the executable.  Heh heh.  There is no way that could not work, right?

Wrong, of course.  Running the executable on a tiny input string takes 7 minutes and then I kill it.  Oh my gosh, I am thinking.  I just spent 5 weeks of precious coding time (in and around other duties, especially new childcare duties, with my patient wife looking on and wondering…) to get this gross hack working, and now the runtime performance is unacceptable.  I almost despair.

But hark! Reason calls.  It cannot be taking a ghc-optimized executable that long to decode my string-encoded data structures.  After all, encoding the data structures to strings from my OCaml gratr implementation is instantaneous.  Sure, decoding could be a bit longer, but forever longer?  That can’t be right.  So how can we figure out what is going on?

Easy: profile the compiled code with ghc’s profiling features.  Agda just compiles down to (almost unreadable) Haskell, which is then compiled by ghc, so we can just profile the code with ghc.  I had never used ghc’s profiler, but it was very simple to invoke and the results were easily understandable.  Where is the time going?  Here is the scary line:

d168 MAlonzo.Code.QnatZ45Zthms 322 7928472 10.9 36.0 55.0 53.5

The last numbers are showing that over half the time of the executable is going into function d168 in nat-thms.agda.  A function in nat-thms.agda?  That contains a bunch of lemmas and theorems about natural-number operations.  I hardly expect my parser to be grunting away there.  What is d168?  Well, it is the Agda-emitted version of this lemma:

<-drop : ∀ {x y : ℕ} → (x < (suc y) ≡ tt) → x ≡ y ∨ x < y ≡ tt

This function looks to take linear time in the size of x, which could be the length of the emitted string encoding in this case.  Where on earth is this called from?  And why is its evaluation getting forced anyway in Haskell’s lazy evaluation model?  <-drop is getting called in

wf-< : ∀ (n : ℕ) → WfStructBool _<_ n

This is the proof that the _<_ ordering on natural numbers is well-founded.  The string-decoding functions have to use well-founded recursion for Agda to see they are terminating.  You recursively decode some part of the string, and then need to continue on the residual part of the string that has not been decoded yet, which is returned by your recursive call.  Agda cannot see that the residual you are recursing on is a subterm of the starting input string, so it cannot confirm the function is structurally terminating.  The solution is to use well-founded recursion.  And this is taking, as far as I can tell, time quadratic in the size of the input string to be decoded.  These strings are long, so a quadratic time operation (with lots of recursion and pattern matching) is going to kill us.
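
Here is a self-contained toy with the same recursion shape (hypothetical, not the actual gratr decoding code), to show exactly where the well-founded recursion comes in:

data 𝔹 : Set where
  tt ff : 𝔹

data ℕ : Set where
  zero : ℕ
  suc  : ℕ → ℕ

data 𝕃 (A : Set) : Set where
  []   : 𝕃 A
  _::_ : A → 𝕃 A → 𝕃 A

-- consume one "token": everything up to and including the first ff
drop-token : 𝕃 𝔹 → 𝕃 𝔹
drop-token []        = []
drop-token (ff :: s) = s
drop-token (tt :: s) = drop-token s

-- the recursive call is on the residual (drop-token s), not on a subterm of s,
-- so the termination checker rejects this; the options are a well-founded
-- recursion on the length of s (slow at run time, as above) or disabling the check
{-# NO_TERMINATION_CHECK #-}
count-tokens : 𝕃 𝔹 → ℕ
count-tokens [] = zero
count-tokens s  = suc (count-tokens (drop-token s))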

What is the solution?  Strip out the well-founded recursion and just disable Agda’s termination checker.  I do this, cross my fingers, compile, run, and … it works!  Hot diggety.

So those are the three performance problems we tackled here in Agda: slow type checking (just avoid the type checker altogether by encoding big data structures as strings and decoding at runtime), unwanted compile-time evaluation (interpose postulated runtime-identity to block the redexes), and super slow well-founded recursion (punt and disable the termination checker).  I am interested in any similar experiences readers may have had….

I will be posting again about lambda-encoding data, but I wanted to share one interesting experience first, applying a fuzzer to our versat verified SAT solver, primarily implemented by Duckki Oe, who is, by the way, currently looking for postdoc or possibly industrial verification positions. This solver took Duckki, with some contributions from me, Corey Oliver, and Kevin Clancy, 2 years to implement. Along the way, he did find and fix a couple of bugs in parts of the solver we were not verifying. For example, at some point there was a bug in the watched lists, so new clauses were never indexed. That affected performance behavior, and took a while to detect. But that code was not needed for UNSAT-soundness (that is, correctness of UNSAT answers, which is what we verified). Also, Duckki debugged some SAT-unsoundness issues, which were (easily) detected by checking the model at the end of solving. So writing verified code does not mean there is no debugging to do.

But!

Certainly, for what we verified, we have not found a bug in evaluating versat. Such a bug would represent an error in our specification, which is short (consisting of the definition of propositional resolution and a parser for CNF formulas in DIMACS format); or else a bug in the Guru compiler or logic. The latter is not out of the question, but unlikely to manifest itself as accepting an invalid proof. And just today I followed the lead of John Regehr in actually trying to find bugs in a piece of code that has been proven correct. After all, versat is UNSAT-sound in theory, but maybe not in practice! John did this for the CompCert verified optimizing C compiler, and after expending 6 compute-years on searching for bugs, could not find even one in the verified core of the compiler (but found several in the parser and similar unverified parts). I used the cnfuzz tool by Robert Brummayer, Florian Lonsing, and Armin Biere. I wrote a simple wrapper script that would call cnfuzz to generate random nasty CNF benchmarks to feed to versat, and also to MiniSat 2.2.0. Then I just compared their answers. I ran 10,000 (different) benchmarks this way (way less of an evaluation than John reported for CompCert). Of these, versat timed out on 27 with a 60-second time limit. For all the rest, the answers are the same for the two tools.

Let me try to emphasize just how remarkable this is.  A modern SAT solver is not a huge amount of code (typically), but it is highly optimized and very tricky.  I would expect quite a bit of testing before such a tool written in a conventional language is reliable.  For example, way back in the day when I was a grad student working on the CVC SMT solver (first version, with subsequent versions implemented by Clark Barrett and his students and my colleague Cesare Tinelli and his students), I remember many fond hours (days, weeks) spent with pieces of paper taped together covered in handwritten annotated formulas as I tried to find why the solver was giving the wrong answer in some case.  Now, an SMT solver is truly a lot more complicated than a SAT solver, and usually with a lot bigger codebase.   But still, I would fully expect similar pain in getting a modern SAT solver up and running reliably.

For versat, modulo a couple of bug fixes earlier in development for unverified parts of the solver, it works perfectly right out of the gate. We finally get it through the type checker after 6 months of proving lemmas about its data structures and algorithms, and it is completely correct, at least as far as testing on 10,000 random benchmarks against MiniSat reveals. Notice, it is not just the UNSAT answers that are right. The SAT ones are right, too. I have noticed this in developing real verified software that you actually intend to run (as opposed to just formalize, prove correct, and discard). Verifying some properties tends to increase overall correctness, even for properties you don’t verify. The verification is so ridiculously hard and requires such deep involvement in what the tool does, that you just end up with a higher-quality piece of code in the end — if you get something working at all, which is a very difficult proposition, requiring highly skilled engineering talent (did I mention Duckki is looking for postdoc and industrial verification positions?).

So, in the end, I feel this is great confirmation — and it was a great idea of John Regehr’s to do this in general — of the quality of a verified piece of software. Without extensive testing and debugging, versat exactly matches MiniSat’s behavior on 10,000 benchmarks (modulo 0.27% timeouts) generated by cnfuzz. This is what a verified-programming proponent dreams of: you write your code, you verify it with blood, sweat, and tears, and it just works perfectly at the end. It is a great feeling, and I am looking for additional interesting case studies with which to repeat it.

Mild neglect of duty continues here in favor of performance evaluation, which I was discussing in my previous post.  Here are a couple more useful things I found:

  • Using fgetc_unlocked() cut my processing time by 25% parsing a 120MB proof.
  • Using a raw C array of characters instead of an STL string to hold the current token cut another 25%.

These two simple changes have got my time on a 1GB proof down to 45s user time, which I am quite happy with. Watching with “top”, I see CPU utilization of around 50% while parsing this proof, which I interpret to mean that the process is now IO-bound. So on my laptop, at any rate, I’m reaching the limit of how quickly I can tear through this big proof on disk.

As I noted in comments to the previous post, pulling in bigger chunks of text from disk using either fread() or mmap() does not improve on fgetc().

Hope everyone is having a pleasant and/or productive weekend.