Skip navigation

Category Archives: Coding

It has been a very long time since I posted here — life has been busy, including a new baby at home.  But I really want to share about my recent experience tackling several performance problems in Agda.  Agda, as I hope you know already, is a very elegant dependently typed pure functional programming language.  It supports Unicode, so you can write → instead of -> for function space; and many other cool symbols.  It has user-defined mixfix notation, so you can define if_then_else_ (you write underscores in Agda to show where the arguments go) with the expected syntax.  It compiles to Haskell, although I get the impression that many people just use Agda as an advanced type checker, and do not bother compiling executables.  Agda has very good inference for implicit arguments, which can help make code shorter and more readable.  It also features definition of terminating functions by recursive equations.  So you can write beautiful definitions like the following for infix vector append:

_++𝕍_ : ∀ {ℓ} {A : Set ℓ}{n m : ℕ} → 𝕍 A n → 𝕍 A m → 𝕍 A (n + m)
[] ++𝕍 ys = ys
(x :: xs) ++𝕍 ys = x :: (xs ++𝕍 ys)

You might object to naming the function _++𝕍_ instead of _++_, but Agda does not support type classes or other approaches to operator overloading, and I prefer never to have to worry about symbols clashing from different included files.  This definition is from my standard library (not the Agda standard library; if prompted for username and password, enter “guest” for both).  There is a very nice emacs mode, too.

With all these great features, working well together, Agda provides just about the most elegant programming experience I have had in 21 years of coding.  I think it is a fantastic language, with much to emulate and learn from.  These accolades aside, my purpose in this post is to discuss some grotesque workarounds for performance problems inherent in the implementation and maybe even the language.  To be direct: Agda’s type checker has abysmal performance.  Suppose we create an Agda source file defining test to be a list containing 3000 copies of boolean true.  Agda takes 12.5 seconds to type check this file on my laptop.  If we give the same example to OCaml, it takes 0.2 seconds, in other words, heading towards two orders of magnitude slower.  Now, is Agda’s type checker doing fancier things than OCaml’s?  Undoubtedly.  But not on this example!  I am willing to accept some overhead in general for fancier type checking even on code that does not use fancy types.  And OCaml has been around for quite some time and is engineered by one of the best language implementors on the planet.  Fine.  So let Agda be 2 times slower.  Let it be 5 times slower.  But 60 times slower?  That is not good.

I ran into the performance issues with Agda’s type checker while tilting at the windmill I’ve been dueling off and on the past three years: parsing by rewriting.  Without going into detail here, let me just say I’ve been working, with very patient colleague Nao Hirokawa, to create a new approach to parsing based on term rewriting.  The idea as it stands now is that one runs an automaton approximating the language of your CFG over the input string, to create a formal artifact called a run.  Then one applies confluent run-rewriting rules to this run, where those rules are derived from the productions of the grammar and will rewrite every string in the language to a term built from the start symbol and then containing the parse tree as a subterm.  I love the approach, because it is inherently parallelizable (because the run-rewriting rules are confluent), and because we can resolve ambiguities in the grammar by adding rewrite rules.  The trickiest part is coming up with the right approximating automaton, and this is still not at all ready for prime time (despite the fact that I inflicted it on my undergraduate class this semester).

Anyhow, since I have been teaching undergrad PL this semester using Agda (more on this in a later post), and since Agda does not have a parser generator (perhaps because FPers seem to prefer parser combinators), I decided I would make my parsing-by-rewriting tool, called gratr, target Agda as a backend.  After a fair bit of hacking I had this working, only to discover that for even pretty small grammars, I was generating several thousand line Agda source files, which Agda absolutely could not handle.  Imagine my disappointment!  Beautiful (if not yet ready for big grammars) approach to parsing, targeting Agda, and the Agda type checker could not handle my generated parsers’ source files, except for the tiniest of grammars.  I was depressed, and determined to find a way around this problem to get working parsers from the medium-small grammars I wanted to use for class, as well as research.

Enter –use-string-encoding.  A little experimentation revealed that while Agda chokes checking even very simple terms when they get even moderately big, it will affirm that double-quoted strings indeed have type string in time seemingly independent of string size.  Oh the fiendery.  Let us encode all our parsing data structures — that means automaton and run-rewriting rules both — as strings, get them past the Agda type checker, and then decode at runtime to get our parser.  It is gross, it is nasty, but it might just work.  Of course, no one wants to decode their parsing data structures every time a string is parsed, but that was the price I decided I’d be willing to pay to get my parsers running in Agda.

I spent a month or so — while learning to care, with my wife, for a very fussy newborn — implementing this.  I finally had code in gratr to dump all the data structures as strings, and code in Agda to decode those strings and plug them into my existing parsing-by-rewriting infrastructure.  Agda could type check the file containing the strings in a second or two, even when the strings were huge (megabytes-long files, due to the unfortunately large automata my approach is currently producing).  The moment of truth arrives: let us actually compile the entire shebang to an executable (not just type check that one file).  Agda type-checking chokes.  I cannot believe what I am seeing.  What is happening?  I can type check the files containing the string-encoded data structures almost instantly, but type-checking the wrapper file defining the main entry point (which is just based off the way Haskell sets up code for compilation to executables) is running seemingly forever.  A little quick experimentation reveals: big strings encoding the data structures makes type checking that file super slow.  What gives!  Further head scratching leads me to suspect that for some reason, when Agda is instantiating a module somewhere in my setup, it is actually trying to normalize the fields of a record, where those fields are calling the decode functions on the string encodings.  This is the step that could take a second or two at runtime, with ghc-optimized executables, but will likely take forever with Agda’s compile-time interpreter.  How to make Agda stop doing this?

Here’s a hack for this:

 postulate
 runtime-identity : ∀{A : Set} → A → A
{-# COMPILED runtime-identity (\ _ x -> x ) #-}

The idea is that we will interpose the postulate called runtime-identity to block the redex of decoder applied to string encoding.  At compile time, Agda knows nothing about runtime-identity, and hence will not be able to reduce that redex.  But at runtime, runtime-identity will be compiled to the identity function in Haskell, and hence the redex will reduce.

Delightfully, this worked.  Compiling the emitted parsers with the string encodings is now very quick, maybe 10 seconds to get all the way through Agda to Haskell to an executable.  Awesome!  Now let’s just run the executable.  Heh heh.  There is no way that could not work, right?

Wrong, of course.  Running the executable on a tiny input string takes 7 minutes and then I kill it.  Oh my gosh, I am thinking.  I just spent 5 weeks of precious coding time (in and around other duties, especially new childcare duties, with my patient wife looking on and wondering…) to get this gross hack working, and now the runtime performance is unacceptable.  I almost despair.

But hark! Reason calls.  It cannot be taking a ghc-optimized executable that long to decode my string-encoded data structures.  After all, encoding the data structures to strings from my OCaml gratr implementation is instantaneous.  Sure, decoding could be a bit longer, but forever longer?  That can’t be right.  So how can we figure out what is going on?

Easy: profile the compiled code with ghc’s profiling features.  Agda just compiles down to (almost unreadable) Haskell, which is then compiled by ghc, so we can just profile the code with ghc.  I had never used ghc’s profiler, but it was very simple to invoke and the results were easily understandable.  Where is the time going?  Here is the scary line:

d168 MAlonzo.Code.QnatZ45Zthms 322 7928472 10.9 36.0 55.0 53.5

The last numbers are showing that over half the time of the executable is going into function d168 in nat-thms.agda.  A function in nat-thms.agda?  That contains a bunch of lemmas and theorems about natural-number operations.  I hardly expect my parser to be grunting away there.  What is d168?  Well, it is the Agda-emitted version of this lemma:

<-drop : ∀ {x y : ℕ} → (x < (suc y) ≡ tt) → x ≡ y ∨ x < y ≡ tt

This function looks to take linear time in the size of x, which could be the length of the emitted string encoding in this case.  Where on earth is this called from?  And why is its evaluation getting forced anyway in Haskell’s lazy evaluation model?  <-drop is getting called in

wf-< : ∀ (n : ℕ) → WfStructBool _<_ n

This is the proof that the _<_ ordering on natural numbers is well-founded.  The string-decoding functions have to use well-founded recursion for Agda to see they are terminating.  You recursively decode some part of the string, and then need to continue on the residual part of the string that has not been decoded yet, which is returned by your recursive call.  Agda cannot see that the residual you are recursing on is a subterm of the starting input string, so it cannot confirm the function is structurally terminating.  The solution is to use well-founded recursion.  And this is taking, as far as I can tell, time quadratic in the size of the input string to be decoded.  These strings are long, so a quadratic time operation (with lots of recursion and pattern matching) is going to kill us.

What is the solution?  Strip out the well-founded recursion and just disable Agda’s termination checker.  I do this, cross my fingers, compile, run, and … it works!  Hot diggety.

So those are the three performance problems we tackled here in Agda: slow type checking (just avoid the type checker altogether by encoding big data structures as strings and decoding at runtime), unwanted compile-time evaluation (interpose postulated runtime-identity to block the redexes), and super slow well-founded recursion (punt and disable the termination checker).  I am interested in any similar experiences readers may have had….

I will be posting again about lambda-encoding data, but I wanted to share one interesting experience first, applying a fuzzer to our versat verified SAT solver, primarily implemented by Duckki Oe, who is, by the way, currently looking for postdocs or possibly industrial verification positions.  This solver took Duckki, with some contributions from me, Corey Oliver, and Kevin Clancy, 2 years to implement.  Along the way, he did find and fix a couple of bugs in parts of the solver we were not verifying.   For example, at some point there was a bug in watched lists, so new clauses were never indexed. That affected performance behavior, and took a while to detect.  But it was not needed for UNSAT-soundness (that is, correctness of UNSAT answers, which is what we verified).  Also, Duckki debugged some SAT-unsoundness issues, which were (easily) detected by model checking at the end of solving.  So writing verified code does not mean there is no debugging to do.

But!

Certainly for what we verified we have not found a bug in evaluating versat.  Such a bug would represent an error in our specification, which is short (consisting of the definition of propositional resolution and a parser for CNF formulas in DIMACS format); or else a bug in the Guru compiler or logic.  The latter is not out of the question, but unlikely to manifest itself as accepting an invalid proof.   And just today I followed the lead of John Regehr in actually trying to find bugs in a piece of code that has been proven correct.  After all, versat is UNSAT-sound in theory, but maybe not in practice!  John did this for the CompCert verified optimizing C compiler, and after expending 6 compute years on searching for bugs, could not find even one in the verified core of the compiler (but several in the parser and similar unverified parts).  I used the cnfuzz tool by Robert Brummayer, Florian Lonsing, and Armin Biere.  I wrote a simple wrapper script that would call cnfuzz to generate random nasty CNF benchmarks to feed to versat, and also to MiniSat 2.2.0.  Then I just compared their answers.  I ran 10,000 (different) benchmarks this way (way less of an evaluation than John reported for CompCert).  Of these, versat timed out on 27 with a 60 second time limit.  For all the rest, the answers are the same for the two tools.

Let me try to emphasize just how remarkable this is.  A modern SAT solver is not a huge amount of code (typically), but it is highly optimized and very tricky.  I would expect quite a bit of testing before such a tool written in a conventional language is reliable.  For example, way back in the day when I was a grad student working on the CVC SMT solver (first version, with subsequent versions implemented by Clark Barrett and his students and my colleague Cesare Tinelli and his students), I remember many fond hours (days, weeks) spent with pieces of paper taped together covered in handwritten annotated formulas as I tried to find why the solver was giving the wrong answer in some case.  Now, an SMT solver is truly a lot more complicated than a SAT solver, and usually with a lot bigger codebase.   But still, I would fully expect similar pain in getting a modern SAT solver up and running reliably.

For versat, modulo a couple of bug fixes earlier in development for unverified parts of the solver, it works perfectly right out of the gate.  We finally get it through the type checker after 6 months of proving lemmas about its data structures and algorithms, and it is completely correct, at least as testing on 10,000 random benchmarks reveals compared to MiniSat.  Notice, it is not just the UNSAT answers that are right.  The SAT ones are right, too.  I have noticed this in developing real verified software that you actually intend to run (as opposed to just formalize, prove correct, and discard).  Verifying some properties tends to increase overall correctness, even for properties you don’t verify.  The verification is so ridiculously hard and requires such deep involvement in what the tool does, that you just end up with a higher quality piece of code in the end — if you get something working at all, which is a very difficult proposition, requiring highly skilled engineering talent (did I mention Duckki is looking for postdoc and industrial verification positions).

So, in the end, I feel this is great confirmation — and it was a great idea of John Regehr’s to do this in general — of the quality of a verified piece of software.  Without extensive testing and debugging, versat exactly matches MiniSat’s behavior on 10,000 benchmarks (modulo 0.27% timeouts) generated by cnfuzz.  This is what a verified-programming proponent dreams of: you write your code, you verify it with blood sweat and tears, and it just works perfectly at the end.  It is a great feeling, and one for which I am looking for additional interesting case studies to repeat.

Mild neglect of duty continues here in favor of performance evaluation, which I was discussing in my previous post.  Here are a couple more useful things I found:

  • Using fgetc_unlocked() cut my processing time by 25% parsing a 120MB proof.
  • Using a raw C array of characters instead of an STL string to hold the current token cut another 25%.

These two simple changes have got my times down on a 1GB proof to 45s user time, which I am quite happy with.  Watching what is happening using “top”, CPU utilization is around 50% parsing this proof, which I interpret to mean that the process is IO-bound now.  So on my laptop, at any rate, I’m reaching the limit of how quickly I can tear through this big proof on disk.

As I noted in comments to the previous post, pulling in bigger chunks of text from disk using either fread() or mmap() does not improve on fgetc().

Hope everyone is having a pleasant and/or productive weekend.

The end of the semester here at U. Iowa is plenty busy, but (or perhaps consequently?) I have been doing some coding on our OCaml implementation of LFSC (“Logical Framework with Side Conditions”), a descendant of the Edinburgh LF that we are proposing as a standard metalanguage for proofs produced by SMT solvers.  This compiler takes a signature as input — where a signature just defines the syntax and proof rules for your logic, logical theory, or other formal system of interest — and produces C++ code for checking proofs expressed in that logic.  Our previous version of this compiler was written in C++.  It was quite complicated, somewhat difficult to maintain, and hard to use as a basis for many of our optimizations (because we were compiling signatures to C++ code that was just extending the C++ code for the compiler itself).  So we (Jed McClurg, Cuong Thai, and myself) have been implementing a new compiler for LFSC in OCaml this semester (with important input and advice from Cesare Tinelli and Andy Reynolds — see clc.cs.uiowa.edu for more about our group).

Anyhow, writing this in OCaml is so much cleaner and nicer, mostly because OCaml is just much better for writing symbolic tools, but partly also because we have learned a bunch from our previous experience.  In particular, we have trimmed down LFSC quite a bit, to focus it on its intended application.  Also, we know better how to implement things efficiently in the emitted C++ proof checker.  For example, take parsing.  The approach we used in the C++ LFSC compiler, and are working to implement in the OCaml version, is called incremental checking.  The C++ proof checker (emitted by the LFSC compiler) parses and checks the proof in an interleaved fashion, as opposed to parsing first to an abstract syntax tree, and then checking.  This approach is about 5x faster in our experience, partly because it uses much less memory.  It can therefore also handle much bigger proofs.

To get to the title of this post: as a warm-up for emitting incremental checking code from the OCaml LFSC compiler, I have been implementing something that just parses proofs, without checking them (or building ASTs for them).  The prefix syntax of LFSC proofs is very simple: the first symbol of any proof uniquely determines what kind of proof it is.  So we could just write a recursive descent parser.  But that would use C’s stack to keep track of what our pending parsing tasks are.  It is much more efficient to use a stack data structure in the language.  And there we come to what I wanted to post about this evening: C++ stacks and their inefficiencies.

But first, let’s make sure the algorithm is reasonably clear.  You are trying to parse prefix expressions, where the operator (which comes first) uniquely determines which subexpressions should be parsed next, in order to parse an expression beginning with that operator.  For a simple example, here are some first-order logic formulas in prefix notation:

formula f ::= true | false | and f1 f2 | imp f1 f2 | or f1 f2 | forall i ^ f.

The caret notation for forall is binding notation (and “i” is an individual, from another syntactic category), which LFSC supports (but which we can ignore for the discussion about C++ stacks).  A nice efficient way to parse formulas like these is to keep a stack of pending work to be done.  Each entry in the stack should hold an operator (or more efficiently, operator tag as an integer), and the number of arguments parsed so far.  Depending on how many arguments have been parsed, one either parses the next argument from the appropriate syntactic category, or else does a “reduce”, which pops the stack and increments the number at the top of the resulting stack (since we have just parsed one more subexpression for the expression that is now indicated by the top of the stack).

So in the end, we need stacks.  C++ has stacks: just use std::stack<int>, say, from the Standard Template Library, for your stack of ints holding the number parsed (and another for the operator tag).  So I implemented compilation from signatures to C++ code to parse proofs (and also syntactic expressions like the formulas listed above), using STL stacks.  The performance was somewhat disappointing: around 89s (wallclock) to parse a 1GB proof (averaged across three runs), on my somewhat oldish Latitude D420 laptop.  I profiled the emitted code, and found that an inordinate amount of time was going into stack operations, which seemed suspicious.  According to gprof on the unoptimized executable — and in my experience gprof’s time estimates are often unreliable, though call counts presumably are fine — something like 60% of the time was going into stack operations!  Admittedly, there are a lot of them: 1.6 billion calls to top(), for example.  But still, that was looking like an obvious optimization target.

STL stacks use another data structure to implement the underlying collection of stack entries, so I tried changing from deque’s (the default), to lists (much worse) and also vectors (slightly worse).  I had had bad experiences in the past with STL hashtables’ being sluggish, so I just quickly hacked my own little naive stack class.  No resizing, just a big array, push(), pop(), and top().  This reduced my time on the big proof to 79s (wallclock) on average, for a 12% speedup.  For user times, the speed up is less, more like 5%.  Here’s the code (very simple stuff — sorry the formatting is not coming through better here):

#ifndef LFSC_STACK_H
#define LFSC_STACK_H

#include <cstdlib>

template<typename Tp>
class Stack {

private:
int max_size;
int last;
Tp *arr;

public:
Stack(int _max_size) {
max_size = _max_size;
arr = new Tp[max_size];
last = -1;
}

inline void push(Tp t) {
last++;
if (last >= max_size) {
// should report error
exit(1);
}
arr[last] = t;
}

inline void pop() {
if (last < 0) {
// should report error
exit(1);
}
last–;
}

inline Tp &top() {
if (last < 0) {
// should report error
exit(1);
}
return arr[last];
}

inline int size() {
return last+1;
}

};

#endif

So I am a little disappointed in the STL data structure, although I acknowledge it is doing more than my simple implementation.  Nevertheless, this is the point of my post here: if you are hacking in C++ and looking for performance, be wary of the STL containers.  In my experience, you can usually beat them, especially if you customize a little for your application.

Follow

Get every new post delivered to your Inbox.