
The end of the semester here at U. Iowa is plenty busy, but (or perhaps consequently?) I have been doing some coding on our OCaml implementation of LFSC (“Logical Framework with Side Conditions”), a descendant of the Edinburgh LF that we are proposing as a standard metalanguage for proofs produced by SMT solvers.  This compiler takes a signature as input — where a signature just defines the syntax and proof rules for your logic, logical theory, or other formal system of interest — and produces C++ code for checking proofs expressed in that logic.  Our previous version of this compiler was written in C++.  It was quite complicated, somewhat difficult to maintain, and hard to use as a basis for many of our optimizations (because we were compiling signatures to C++ code that was just extending the C++ code for the compiler itself).  So we (Jed McClurg, Cuong Thai, and I) have been implementing a new compiler for LFSC in OCaml this semester (with important input and advice from Cesare Tinelli and Andy Reynolds — see clc.cs.uiowa.edu for more about our group).

Anyhow, writing this in OCaml is so much cleaner and nicer, mostly because OCaml is just much better for writing symbolic tools, but partly also because we have learned a bunch from our previous experience.  In particular, we have trimmed down LFSC quite a bit, to focus it on its intended application.  Also, we know better how to implement things efficiently in the emitted C++ proof checker.  For example, take parsing.  The approach we used in the C++ LFSC compiler, and are working to implement in the OCaml version, is called incremental checking.  The C++ proof checker (emitted by the LFSC compiler) parses and checks the proof in an interleaved fashion, as opposed to parsing first to an abstract syntax tree, and then checking.  This approach is about 5x faster in our experience, partly because it uses much less memory.  It can therefore also handle much bigger proofs.

To get to the title of this post: as a warm-up for emitting incremental checking code from the OCaml LFSC compiler, I have been implementing something that just parses proofs, without checking them (or building ASTs for them).  The prefix syntax of LFSC proofs is very simple: the first symbol of any proof uniquely determines what kind of proof it is.  So we could just write a recursive descent parser.  But that would use C’s stack to keep track of what our pending parsing tasks are.  It is much more efficient to use a stack data structure in the language.  And there we come to what I wanted to post about this evening: C++ stacks and their inefficiencies.

But first, let’s make sure the algorithm is reasonably clear.  You are trying to parse prefix expressions, where the operator (which comes first) uniquely determines which subexpressions should be parsed next, in order to parse an expression beginning with that operator.  For a simple example, here are some first-order logic formulas in prefix notation:

formula f ::= true | false | and f1 f2 | imp f1 f2 | or f1 f2 | forall i ^ f.

The caret notation for forall is binding notation (and “i” is an individual, from another syntactic category), which LFSC supports (but which we can ignore for the discussion about C++ stacks).  A nice efficient way to parse formulas like these is to keep a stack of pending work to be done.  Each entry in the stack should hold an operator (or more efficiently, operator tag as an integer), and the number of arguments parsed so far.  Depending on how many arguments have been parsed, one either parses the next argument from the appropriate syntactic category, or else does a “reduce”, which pops the stack and increments the number at the top of the resulting stack (since we have just parsed one more subexpression for the expression that is now indicated by the top of the stack).
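As an illustrative sketch (my own code, not the emitted checker: the operator tags and the arity table below are invented for the little formula language above, the forall binder is omitted, and unknown tokens are not handled):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Invented tags for the formula operators above (forall omitted).
enum Tag { TRUE_T, FALSE_T, AND_T, IMP_T, OR_T };

static int arity(Tag t) {
  switch (t) {
    case AND_T: case IMP_T: case OR_T: return 2;
    default: return 0;  // true, false
  }
}

static Tag tag_of(const std::string &s) {
  if (s == "true") return TRUE_T;
  if (s == "false") return FALSE_T;
  if (s == "and") return AND_T;
  if (s == "imp") return IMP_T;
  return OR_T;  // "or" (unknown tokens not handled in this sketch)
}

// Returns true iff the stream holds exactly one well-formed prefix formula.
bool parse_formula(std::istream &in) {
  std::vector<Tag> ops;     // pending operators
  std::vector<int> parsed;  // arguments parsed so far for each pending operator
  std::string tok;
  if (!(in >> tok)) return false;
  ops.push_back(tag_of(tok));
  parsed.push_back(0);
  while (!ops.empty()) {
    if (parsed.back() < arity(ops.back())) {
      // parse the next argument: read its operator and push a new entry
      if (!(in >> tok)) return false;
      ops.push_back(tag_of(tok));
      parsed.push_back(0);
    } else {
      // "reduce": this operator is complete, so pop it and credit one more
      // parsed argument to the operator now on top of the stack
      ops.pop_back();
      parsed.pop_back();
      if (!parsed.empty()) parsed.back()++;
    }
  }
  return !(in >> tok);  // no trailing tokens allowed
}
```

Here std::istream is used only so the sketch can read from a string; the real checker reads characters straight from the proof file.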

So in the end, we need stacks.  C++ has stacks: just use std::stack<int>, say, from the Standard Template Library, for your stack of ints holding the number parsed (and another for the operator tag).  So I implemented compilation from signatures to C++ code to parse proofs (and also syntactic expressions like the formulas listed above), using STL stacks.  The performance was somewhat disappointing: around 89s (wallclock) to parse a 1GB proof (averaged across three runs), on my somewhat oldish Latitude D420 laptop.  I profiled the emitted code, and found that an inordinate amount of time was going into stack operations, which seemed suspicious.  According to gprof on the unoptimized executable — and in my experience gprof’s time estimates are often unreliable, though call counts presumably are fine — something like 60% of the time was going into stack operations!  Admittedly, there are a lot of them: 1.6 billion calls to top(), for example.  But still, that was looking like an obvious optimization target.

STL stacks use another container to implement the underlying collection of stack entries, so I tried changing from deques (the default) to lists (much worse) and also vectors (slightly worse).  I had had bad experiences in the past with STL hash tables being sluggish, so I just quickly hacked my own little naive stack class.  No resizing, just a big array, push(), pop(), and top().  This reduced my time on the big proof to 79s (wallclock) on average, for a 12% speedup.  For user times, the speedup is less, more like 5%.  Here’s the code (very simple stuff):

#ifndef LFSC_STACK_H
#define LFSC_STACK_H

#include <cstdlib>

// Fixed-capacity stack: no resizing, just a big array.
template<typename Tp>
class Stack {

private:
  int max_size;  // capacity of arr
  int last;      // index of the top element, or -1 if empty
  Tp *arr;

public:
  Stack(int _max_size) {
    max_size = _max_size;
    arr = new Tp[max_size];
    last = -1;
  }

  ~Stack() {
    delete [] arr;
  }

  inline void push(Tp t) {
    last++;
    if (last >= max_size) {
      // should report error
      exit(1);
    }
    arr[last] = t;
  }

  inline void pop() {
    if (last < 0) {
      // should report error
      exit(1);
    }
    last--;
  }

  inline Tp &top() {
    if (last < 0) {
      // should report error
      exit(1);
    }
    return arr[last];
  }

  inline int size() {
    return last + 1;
  }

};

#endif

So I am a little disappointed in the STL data structure, although I acknowledge it is doing more than my simple implementation.  Nevertheless, this is the point of my post here: if you are hacking in C++ and looking for performance, be wary of the STL containers.  In my experience, you can usually beat them, especially if you customize a little for your application.


10 Comments

  1. What made you choose C++ in the first place? Performance? Why not use a functional language like Haskell?

    Harley

    • Hi, Harley. Yes, the main reason is performance. We need these proof checkers to be as fast as we know how to make them. There’s nothing inherent in functional programming that would prevent such a language from being up to the job, but most functional PLs rely on garbage collection for memory management. It’s my experience that we can usually beat such languages if we manage our own memory, which we will do in the C++ proof checkers produced by the LFSC compiler.

      Aaron

  2. Interesting. I just tried it for myself, and indeed your code came out significantly faster: 0.23s versus 0.26s for a microbenchmark that does a hundred million “push,top,pop”. I guess all the “stuff” in the STL code makes it harder to compile: the “simple” version gets completely inlined into a handful of instructions, while the STL code calls out to a complicated insert method. In order to get optimal performance out of the STL version, gcc would probably have to guess that the fast path, where no resizing is needed, will be taken, and inline only that.

    Phrased that way, that seems like an ideal problem for those tracing compilers. Will the day come when Javascript beats C++? 🙂

    • Hi, Vilhelm. Thanks for the comment and the further experimentation. So I take it you saw that the emitted assembly code for the STL version got less optimized than for the simple version? That is interesting. I am thinking that even adding resizing to the simple stack should not be so bad, because in the common case, we will not take the path where we go off and resize the array. So yes, if we do have to do a resize, of course we expect a major slowdown at that point, but if we never need to resize, will we pay a big price for checking if we need to do so?

      I tried another little experiment, where I just comment out the error-checking if-statements in the stack functions. It seems to improve the time a little. I’ll give the numbers in user times for my experiments, since system times seem a bit erratic (what with having to read a 1GB file from disk). With the STL stack, 73s; my simple stack, 69s; and simple stack with no error checking, 67s. The hardware branch predictor may not be helping much in the error-checking case, because one of the values being compared (the index of the last element) is changing all the time. Don’t hardware branch predictors usually key off the program counter and also the values involved in the comparison?

      So the speedup is a bit more modest if I look at just user times.

      By the way, the profiling information I reported is for unoptimized executables. When the optimizer is turned on, everything is inlined, so it is hard to see what is happening.

      And yes, dynamic optimization could maybe help with this problem: although talk about increasing your trusted computing base!
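For concreteness, a microbenchmark in this spirit might look like the following (my own reconstruction, not Vilhelm's code; the checksum exists only to keep the compiler from deleting the loop):

```cpp
#include <cassert>
#include <stack>
#include <vector>

// Minimal fixed-capacity stack in the style of the post (error checks omitted).
struct SimpleStack {
  std::vector<int> arr;
  int last;
  explicit SimpleStack(int cap) : arr(cap), last(-1) {}
  void push(int t) { arr[++last] = t; }
  void pop() { last--; }
  int &top() { return arr[last]; }
};

// n rounds of push/top/pop; works for both std::stack<int> and SimpleStack.
template <typename S>
long long bench(S &s, int n) {
  long long acc = 0;
  for (int i = 0; i < n; i++) {
    s.push(i);
    acc += s.top();
    s.pop();
  }
  return acc;
}
```

Timing each call to bench (with std::chrono::steady_clock, say) over a hundred million iterations reproduces the kind of comparison described above.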

    • Mikhail Glushenkov (posted December 8, 2010 at 6:08 pm):

    Try preallocating a max_size vector with reserve() and using push_back/pop_back/back instead of push/pop/top.
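In sketch form, that suggestion maps the stack operations onto a preallocated vector like this (the capacity here is an arbitrary stand-in for the proof's maximum stack depth):

```cpp
#include <cstddef>
#include <vector>

// Treat a preallocated std::vector as a stack: reserve() does one allocation
// up front, so push_back never reallocates while we stay under capacity.
int vector_stack_demo(std::size_t capacity) {
  std::vector<int> s;
  s.reserve(capacity);
  s.push_back(7);       // push
  int t = s.back();     // top
  s.pop_back();         // pop
  return t;             // the value observed at the top
}
```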

    • Hi, Mikhail. Ok, I tried out your suggestion. I used std::vector for my stacks, and reserved (using the reserve method) the max size of a stack for each. Honestly, I was expecting this to work pretty well — but it does not. The wallclock time (optimized executable, as for the other numbers I reported) is 122s, with user time 106s. So this is way worse than std::stack and my Stack class.

      Thanks again for the idea, as it did seem promising.

      Aaron

      • Actually, hold the phone — my testing framework is behaving weirdly. I need to confirm what I just wrote.

        Aaron

  3. Hi, Mikhail. I did think about that idea, but somehow did not try it. I’ll give it a whirl (probably tomorrow) and report back.

    Thanks for the suggestion,
    Aaron

  4. Ok, I did try this once more from the beginning. And in the end the numbers are:

    1. my very simple stack: 70s (user time).
    2. STL stack: 74s (user time).
    3. STL vector: 71s (user time).
    4. plain arrays on the stack: 71s (user time).

    Now, I’m not sure I want to draw as big a conclusion as I was suggesting in my post based on these numbers, and I’m also a bit worried about my measurement setup and such.

    But the main target at this point is probably to improve the speed of just reading the large file into memory: gprof thinks lexing the file alone is taking half my time. I am thinking about trying mmap().

  5. Final (I hope) addendum:

    I did try mmap, and it did not improve performance at all over just fgetc (or fread, for that matter). Specifically, I tried pulling the file into memory using mmap in 10MB chunks.
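In sketch form, the chunked mmap read looks roughly like this (POSIX only; this toy version just sums the bytes where the real code would feed them to the lexer, and error handling is abbreviated):

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file in fixed-size chunks and sum its bytes (a stand-in for lexing).
// The 10MB chunk size is a multiple of the page size, as mmap offsets must be.
long sum_bytes_mmap(const char *path) {
  const off_t CHUNK = 10 * 1024 * 1024;
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  struct stat st;
  if (fstat(fd, &st) < 0) { close(fd); return -1; }
  long total = 0;
  for (off_t off = 0; off < st.st_size; off += CHUNK) {
    size_t len = (size_t)(st.st_size - off < CHUNK ? st.st_size - off : CHUNK);
    char *p = (char *) mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
    if (p == MAP_FAILED) { close(fd); return -1; }
    for (size_t i = 0; i < len; i++)
      total += (unsigned char) p[i];
    munmap(p, len);
  }
  close(fd);
  return total;
}
```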

    Aaron

