
I will be posting again about lambda-encoding data, but I wanted to share one interesting experience first: applying a fuzzer to our verified SAT solver versat, primarily implemented by Duckki Oe, who is, by the way, currently looking for a postdoc or possibly an industrial verification position. This solver took Duckki, with some contributions from me, Corey Oliver, and Kevin Clancy, two years to implement. Along the way, he did find and fix a couple of bugs in parts of the solver we were not verifying. For example, at some point there was a bug in the watched lists, so new clauses were never indexed. That affected performance, and took a while to detect. But clause indexing is not needed for UNSAT-soundness (that is, correctness of UNSAT answers, which is what we verified). Duckki also debugged some SAT-unsoundness issues, which were (easily) detected by checking the model at the end of solving. So writing verified code does not mean there is no debugging to do.

But!

Certainly for the properties we verified, we have not found a bug in evaluating versat. Such a bug would represent an error in our specification, which is short (consisting of the definition of propositional resolution and a parser for CNF formulas in DIMACS format); or else a bug in the Guru compiler or logic. The latter is not out of the question, but unlikely to manifest itself as accepting an invalid proof. And just today I followed the lead of John Regehr in actually trying to find bugs in a piece of code that has been proven correct. After all, versat is UNSAT-sound in theory, but maybe not in practice! John did this for the CompCert verified optimizing C compiler, and after expending six compute-years searching for bugs, could not find even one in the verified core of the compiler (but found several in the parser and similar unverified parts). I used the cnfuzz tool by Robert Brummayer, Florian Lonsing, and Armin Biere. I wrote a simple wrapper script that calls cnfuzz to generate random, nasty CNF benchmarks and feeds them to versat and to MiniSat 2.2.0; then I just compared their answers (a sketch of this kind of wrapper appears below). I ran 10,000 (different) benchmarks this way (a far smaller evaluation than John reported for CompCert). Of these, versat timed out on 27 with a 60-second time limit. For all the rest, the two tools gave the same answers.
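The wrapper itself is just a few lines of scripting. Here is a minimal sketch, in Python, of the kind of differential loop I mean; the solver paths, the cnfuzz seed argument, and the assumption that both solvers print SATISFIABLE or UNSATISFIABLE are illustrative assumptions, not the exact script I ran.

```python
#!/usr/bin/env python3
# Differential-fuzzing sketch: generate random CNFs with cnfuzz,
# run two solvers on each, and flag any disagreement.
# Solver invocations and output formats below are assumptions for illustration.
import subprocess
import sys

TIMEOUT = 60       # seconds per solver run, as in the experiment
NUM_RUNS = 10000   # number of random benchmarks

def solve(cmd, cnf_path):
    """Run a solver and classify its answer as SAT, UNSAT, TIMEOUT, or UNKNOWN."""
    try:
        out = subprocess.run(cmd + [cnf_path], capture_output=True,
                             text=True, timeout=TIMEOUT).stdout
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    # Check UNSATISFIABLE first, since it contains SATISFIABLE as a substring.
    if "UNSATISFIABLE" in out:
        return "UNSAT"
    if "SATISFIABLE" in out:
        return "SAT"
    return "UNKNOWN"

for seed in range(NUM_RUNS):
    cnf = "test.cnf"
    # Assumed cnfuzz usage: given a seed, it writes a random DIMACS CNF to stdout.
    with open(cnf, "w") as f:
        subprocess.run(["./cnfuzz", str(seed)], stdout=f, check=True)
    a = solve(["./versat"], cnf)    # hypothetical versat invocation
    b = solve(["./minisat"], cnf)   # hypothetical MiniSat 2.2.0 invocation
    if "TIMEOUT" in (a, b):
        print(f"seed {seed}: timeout ({a} / {b})")
    elif a != b:
        print(f"seed {seed}: DISAGREEMENT ({a} vs {b})")
        sys.exit(1)
print("all runs agreed")
```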

Let me try to emphasize just how remarkable this is. A modern SAT solver is not a huge amount of code (typically), but it is highly optimized and very tricky. I would expect quite a bit of testing before such a tool, written in a conventional language, is reliable. For example, way back in the day, when I was a grad student working on the CVC SMT solver (the first version, with subsequent versions implemented by Clark Barrett and his students and by my colleague Cesare Tinelli and his students), I remember many fond hours (days, weeks) spent with pieces of paper taped together, covered in handwritten annotated formulas, as I tried to figure out why the solver was giving the wrong answer in some case. Now, an SMT solver is truly a lot more complicated than a SAT solver, and usually has a much bigger codebase. But still, I would fully expect similar pain in getting a modern SAT solver up and running reliably.

For versat, modulo a couple of earlier bug fixes in unverified parts of the solver, it works perfectly right out of the gate. We finally got it through the type checker after six months of proving lemmas about its data structures and algorithms, and it is completely correct, at least as far as testing on 10,000 random benchmarks against MiniSat reveals. Notice that it is not just the UNSAT answers that are right; the SAT ones are right, too. I have noticed this in developing real verified software that you actually intend to run (as opposed to just formalize, prove correct, and discard): verifying some properties tends to increase overall correctness, even for properties you don't verify. The verification is so ridiculously hard, and requires such deep involvement in what the tool does, that you just end up with a higher quality piece of code in the end, if you get something working at all, which is a very difficult proposition requiring highly skilled engineering talent (did I mention Duckki is looking for postdoc and industrial verification positions?).

So, in the end, I feel this is great confirmation (and it was a great idea of John Regehr's to do this in general) of the quality of a verified piece of software. Without extensive testing and debugging, versat exactly matches MiniSat's behavior on 10,000 benchmarks generated by cnfuzz (modulo the 0.27% of runs that timed out). This is what a verified-programming proponent dreams of: you write your code, you verify it with blood, sweat, and tears, and it just works perfectly at the end. It is a great feeling, and I am looking for additional interesting case studies so I can repeat it.

2 Comments

  1. This is a great study. I think it’s definitely the case that verification increases software quality in ways unrelated to the verified properties, just by forcing more precise specification and keeping us from lying to ourselves.

    Since fuzzing is relatively easy, it always needs to be done for systems whose correctness we care about. The interesting question is: how should we do fuzzing differently when fuzzing a verified system? I have some random ideas, but it's not an easy question.

    I think our total CompCert bug count was 13. About half were what I'd call specification bugs; the others were implementation errors in unverified parts. SAT happens to have a particularly concise specification, but not all systems are so lucky!

  2. Hi, John. Thanks for the comments! That is a very interesting thought, that maybe we should fuzz verified systems differently. I’m looking forward to talking with you more about this when you visit next week. And yes, I chose SAT solving as a case study in language-based verification partly because it has a simple spec for the whole program. Other good characteristics for verification:

    — it is a relatively important piece of software, particularly for something of its size (a couple thousand lines of C, say)
    — performance is critical
    — there are many other implementations, so one can compare against them
    — the things we chose to verify were nontrivial, but not Herculean (cf. seL4).

