
Break times for me are often times to do some coding, and this summer was no exception. I decided to start working on a new version of Cedille, dubbed Cedille Omega because the change to the core type theory is so large as to warrant skipping an infinite number of version numbers from Cedille 1. Ha. Naturally, I was expecting to have something working by the end of the summer, and I do! I can correctly process comments.

Oh my. Did I really spend a month or so on comments? Well, my work time is in a vise due to childcare under the pandemic, but still that seems extravagant. And there is more than just comment support, because I set up an emacs mode communicating with a backend written in Haskell, from scratch. Some things I did not know too well how to do, like setting up a bigger Haskell codebase with cabal (with my own nested modules, for example). This was not hard, but engineering takes time. The elisp side is more “interesting” because of dynamic typing and elisp weirdness.

But back to comments. In Cedille Omega, there are three kinds:

  • comment to end of line
  • comment to start of line
  • nestable comment

Comment to end of line begins with --, as in Haskell. It can also begin with -., to match comment to start of line, which ends with .- and is thus its mirror image. Nestable comments are delimited with {- and -}, as in Haskell.  I have not seen comment to start of line elsewhere, but it seems like an obvious complement to comment to end of line.
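
For concreteness, here is how the three kinds look on a few made-up lines (only the comment delimiters are Cedille Omega syntax; the surrounding words are just placeholder text):

some text here                   -- comment to end of line
some text here                   -. also a comment to end of line
comment up to here .- some text here
{- a nestable {- and nested -} comment -}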

I implemented, in the Haskell backend, what I am calling “blanking semantics” for comments. Blanking means to replace the whole comment, including start- and end-comment characters, with spaces. This is handy for indentation-sensitive parsing of non-comment text. There are some conflicts between the three kinds of comments, which are resolved by applying blanking to the different types of comment in order.  So first we blank comments to end of line.  This will erase start- and end-comment characters for all types of comments.  So if you have a line like

hi -- there .-

the -- will win out over the .-, and blanking will leave only the “hi”.

Finally, dangling start- or end-characters for nestable comments (which can indeed be nested, as in Haskell) comment out text to the end of the file or to the beginning of the file, respectively.

This semantics was easy to implement in three passes in Haskell.  Because blanking comments to start of line seemed easier with the string reversed, my code uses one core function that blanks comments to end of line and returns the string reversed.  So you first blank end-of-line comments, getting a reversed string; running the same pass again then blanks comments to start of line (their .- terminators read as -. openers in the reversed string) and restores the original orientation.  Then a recursive function blanks whatever nestable comments are still standing.  A little extra logic is needed to watch for dangling end-characters for nestable comments.
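
Here is a minimal sketch of the two line-comment passes in Haskell.  It is my illustration for this post rather than the actual Cedille Omega code: the names blankEol, blankSol, and blankLineComments are made up, the nestable-comment pass and the range bookkeeping are omitted, and the reversal is kept explicit in blankSol instead of being built into the end-of-line pass.

module Blank where

import Data.List (isPrefixOf)

-- Blank an end-of-line comment (starting with -- or -.) up to, but not
-- including, the next newline, replacing every character with a space.
blankEol :: String -> String
blankEol [] = []
blankEol s@(c:cs)
  | "--" `isPrefixOf` s || "-." `isPrefixOf` s =
      let (comment, rest) = break (== '\n') s
      in map (const ' ') comment ++ blankEol rest
  | otherwise = c : blankEol cs

-- Blank comments to start of line (ending with .-) by reversing the
-- string, reusing blankEol (a reversed .- reads as a -. opener), and
-- reversing back.
blankSol :: String -> String
blankSol = reverse . blankEol . reverse

-- Apply the passes in the order described above: end-of-line first, so
-- that -- and -. win out over any .- later on the same line.
blankLineComments :: String -> String
blankLineComments = blankSol . blankEol

For example, blankLineComments "hi -- there .-" leaves just "hi" followed by spaces, matching the example above.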

During blanking, the code builds up character ranges for the different kinds of comments.  The backend then communicates these to the front end, also indicating which kind of comment each one is.  The latter information is useful for setting the stickiness of text properties in emacs.

You can assign characters text properties, which are retained as text is manipulated (moved, copied, etc.).  I decided to try using the category property, which gives a layer of indirection: you assign the properties you want (like face, which controls the display of the character) to an elisp symbol, and then make that symbol the value of the category property for a piece of text.  Then all the properties you bound to the symbol are applied to the text.  Nifty.  All comments get the same face for comment highlighting.  But I made different categories for the different kinds of comments, so I could set the stickiness of text properties differently for them.

Text properties in emacs can be front- or rear-sticky, which means they are inherited by new characters inserted just before or just after, respectively, the text carrying the properties.  Comments to end of line should be rear-sticky but not front-sticky, and vice versa for comments to beginning of line.  Nestable comments should be front- and rear-sticky inside the delimiters, but start-characters should not be front-sticky and end-characters should not be rear-sticky.  So when you type new text right next to a comment, the comment highlighting is inherited or not, appropriately.

Just posting this in case you, too, care to comment.

A little while ago, I posted a challenge to come up with examples where optimal beta-reduction would reduce a term efficiently, but call-by-need evaluation in Haskell (with optimizations on) would be inefficient. A little while ago? Wow, it was last summer! Well, summer time for me has become time to think about optimal beta-reduction. What a hobby, right? And the semester is ending here at U. Iowa, so here it goes again.

Anyhow, I have a little bit of a new challenge — and this time a solution to it (instead of needing help to find a solution: thank you friends!). The examples I got last time (from Victor Maia, many thanks) were all ones that reduce efficiently with optimal beta, but not with call-by-need. You can watch Haskell grind and grind on examples that are instantaneous with optimal beta. But these examples are — or should be; more on that shortly — efficiently executable with normal-order reduction [edit: actually the strategy is a little more than that; see below]. This means that theorem provers like Coq and Agda should be able to reduce them efficiently.

Ok, so the new challenge is to find an example that is inefficient with normal-order reduction (and hence would certainly also be in Haskell), but would be efficient with optimal beta. In a sense Victor already solved this challenge with his very inspiring Medium posts of a couple years back (like this one), as these are examples that execute asymptotically much faster with optimal beta than with call-by-need, and normal-order reduction does not help as they are actually not higher order. But I am hoping for a little simpler example, exhibiting exponential behavior under normal-order reduction. Also, I would like to compare with normal-order reduction explicitly.

The solution (suggested by Victor) to my first challenge is this one:

busy :: Int -> Int -> Int
busy 0 = \ x -> x
busy n = \ x ->
  let f = busy (n - 1)
  in f (f x)

Asking ghc to evaluate something like busy 34 0 will take a while; call-by-need evaluation is exponential in n. But busy n is always just the identity function, and optimal beta will take time linear in n. But (and this realization is what spurred my interest in the new challenge), normal-order evaluation will also be linear in n, because we will reduce busy (n - 1) to the identity function, and then f (f x) reduces in just two beta-reductions to x. [edit: for this to work, we need to use a version of normal-order reduction that evaluates let-terms strictly; let x = t1 in t2 needs t1 to be evaluated first, then t2]
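
To make the linear behavior concrete, here is a hand-worked sketch of busy 2 0 under that let-strict normal-order strategy (my own illustration, not output from any tool):

busy 2 0
  = (\ x -> let f = busy 1 in f (f x)) 0
  = let f = busy 1 in f (f 0)            -- strict let: evaluate busy 1 first
busy 1
  = \ x -> let f = busy 0 in f (f x)
  = \ x -> let f = (\ y -> y) in f (f x) -- busy 0 is the identity
  = \ x -> x                             -- a couple of beta steps
busy 2 0
  = let f = (\ x -> x) in f (f 0)
  = 0                                    -- two more beta steps

Each level of the recursion contributes only a constant number of steps, which is where the linear bound comes from.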

I tried this out in Cedille, which implements [this version of] normal-order evaluation, and sure enough, the runtime is extremely fast, linear in n. Our interpreter is very naive, so evaluating busy 64 would definitely explode if evaluation were exponential — but it completes in a second or two.

Now what surprised me is that this example takes exponential time in both Coq and Agda! Weird, right? Why is that? I am puzzled, and emailed the Agda list to see if anyone has any ideas. [edit: Agda inlines let-definitions, causing the example to blow up.]

But here is a modified version of the example that makes Cedille choke (and certainly Coq and Agda too; I confirmed Haskell as well):

{-# LANGUAGE RankNTypes #-}   -- RankNTypes (at least) is needed for the polymorphic Id synonym

type Id = forall a . a -> a

busy :: Int -> Id -> Id
busy 0 = \ x -> x x
busy n = \ x ->
  let f = busy (n - 1)
  in f (f x)

Evaluating busy n builds a lambda term which takes in x and then builds a balanced binary tree of depth O(n) of applications of x. For example, Cedille’s normal-order evaluator will show this result for busy 2:

λ x . x x (x x) (x x (x x)) (x x (x x) (x x (x x)))

Here we see a nice example of the power of optimal beta, because optimal beta will represent this exponential-sized term in linear space!  And applying busy n to the identity function will reduce in linear time.  The key point about this example, beyond the first version of busy, is that it copies what the Asperti and Guerrini book call “virtual redexes”, namely applications of variables.  Those variables could become instantiated later, creating redexes, and we had better not copy any redexes (ever!) if we want to be optimal.  So copying a virtual redex could lead to non-optimality.
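
To see a virtual redex being copied, here is a hand-worked trace of busy 1 from this second version (again my own illustration, not tool output):

busy 1
  = \ x -> let f = busy 0 in f (f x)
  = \ x -> (\ y -> y y) ((\ y -> y y) x)
  = \ x -> (\ y -> y y) (x x)
  = \ x -> x x (x x)

The application x x gets duplicated in the last step.  It is not a redex yet, but if x is later instantiated with a lambda, each copy becomes a separate redex to reduce; optimal beta instead keeps a single shared copy of it.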

On many PL researchers’ lists of the most important papers ever in PL is “Types, Abstraction, and Parametric Polymorphism” by the great John C. Reynolds.  I have been studying the paper rather intensively for the past month or so, and would like to share some thoughts (and even a legend for the paper’s notation, at the end of the post!).

In this paper, Reynolds sets out to characterize mathematically what parametric polymorphism is.  Part of the idea is the familiar goal (in Computer Science) that different implementations of the same interface should be interchangeable: as long as they behave the same with respect to the interface, then no other piece of code should be able to tell the difference between them.  Another part of the idea is that type polymorphism can be used to enforce this, as different clients of some interface cannot discover details about the implementation that are hidden behind some type abstraction.

To move these ideas from the realm of intuition to theorems, Reynolds defines two semantics for types.  The first interprets types as sets.  For example, a type like \textit{Nat} \to \textit{Nat} is interpreted as the set of all set-theoretic (mathematical) functions from natural numbers to natural numbers (assuming the type Nat gets interpreted as the set of natural numbers).  Programs in the polymorphic typed lambda calculus he considers are then interpreted as elements of those sets.  For example, \lambda\ x : \textit{Nat}.\ x will be interpreted as the set-theoretic identity function on the set of natural numbers.  This part of the paper is not too surprising.  The type structure of the language enables a straightforward set-theoretic semantics for typed lambda calculus, at least until it comes to impredicative polymorphism.  Here Reynolds speculates in the paper that the semantics will work out, but discovers the next year that it does not (a problem shortly corrected by others, who used a different background mathematical theory which avoids the problems raised by impredicativity).  But the paper actually starts out just with a simply typed language, where things work out cleanly, and then adds on polymorphism (where things do not).

The second semantics for types interprets them as binary relations on the sets given by the first semantics.  The relations are the ones familiar from the setting of logical relations.  For example, the relation associated with \textit{Nat} \to \textit{Nat} relates two unary operations on the set of natural numbers iff those functions map related inputs to related outputs (related by the relation for Nat, which Reynolds will take to be the identity relation).  So they should map equal numeric inputs to equal numeric outputs.
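
In symbols (my paraphrase, not the paper’s exact notation), writing R_A for the relation assigned to a type A:

f \mathrel{R_{A \to B}} g \;\iff\; \forall a, a'.\ a \mathrel{R_A} a' \Rightarrow f\,a \mathrel{R_B} g\,a'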

The part of the paper that was driving me crazy is the Identity Extension Lemma.  This says that if you have a type and you interpret all its free type variables as identity relations (on the sets that the first interpretation assigns to those type variables), then the interpretation of that type will also be the identity relation.  So identity relations are extended by the semantics from type variables to all types.
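
Stated roughly, in the same notation (again my paraphrase): if every free type variable \alpha of a type T is interpreted as the identity relation \mathrm{Id}_{S_\alpha} on the set S_\alpha interpreting \alpha, then

R_T = \mathrm{Id}_{S_T}

that is, the relation interpreting T is itself the identity relation on the set S_T interpreting T.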

Now, the vexing question to me was, why are we setting up all this machinery to define a bunch of identity relations?  There is one formalistic answer, which is that when it comes to polymorphism, Reynolds achieves Identity Extension by the expedient (not to say hack) of making the first interpretation of types invoke the second one (i.e., the relational one), to say that the set interpretation of a polymorphic type is the set of set-theoretic functions which satisfy Identity Extension.  Voila, Identity Extension holds now for polymorphic types!
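
Concretely (and again this is my gloss in modern notation, not the paper’s): the set interpretation of \forall \alpha.\, T is taken to be the collection of set-indexed families f, with f_A \in S_T^{[\alpha \mapsto A]} for every set A, satisfying

\forall A_1, A_2,\ \forall R \subseteq A_1 \times A_2:\quad f_{A_1} \mathrel{R_T^{[\alpha \mapsto R]}} f_{A_2}

so the relational semantics appears inside the definition of the set semantics, which is exactly the mutual dependence discussed below.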

But now the real answer — and I wish Reynolds had given more hints about this, or if he did, that I noticed them better: the point of the paper is not the relational semantics, though this is technically the most interesting part.  Really, we are just trying to define the set-theoretic semantics for polymorphic types, in such a way that all the functions we intend to be extensionally equivalent are soundly mapped down to the exact same mathematical object.  So we want to interpret mergesort and quicksort as the unique set-theoretic list-sorting function.  This is why Identity Extension is so important: it shows we did this correctly!  All the interesting equivalence has gotten squashed down in that first semantics.  Thus, the second semantics doesn’t have anything interesting to add.  But it is there as a specification of what we were trying to achieve.  For it is the description of the extensional equality we are trying to mash down with the first semantics.

The fact that the relational semantics is required by the set interpretation of polymorphism is, to me, a total cop-out.  It means that we only achieved the squashing down we wanted by including that extensional equality in our set interpretations.  So for the difficult part of the matter (i.e., polymorphism), it is as if we just said that the set interpretation of type A is the set of equivalence classes of terms, where the equivalence in question is the one given by the relational semantics.  If our goal was to achieve a set interpretation that satisfied that extensional equivalence (i.e., Identity Extension) but was independent of it, then at this point we failed.

Now, this is just the seminal paper in a pretty technically involved line of research.  It looks, though, in more modern treatments, as if the basic set-up is accepted: two semantics, mutually dependent.  That is clearly flawed, if the goal is as I just described (set interpretations which are independent of but satisfy the relational interpretations).  But perhaps that is not the goal of these more recent works?

If you think I am mistaken in any of the above, please perform your charitable deed of the day and correct me!  I am not an expert, just someone trying to understand…

As a service to other students of this paper, I am providing a short legend (reynolds-key) of notation for the first bit of it.  I will probably update this after our next reading-group meeting, when another member of the group is planning to add to it.

Victor Maia, of course.  I called for a simple example that shows non-optimality of ghc -O2, and Victor shared with me several awesome examples.  I modified the first of these to get something similar to what I was trying out in the previous post:

module Main where
import System.Environment

powstr :: Integer -> String
powstr x = show (x ^ x)

digit :: Int -> Int -> Char
digit x n = powstr (fromIntegral x) !! n

str :: Int -> String
str x = fmap (digit x) [0..(x-1)]

main :: IO ()
main = do
 (a : _ ) <- getArgs
 let n = read a :: Int
 putStrLn $ str n

This program repeatedly (by fmap’ing over [0..(x-1)]) computes the string representation of x ^ x to get the n’th digit of that representation, and rebuilds that string representation from all those digits.  This is not a good way to compute that string, of course, but compiling with ghc -O2 (version 8.6.5), I see this behavior:

$ time ./Another 5000 > /dev/null

real 0m2.051s
user 0m2.042s
sys 0m0.009s

Now compare with a different version of that program, which computes the string representation of x ^ x just once, and then rebuilds that by selecting each digit:

module Main where
import System.Environment

powstr :: Integer -> String
powstr x = show (x ^ x)

str :: Int -> String
str x =
 let s = powstr (fromIntegral x) in
  fmap ((!!) s) [0..(x-1)]

main :: IO ()
main = do
 (a : _ ) <- getArgs
 let n = read a :: Int
 putStrLn $ str n


Compiling this with -O2, I then see:

$ time ./Anotherl 5000 > /dev/null

real 0m0.075s
user 0m0.063s
sys 0m0.013s


But with optimal beta, the first version will run in exactly the same time as the second, as the string representation of x ^ x will only get computed once across all those calls (in the first program) to ‘digit’.

So the basic idea of this path to displaying non-optimality is: repeatedly compute something rather big (in this case, the string representation of x ^ x), extracting something from it each time.  The big thing you compute needs to be the same across all your many extractions.  Optimal will compute the big thing once, but, at least in this case, we manage to get ghc -O2 to compute it again and again.

The second program Victor sent is more artificial but also cooler:

module Main where

busy :: Int -> Int -> Int
busy 0 = \ x -> x
busy n = \ x ->
 let f = busy (n - 1)
 in f (f x)

main :: IO ()
main = print $ busy 32 0

Compiling with ghc -O2 I find:

antioch:~/optimal-beta/haskell$ time ./Busy 
0

real 0m10.492s
user 0m10.491s
sys 0m0.000s


But with optimal reduction, this takes time linear (as opposed to exponential) in n: inductively, busy (n - 1) is the identity, and then busy n reduces to the identity.  Critically, we must reduce under the lambda in the body of busy for this to work; ghc won’t do that, but optimal will.

Many thanks to Victor, Adam, and Lukasz for contributing to this discussion.  I hope you found it interesting.  Please comment if you have further thoughts on these experiments or optimal reduction vs mainstream compilation technology.

Optimal beta reduction is the most amazing technology you have (probably) never heard of.  Through more complex graph representations of lambda terms, optimal beta reduction schemes (and there are a couple) guarantee that normal forms for terms are reached in the optimal number of steps.  There is no duplication of work: any redex needed on the way to a normal form is guaranteed to be reduced just once.

Now, depending on how much you have heard of this before (and if you haven’t, please check out the awesome posts by Victor Maia), and your background, you may be thinking: but come on, the functional programming community has worked on efficient execution of languages based on lambda calculus for many decades.  Some of the most brilliant Computer Scientists in the world have studied this and developed very efficient ways to compile and execute programs in such languages.  And isn’t Haskell optimal anyway?  It implements call-by-need reduction so that should be the best you can do, right?

Wrong.  Algorithms for optimal beta reduction were developed starting with Lamping’s 1990 POPL paper (sorry, paywall) — so have not been around as long.  The literature is much sparser, possibly because the algorithms are very intricate and their correctness rests on quite technical work of Jean-Jacques Levy in the late 1970s, who managed to define a form of parallel reduction that he proved guaranteed optimality (but for which he did not have a proposal for implementation).  But unlike the efficient execution schemes implemented by the compiler for your favorite functional programming language, optimal beta reduction algorithms (and here I count just Lamping’s and an algorithm called lambdascope — these are the only ones I am aware of that claim optimality) are optimal.  You cannot trick them into computing redexes multiple times.

To help show the difference — and as I have been trying to understand the relationship between the rich literature in FP on compilation and optimal beta reduction — I performed a little experiment with Haskell.  Consider the following tiny Haskell program [modified from original post of a couple days ago]:

module Main where
import System.Environment

f x y _ = x ^ y
g x y = let r = x ^ y in \ _ -> r 

main :: IO ()
main = do
  (a0 : a1 : a2 : a3 : _ ) <- getArgs
  let w = read a0 :: Bool
  let n = read a1 :: Integer
  let m = read a2 :: Integer
  let k = read a3 :: Integer
  let s = sum $ map (if w then (f n m) else (g n m)) [0..k]
  
  putStrLn $ if even s then "even" else "odd"

This program reads in a Bool that controls whether the code calls function f or function g.  Both those functions return x ^ y, ignoring their third argument; but g lifts the computation of x ^ y out of the trailing lambda, and f does not.  Then the code maps the chosen function, applied to n and m, over the list of the numbers 0 through k.  There is a summation to force evaluation.

The critical point here is that when the code for ‘main’ calls f n m, the value of n ^ m could be pre-computed, before f n m is mapped over the list.  Optimal beta reduction would do this, only computing n ^ m once despite many applications of f n m.  If the ghc-produced executable were doing this, too, then the cost of running the program with inputs n, m, and k should be equal (roughly) to the cost of computing n to the m, plus the cost of traversing the list.  On the other hand, if ghc cannot manage to precompute n ^ m, then we would expect to see the cost to equal more like k times the cost for n ^ m.  Let’s see what happens (on my laptop).

First compile with ghc with default optimization:

$ ghc --make Main
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking Main ...


Now run with function f:

$ time ./Main True 1000001 100000 1000 2
odd

real    0m8.103s
user    0m7.777s
sys 0m0.325s


Now with function g:

$ time ./Main False 1000001 100000 1000 2
odd

real    0m0.071s
user    0m0.053s
sys 0m0.018s


We see that running with g is much faster than running with f.

So what have we learned?  Executables produced by ghc with its default optimizations do not implement optimal reduction (they were not claimed to, just to be clear), and we can expose this with a simple example.

Now maybe you think there is no real lesson here: just showing that compiler X is not as fast as you can imagine it could be from the comfort of your armchair is neither sporting nor informative.  For surely one can cherry-pick this or that weird example and see something imperfect?

But this example is relying on a very basic feature of higher-order programming, namely the ability for functions to make use of variables bound outside their scopes.  Indeed, that lengthy literature I was mentioning is largely concerned exactly with how to implement this efficiently.  And guess what: those methods fall short on exactly the kind of test they should be able to handle well.  Dealing with open lambda terms (ones which contain variables bound outside their scopes) is also the chief source of difficulty for optimal beta reduction.  But those algorithms, despite their difficulties, actually solve the problem completely.

So why aren’t people using optimal beta?  The simple answer is that the machinery proposed to deal with open terms in those algorithms adds quite a bit of overhead to reduction (and makes implementation a scary prospect).  So almost for sure you would need heroic engineering to get anywhere close to ghc and other awesome compilers on examples outside this class where optimal easily crushes them.

What I am working on currently is how to solve the problem of open terms in optimal beta with a lot less machinery, to get something that could compete with those tools with just a realistic (for me!) amount of engineering.  No results yet, but some promising leads…

A final important update: this example does not exhibit different behavior if we compile with -O2 (thanks to commenter Adam for pointing this out).  I could not find a way to tweak the example to show non-optimality with -O2, despite some (maybe not sufficiently expert?) twiddling.

So this is an open call for small Haskell programs that exhibit non-optimal behavior with -O2.

Thanks for reading!  Feedback very welcome (particularly if you think I am out to lunch on this — please challenge/correct me).