On many PL researchers’ lists of the most important papers ever in PL is “Types, Abstraction, and Parametric Polymorphism” by the great John C. Reynolds.  I have been studying the paper rather intensively for the past month or so, and would like to share some thoughts (and even a legend for the paper’s notation, at the end of the post!).

In this paper, Reynolds sets out to characterize mathematically what parametric polymorphism is.  Part of the idea is the familiar goal (in Computer Science) that different implementations of the same interface should be interchangeable: as long as they behave the same with respect to the interface, then no other piece of code should be able to tell the difference between them.  Another part of the idea is that type polymorphism can be used to enforce this, as different clients of some interface cannot discover details about the implementation that are hidden behind some type abstraction.

To move these ideas from the realm of intuition to theorems, Reynolds defines two semantics for types.  The first interprets types as sets.  For example, a type like $\textit{Nat} \to \textit{Nat}$ is interpreted as the set of all set-theoretic (mathematical) functions from natural numbers to natural numbers (assuming the type Nat gets interpreted as the set of natural numbers).  Programs in the polymorphic typed lambda calculus he considers are then interpreted as elements of those sets.  For example, $\lambda\ x : \textit{Nat}.\ x$ will be interpreted as the set-theoretic identity function on the set of natural numbers.  This part of the paper is not too surprising.  The type structure of the language enables a straightforward set-theoretic semantics for typed lambda calculus, at least until it comes to impredicative polymorphism.  Here Reynolds speculates in the paper that the semantics will work out, but discovers the next year that it does not (a problem shortly corrected by others who use a different background mathematical theory which avoids the problems raised by impredicativity).  But the paper actually starts out just with a simply typed language, where things work out cleanly, and then adds on polymorphism (where things do not).

The second semantics for types interprets them as binary relations on the sets given by the first semantics.  The relations are the ones familiar from the setting of logical relations.  For example, the relation associated with $\textit{Nat} \to \textit{Nat}$ relates two unary operations on the set of natural numbers iff those functions map related inputs to related outputs (related by the relation for Nat, which Reynolds will take to be the identity relation).  So they should map equal numeric inputs to equal numeric outputs.
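As a rough, finite sketch of this relational reading of function types, we can model a relation as a Boolean-valued predicate in Haskell (the names `Rel`, `idRel`, and `arrowRel` are mine, not Reynolds'; and real relations on function spaces are not decidable, so the code can only sample finitely many inputs rather than quantify over all of them):

```haskell
-- A binary relation on a set, modeled as a Boolean-valued predicate.
-- (A finite sketch only: Reynolds' relations are genuine set-theoretic
-- relations, and relatedness of functions is undecidable in general.)
type Rel a = a -> a -> Bool

-- The identity relation, which Reynolds assigns to Nat.
idRel :: Eq a => Rel a
idRel = (==)

-- The relation for a function type: related inputs map to related outputs.
-- We approximate the universal quantification by a finite list of samples.
arrowRel :: [a] -> Rel a -> Rel b -> Rel (a -> b)
arrowRel samples ra rb f g =
  and [ rb (f x) (g y) | x <- samples, y <- samples, ra x y ]
```

For instance, `arrowRel [0..10] idRel idRel (+1) succ` comes out True (the two functions agree on equal inputs), while `arrowRel [0..10] idRel idRel (+1) (*2)` does not.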

The part of the paper that was driving me crazy is the Identity Extension Lemma.  This says that if you have a type and you interpret all its free type variables as identity relations (on the sets that the first interpretation assigns to those type variables), then the interpretation of that type will also be the identity relation.  So identity relations are extended by the semantics from type variables to all types.
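Schematically (in my own notation, not the paper's): writing $\llbracket T \rrbracket^{\mathrm{set}}_{\sigma}$ for the set interpretation of $T$ under type environment $\sigma$, and $\llbracket T \rrbracket^{\mathrm{rel}}_{\eta}$ for the relational interpretation under relation environment $\eta$, the lemma says

$\llbracket T \rrbracket^{\mathrm{rel}}_{\eta} \;=\; \mathrm{Id}_{\llbracket T \rrbracket^{\mathrm{set}}_{\sigma}} \quad \text{whenever } \eta(\alpha) = \mathrm{Id}_{\sigma(\alpha)} \text{ for every free type variable } \alpha \text{ of } T.$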

Now, the vexing question to me was, why are we setting up all this machinery to define a bunch of identity relations?  There is one formalistic answer, which is that when it comes to polymorphism, Reynolds achieves Identity Extension by the expedient (not to say hack) of making the first interpretation of types invoke the second one (i.e., the relational one), to say that the set interpretation of a polymorphic type is the set of set-theoretic functions which satisfy Identity Extension.  Voila, Identity Extension holds now for polymorphic types!

But now the real answer — and I wish Reynolds had given more hints about this, or if he did, that I noticed them better: the point of the paper is not the relational semantics, though this is technically the most interesting part.  Really, we are just trying to define the set-theoretic semantics for polymorphic types, in such a way that all the functions we intend to be extensionally equivalent are soundly mapped down to the exact same mathematical object.  So we want to interpret mergesort and quicksort as the unique set-theoretic list-sorting function.  This is why Identity Extension is so important: it shows we did this correctly!  All the interesting equivalence has gotten squashed down in that first semantics.  Thus, the second semantics doesn’t have anything interesting to add.  But it is there as a specification of what we were trying to achieve.  For it is the description of the extensional equality we are trying to mash down with the first semantics.

The fact that the relational semantics is required by the set interpretation of polymorphism is, to me, a total cop-out.  It means that we only achieved the squashing down we wanted by including that extensional equality in our set interpretations.  So for the difficult part of the matter (i.e., polymorphism), it is as if we just said that the set interpretation of type A is the set of equivalence classes of terms, where the equivalence in question is the one given by the relational semantics.  If our goal was to achieve a set interpretation that satisfied that extensional equivalence (i.e., Identity Extension) but was independent of it, then at this point we failed.

Now, this is just the seminal paper in a pretty technically involved line of research.  It looks, though, in more modern treatments, as if the basic set-up is accepted: two semantics, mutually dependent.  That is clearly flawed, if the goal is as I just described (set interpretations which are independent of but satisfy the relational interpretations).  But perhaps that is not the goal of these more recent works?

If you think I am mistaken in any of the above, please perform your charitable deed of the day and correct me!  I am not an expert, just someone trying to understand…

As a service to other students of this paper, I am providing a short legend (reynolds-key) of notation for the first bit of it.  I will probably update this after our next reading-group meeting, when another member of the group is planning to add to it.

Victor Maia, of course.  I called for a simple example that shows non-optimality of ghc -O2, and Victor shared with me several awesome examples.  I modified the first of these to get something similar to what I was trying out in the previous post:

module Main where
import System.Environment

powstr :: Integer -> String
powstr x = show (x ^ x)

digit :: Int -> Int -> Char
digit x n = powstr (fromIntegral x) !! n

str :: Int -> String
str x = fmap (digit x) [0..(x-1)]

main :: IO ()
main = do
  (a : _) <- getArgs
  let n = read a :: Int
  putStrLn $ str n

This program repeatedly (by fmap’ing over [0..(x-1)]) computes the string representation of x ^ x to get the n’th digit of that representation, and rebuilds that string representation from all those digits.  This is not a good way to compute that string, of course, but compiling with ghc -O2 (version 8.6.5), I see this behavior:

$ time ./Another 5000 > /dev/null

real 0m2.051s
user 0m2.042s
sys 0m0.009s

Now compare with a different version of that program, which computes the string representation of x ^ x just once, and then rebuilds that by selecting each digit:

module Main where
import System.Environment

powstr :: Integer -> String
powstr x = show (x ^ x)

str :: Int -> String
str x =
  let s = powstr (fromIntegral x) in
  fmap ((!!) s) [0..(x-1)]

main :: IO ()
main = do
  (a : _) <- getArgs
  let n = read a :: Int
  putStrLn $ str n

Compiling this with -O2, I then see:

$ time ./Anotherl 5000 > /dev/null

real 0m0.075s
user 0m0.063s
sys 0m0.013s

But with optimal beta, the first version will run in exactly the same time as the second, as the string representation of x ^ x will only get computed once across all those calls (in the first program) to ‘digit’.

So the basic idea of this path to displaying non-optimality is: repeatedly compute something rather big (in this case, the string representation of x ^ x), extracting something from it each time.  The big thing you compute needs to be the same across your many extractions.  Optimal will compute the big thing once, but at least in this case, we manage to get ghc -O2 to compute it again and again.

The second program Victor sent is more artificial but also cooler:

module Main where

busy :: Int -> Int -> Int
busy 0 = \ x -> x
busy n = \ x ->
  let f = busy (n - 1)
  in f (f x)

main :: IO ()
main = print $ busy 32 0

Compiling with ghc -O2 I find:

$ time ./Busy
0

real 0m10.492s
user 0m10.491s
sys 0m0.000s

But with optimal reduction, this takes time linear (as opposed to exponential) in n: inductively, busy (n - 1) is the identity, and then busy n reduces to the identity.  Critically, we must reduce under the lambda in the body of busy for this to work; ghc won’t do that but optimal will.
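As a back-of-the-envelope sketch of the blowup (the function `work` is a name I'm introducing, a model of the operational reading rather than a measurement of ghc): without sharing work under the lambda, each level of busy applies the level below twice, so the work doubles per level.

```haskell
-- A sketch of the work done by 'busy n' when the lambda's body is
-- re-executed at every application: each level makes two full uses of
-- the level below, so work n = 2^(n+1) - 1, exponential in n.
work :: Integer -> Integer
work 0 = 1
work n = 1 + 2 * work (n - 1)
```

For example, work 10 is 2047, and work 32 is on the order of 2^33, which matches the roughly ten seconds observed above.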

Many thanks to Victor, Adam, and Lukasz for contributing to this discussion.  I hope you found it interesting.  Please comment if you have further thoughts on these experiments or optimal reduction vs mainstream compilation technology.

Optimal beta reduction is the most amazing technology you have (probably) never heard of.  Through more complex graph representations of lambda terms, optimal beta reduction schemes (and there are a couple) guarantee that normal forms for terms are reached in the optimal number of steps.  There is no duplication of work: any redex needed on the way to a normal form is guaranteed to be reduced just once.

Now, depending on how much you have heard of this before (and if you haven’t, please check out the awesome posts by Victor Maia), and your background, you may be thinking: but come on, the functional programming community has worked on efficient execution of languages based on lambda calculus for many decades.  Some of the most brilliant Computer Scientists in the world have studied this and developed very efficient ways to compile and execute programs in such languages.  And isn’t Haskell optimal anyway?  It implements call-by-need reduction so that should be the best you can do, right?

Wrong.  Algorithms for optimal beta reduction were developed starting with Lamping’s 1990 POPL paper (sorry, paywall) — so have not been around as long.  The literature is much sparser, possibly because the algorithms are very intricate and their correctness rests on quite technical work of Jean-Jacques Levy in the late 1970s, who managed to define a form of parallel reduction that he proved guaranteed optimality (but for which he did not have a proposal for implementation).  But unlike the efficient execution schemes implemented by the compiler for your favorite functional programming language, optimal beta reduction algorithms (and here I count just Lamping’s and an algorithm called lambdascope — these are the only ones I am aware of that claim optimality) are optimal.  You cannot trick them into computing redexes multiple times.

To help show the difference — and as I have been trying to understand the relationship between the rich literature in FP on compilation and optimal beta reduction — I performed a little experiment with Haskell.  Consider the following tiny Haskell program [modified from original post of a couple days ago]:

module Main where
import System.Environment

f x y _ = x ^ y
g x y = let r = x ^ y in \ _ -> r

main :: IO ()
main = do
  (a0 : a1 : a2 : a3 : _) <- getArgs
  let w = read a0 :: Bool
  let n = read a1 :: Integer
  let m = read a2 :: Integer
  let k = read a3 :: Integer
  let s = sum $ map (if w then f n m else g n m) [0..k]
  putStrLn $ if even s then "even" else "odd"



This program reads in a Bool that controls whether the code calls function f or function g.  Both those functions return x ^ y, ignoring their third argument; but g lifts the computation of x ^ y out of the trailing lambda, and f does not.  Then the code maps the chosen function, applied to n and m, over the list of numbers 0 through k.  The summation forces evaluation.

The critical point here is that when the code for ‘main’ calls f n m, the value of n ^ m could be pre-computed, before f n m is mapped over the list.  Optimal beta reduction would do this, only computing n ^ m once despite many applications of f n m.  If the ghc-produced executable were doing this, too, then the cost of running the program with inputs n, m, and k should be roughly the cost of computing n ^ m, plus the cost of traversing the list.  On the other hand, if ghc cannot manage to precompute n ^ m, then we would expect the cost to be more like k times the cost of n ^ m.  Let’s see what happens (on my laptop).

First compile with ghc with default optimization:

$ ghc --make Main
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...

Now run with function f:

$ time ./Main True 1000001 100000 1000 2
odd

real    0m8.103s
user    0m7.777s
sys 0m0.325s

Now with function g:

$ time ./Main False 1000001 100000 1000 2
odd

real    0m0.071s
user    0m0.053s
sys 0m0.018s

We see that running with g is much faster than running with f.

So what have we learned?  Executables produced by ghc with its default optimizations do not implement optimal reduction (they were not claimed to, just to be clear), and we can expose this with a simple example.

Now maybe you think there is no real lesson here: just showing that compiler X is not as fast as you can imagine it could be from the comfort of your armchair is neither sporting nor informative.  For surely one can cherry-pick this or that weird example and see something imperfect?

But this example is relying on a very basic feature of higher-order programming, namely the ability for functions to make use of variables bound outside their scopes.  Indeed, that lengthy literature I was mentioning is largely concerned exactly with how to implement this efficiently.  And guess what: those methods fall short on exactly the kind of test they should be able to handle well.  Dealing with open lambda terms (ones which contain variables bound outside their scopes) is also the chief source of difficulty for optimal beta reduction.  But those algorithms, despite their difficulties, actually solve the problem completely.

So why aren’t people using optimal beta?  The simple answer is that the machinery proposed to deal with open terms in those algorithms adds quite a bit of overhead to reduction (and makes implementation a scary prospect).  So almost for sure you would need heroic engineering to get anywhere close to ghc and other awesome compilers on examples outside this class where optimal easily crushes them.

What I am working on currently is how to solve the problem of open terms in optimal beta with a lot less machinery, to get something that could compete with those tools with just a realistic (for me!) amount of engineering.  No results yet, but some promising leads…

A final important update: this example does not exhibit different behavior if we compile with -O2 (thanks to commenter Adam for pointing this out).  I could not find a way to tweak the example to show non-optimality with -O2, despite some (maybe not sufficiently expert?) twiddling.

So this is an open call for small Haskell programs that exhibit non-optimal behavior with -O2.

Thanks for reading!  Feedback very welcome (particularly if you think I am out to lunch on this — please challenge/correct me).

Not the kind of reflection (I think!) one finds in Java, or even for meta-programming in Coq or Idris or such.  I am speaking of the principle from the algebra of programming that is called reflection (not quite sure why).  It is a property relating eliminations and constructions of datatypes.  It says that folding with the constructors is the identity.  For example, take lists in Haskell, with foldr (I specialized this from Foldable t to just [a])

foldr :: (a -> b -> b) -> b -> [a] -> b

and constructors

(:) :: a -> [a] -> [a]
[] :: [a]

Instantiating (implicitly) type variable b with [a], we can write

foldr (:) []

and this is equivalent to the identity function.  All it does is go through the list and interpret cons as cons, nil as nil.  So we have the valid equation

foldr (:) [] = id
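The equation cannot be proved inside Haskell, but it is easy to spot-check on sample lists (`reflectionHolds` is a name I'm introducing; this tests, rather than proves, the law):

```haskell
-- Spot-check of the reflection law foldr (:) [] = id on a given list.
reflectionHolds :: Eq a => [a] -> Bool
reflectionHolds xs = foldr (:) [] xs == id xs
```

For example, `all reflectionHolds [[], [1], [1,2,3]]` evaluates to True.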

It is known (although I am not sure where you can find this in a paper) that if you have parametricity and reflection, you can derive induction.  Unary parametricity for Church-encoded lists, say, is essentially the predicate

\ l : [a] .
  forall x : * . forall c : a -> x -> x . forall n : x .
  forall R : x -> * .
    (forall h : a . forall t : x . R t -> R (c h t)) ->
    R n ->
    R (l c n)

In words, it says that if you are doing a foldr with c and n, and n satisfies R and c preserves R, then R holds of the foldr of l with c and n (i.e., the elimination l c n).

Now, if we just instantiate this with (:) for c and [] for n, we have a statement of list induction — almost.  We get to conclude the predicate R for (l (:) []), not just for l.  But hey, if we have reflection, we know that doing a fold right with (:) and [] is the identity, so we can get R l from R (l (:) []).
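To make the elimination l c n and its instantiation with (:) and [] concrete, here is a sketch of Church-encoded lists in Haskell using RankNTypes (Cedille's actual encoding differs; the names `CList`, `fromList`, and `toList` are mine):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A Church-encoded list is its own fold: given a 'cons' c and a 'nil' n,
-- it returns the result of folding itself with them.
type CList a = forall x. (a -> x -> x) -> x -> x

fromList :: [a] -> CList a
fromList xs = \c n -> foldr c n xs

-- The elimination 'l c n' instantiated with the real constructors:
-- by reflection, this rebuilds the original list.
toList :: CList a -> [a]
toList l = l (:) []
```

So toList (fromList [1,2,3]) gives back [1,2,3]: eliminating with the constructors acts as the identity, which is exactly what lets us pass from R (l (:) []) to R l.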

That is great as a technical trick — and we have used this in Cedille — but it seems to come out of nowhere.  Actually, reflection is a consequence of initiality for the initial algebra of the datatype, so it is definitely not out of nowhere.  But still it is not clear why parametricity + reflection should yield induction.  What is the intuition for that?

Actually, I have a little intuition to offer.  With induction, we are trying to summarize an infinite number of cases in some finite way.  This finite way is with the constructors for the datatype.  There are infinitely many natural numbers, say, but they are all constructed by finite application of successor to zero.  So the constructors summarize our infinite set of values of the datatype.

If we are in the mindset — as we are here in Cedille land — of not taking constructors and datatypes for granted, we might ask: how do you know you correctly summarized that infinite set of values using the functions you call constructors (since with a lambda encoding, the constructors are just more lambda terms, not different in kind from all the others you are using)?  My intuition about reflection is that it contributes an essential property towards knowing you have the correct summary of your infinite set of data.  You offer some set of functions and claim that all the data can be built using those.  For this to be correct, we would reasonably expect that doing an elimination with a piece of data and those functions should rebuild that piece of data.  Otherwise, what kind of representation scheme do you have?  Perhaps your functions define some permutation of the data, so that eliminating l with your functions f and x would result in a different list l’?  At least this seems exotic.  And maybe one could argue that if reflection does not hold, then parametricity will not give you induction, except for predicates that respect your representation scheme (i.e., for predicates where the elimination does act like the identity).

So for purposes of deriving induction from parametricity, I guess we can say that reflection is telling us that our representation scheme is universal: any predicate we can prove from parametricity will also hold inductively.

Writing this out suggests the intriguing possibility that using different representation schemes, one could derive induction for predicates respecting the scheme, without needing reflection.  Or similarly, one could demand, in the statement of induction, that the predicate one is proving respects reflection, and then reflection would not need to be proved once and for all.

Hope you are in good health, and thanks for reading.

I decided to start recording a podcast, called “The Iowa Type Theory Commute”.  I don’t have the podcast listed in any directories yet, but you can find it on my web page here:

http://www.cs.uiowa.edu/~astump/ittc.html