# theories of testing

Here’s a fun fact for you. Consider the type of unlabelled binary trees, with its catamorphism/iterator/fold thingy.

```haskell
data Tree = Leaf | Node Tree Tree

cata :: t -> (t -> t -> t) -> Tree -> t
cata leaf node Leaf       = leaf
cata leaf node (Node l r) = node (cata leaf node l) (cata leaf node r)
```

I have in mind a particular function

```haskell
myf :: Tree -> Tree
myf = cata myl myn
  where
    myl :: Tree
    myl = ...              -- built with constructors
    myn :: Tree -> Tree -> Tree
    myn myf_l myf_r = ...  -- built with constructors, myf_l and myf_r
```

and I describe to you what it does. Perhaps you think you can do the same job. You hand me yrf, which you promise you implemented in the same way.

```haskell
yrf :: Tree -> Tree
yrf = cata yrl yrn
  where
    yrl :: Tree
    yrl = ...              -- built with constructors
    yrn :: Tree -> Tree -> Tree
    yrn yrf_l yrf_r = ...  -- built with constructors, yrf_l and yrf_r
```

Assuming your promise about the implementation is in good faith, I test

```haskell
and [ myf t == yrf t
    | t <- [ Leaf
           , Node Leaf Leaf
           , Node (Node Leaf Leaf) Leaf
           , Node Leaf (Node Leaf Leaf) ] ]
```

and the result is True. Congratulations! You got it right!

I don’t need to be given yrl and yrn. I just need to know that you built yrf that way. Of course, I can obtain yrl by evaluating yrf Leaf. Now that I know what yrl is, evaluating yrf (Node Leaf Leaf) gives me a bunch of candidates for yrn. I just look for occurrences of yrl as a subterm of yrf (Node Leaf Leaf), because I know that yrf_l and yrf_r will both have been yrf Leaf = yrl when yrn was called. The trouble is that I don’t know which occurrences of yrl are yrf_l, which are yrf_r, and which are just constants by an astounding coincidence. So I perturb my experiment. When I evaluate yrf (Node (Node Leaf Leaf) Leaf) and compare, I can see which yrl occurrences have become something else, and those must correspond to the only thing which can have changed, yrf_l. Similarly, yrf (Node Leaf (Node Leaf Leaf)) tells me which yrl occurrences must have been yrf_r. If you were telling the truth about the simplicity of your program’s construction, I’ve checked its coincidence with mine on all inputs just by testing it on four.
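To make this concrete, here is a minimal runnable sketch. The choice of `mirror`-style fold is mine, purely for illustration, not from the post; the point is that both functions are built the same way, so agreement on the four probe trees decides agreement everywhere.

```haskell
-- Minimal sketch of the tree-testing principle. Swapping children is an
-- illustrative choice of fold; any fold whose leaf and node cases are
-- built purely from constructors would do.
data Tree = Leaf | Node Tree Tree deriving (Eq, Show)

cata :: t -> (t -> t -> t) -> Tree -> t
cata leaf node Leaf       = leaf
cata leaf node (Node l r) = node (cata leaf node l) (cata leaf node r)

-- "my" implementation
myf :: Tree -> Tree
myf = cata Leaf (\l r -> Node r l)

-- "your" implementation, promised to be built the same way
yrf :: Tree -> Tree
yrf = cata Leaf (flip Node)

-- the four probe inputs from the post
probes :: [Tree]
probes = [ Leaf
         , Node Leaf Leaf
         , Node (Node Leaf Leaf) Leaf
         , Node Leaf (Node Leaf Leaf) ]

main :: IO ()
main = print (and [ myf t == yrf t | t <- probes ])
```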

I’ve been fascinated with such results for a while now. This draft, A Polynomial Testing Principle, formalises the result that extensional equality for a language of polynomials and summation can be decided by testing just on the inputs 0..n, where n is an upper bound on the degree of the polynomials. Classic inductive identities such as (Sum{i=1..n} i)^2 = Sum{i=1..n} i^3 hold just by checking that both sides give 0, 1, 9, 36, 100. What’s fun about the proof is that it doesn’t step beyond the natural numbers. No subtractions, no fractions; I do the words and Agda does the actions. The method is to define the forward difference operator, *symbolically*, check that it reduces degree (which amounts merely to being well typed), then prove that it satisfies the specification

```
f n + diff f n = f (suc n)
```

for all n, which is an easy structural induction on the code for f. Now, if we know f 0 = g 0 and diff f agrees everywhere with diff g, then f must agree with g. So if f and g agree on {0..suc n}, then f 0 = g 0 and diff f agrees with diff g on {0..n}, hence everywhere by induction.
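A quick executable check of the sum-of-cubes example. Note one deliberate infidelity: the `diff` below is a *semantic* difference using subtraction, whereas the draft’s operator is symbolic and never leaves the naturals; this is just a sanity check, not the draft’s construction.

```haskell
-- Both sides of (Sum{i=1..n} i)^2 = Sum{i=1..n} i^3 have degree 4,
-- so agreement on {0..4} decides agreement everywhere.
f, g :: Integer -> Integer
f n = sum [1..n] ^ 2
g n = sum [ i ^ 3 | i <- [1..n] ]

-- a semantic forward difference; the draft's version is symbolic
-- and avoids subtraction altogether
diff :: (Integer -> Integer) -> Integer -> Integer
diff h n = h (n + 1) - h n

main :: IO ()
main = do
  print [ f n | n <- [0..4] ]                                 -- [0,1,9,36,100]
  print (and [ f n == g n | n <- [0..4] ])                    -- True
  print (and [ f n + diff f n == f (n + 1) | n <- [0..10] ])  -- the spec holds
```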

Thinking back to the result on trees, the argument has a similar character: deducing how a function depends on its input by making strategic perturbations to that input and computing differences. So, is there a more general theory of differential testing?

It’s funny, of course, because I’m used to having a partial evaluator handy. I can happily run programs on open terms. The essence of proof by induction is that you prove a result for all inputs by testing a cover of open constructor forms which allow just enough computation to reduce the problem for the whole inputs to the problem for their substructures. The challenge here is voluntarily to give up partial evaluation and achieve results of the same strength by ‘dark grey box’ testing. We don’t get to see the implementation, but we do get a bound on how weird it is, and from that we deduce not merely a test suite which has 100% code coverage, but which actually allows us to establish the absence of bugs. That’s got to be worth thinking about, right?

“Much Ado About Two” is where I became aware of such approaches: http://www.janis-voigtlaender.eu/papers/MuchAdoAboutTwo.pdf

An alternate way to think about this (I think?) is to say that functions of certain sorts are _completely defined_ by their action on a small number of “test values”. This is similar to, but slightly different from, your approach, which describes two functions that “agree on all results.” The difference makes it, perhaps, slightly tricky to see that this must be some sort of straightforward extension of all the classical results of algebraic geometry. After all, if this is about, effectively, a unique solution to a system of equations, then it must also be the case that we can discuss systems that permit multiple solutions, and thence into symmetries and groups!

One can also (and I wonder if I learned this from a comment of yours somewhere? It seems possible!) notice that, e.g., for your tree example you need the action at four points. Similarly, other structures would need the action at some other number of points. Now it seems to me that if we go through a few different structures and figure out how much testing each needs, this should reveal some numeric pattern which in a sense should come from some “characteristic polynomial” of our catamorphisms, and from there we could begin to crank out a general method for establishing such results for our standard inductive structures.

Forgive me for just making this more explicit — most of this is latent between your description in this post and the draft already 🙂

When I read your post, I thought to myself, “I’ve seen this sort of thing before in my complexity theory class”, and went to look at non-Turing-complete models of computation for examples. But the temptation to look at the structure of your representation to determine equivalence is very great, as you can see from how one goes about proving equivalence of regular languages.

So complexity theory is a red (pink?) herring: machine learning is the more apt comparison. Let us rephrase it so: the task you have set us is only slightly weaker than asking us to *learn* (reverse engineer?) the function, given a classification model (the bound on how weird the function is) and some samples (your test cases).

Now, what is giving me fits is that I only learned how to do PAC learning on boolean functions, and while considering just *classification* was a very useful assumption for machine learning researchers, it seems to make the problem harder for your purposes: you only need n+1 samples to uniquely determine a degree-n polynomial, but the VC dimension of a degree-n polynomial classifier is (n^2 + 3n + 2)/2. I suppose we can difference booleans, but it’s like trying to drink the ocean through a straw.
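The “n+1 samples determine a degree-n polynomial” claim can be made executable: repeated forward differencing of the values at 0..n recovers the polynomial’s Newton coefficients, from which it can be evaluated anywhere. The helper names below are mine, a sketch only (and, unlike the draft, it happily uses subtraction).

```haskell
-- n+1 samples at 0..n determine a degree-n polynomial: take successive
-- forward differences of the sample list and read off the coefficients.
diffs :: [Integer] -> [Integer]
diffs xs = zipWith (-) (tail xs) xs

-- first entries of the successive difference rows
newtonCoeffs :: [Integer] -> [Integer]
newtonCoeffs [] = []
newtonCoeffs xs = head xs : newtonCoeffs (diffs xs)

-- p(n) = sum of c_k * (n choose k): Newton's forward-difference formula
evalNewton :: [Integer] -> Integer -> Integer
evalNewton cs n = sum [ c * choose k | (c, k) <- zip cs [0..] ]
  where choose k = product [n - k + 1 .. n] `div` product [1 .. k]

main :: IO ()
main = do
  let samples = [ n ^ 2 | n <- [0..2] ]  -- three samples suffice for n^2
  print (newtonCoeffs samples)           -- [0,1,2]
  print [ evalNewton (newtonCoeffs samples) n | n <- [0..5] ]  -- [0,1,4,9,16,25]
```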

From your choice of words, I suspect you have already thought of what I am about to say.

Going from a finite set of tests to a universal claim basically requires making a continuity assumption about the program — we want to argue that the program makes small changes in its output in response to small changes in its input (for some chosen notion of small), and then argue that a finite set of tests rules out the possibility of the program going wrong, because that would imply a discontinuous wiggle.

The usual problem with this story is that discrete types (like Booleans) imply that one-bit flips can radically alter the behaviour of a program. But it strikes me that Marcelo Fiore and Alex Simpson may have an approach worth looking at. In their paper “Lambda Definability with Sums via Grothendieck Logical Relations”, they showed how to equip sums with a Grothendieck topology, basically formalizing the idea that you can cover a neutral term of sum type by considering whether it might give you an inl or an inr case. This might give you the formal geometric machinery you need to get off the ground…

Thanks for the pointer to Marcelo and Alex: that’s indeed the sort of equipment I’m likely to need. You’re also quite right to observe that the whole game is about identifying the continuity assumptions which determine the adequate testing strategy: in the case of polynomials in one variable, this more or less amounts to counting the wiggles.

In the discrete case, it’s going to involve more careful chopping up into regions, and we’ll certainly need to know which tests a program does (which is one reason to start with just case analysis, but we’ll never scale up unless we can handle “more than a million”). In this respect, “parametric” (or perhaps the even more restrictive notion of “synthetic”, i.e., just building with constructors, and maybe there’s some connection to the weak LF function space there) is the new “linear”. My result about folds on trees used testing to determine a non-parametric function (it distinguishes Leaf from Node), but whose branches are synthetic (hence testing by small perturbation to determine which synthesis is being done). I don’t know if we can go even two layers deep.

But it seems reasonable to me to consider a hierarchy of “investigativity” for the arguments of a function, a resource bounding the discontinuity with which they can be treated. You need this resource to test an input, and you use some of it up when you do. Inevitably, I begin to sniff a world system…

“discrete types (like Booleans) imply that one-bit flips can radically alter the behaviour of a program”

This seems precisely analogous to the problem posed when we move from asking for real solutions to polynomials to asking for, e.g., integer solutions. That is to say, it seems we should look for answers in modern developments regarding Diophantine equations, no?