
I'd argue there's a qualitative difference in approach between testing, a procedural approximation to a proof of absence of bugs by exhaustion, and the more declarative (more "formal") method of directly proving things about a model. Might be splitting hairs, though.

It reminds me of the question about whether programming is math. Depends on what you think you're doing. Are you doing things with state to get results, or building a system of facts that results in the answer?



> and the more declarative (more "formal") method of directly proving things about a model

"Formal" simply means mechanical so neither is more formal than the other. There is, however, a qualitative and rather easily delineated difference between deductive methods that operate by applying deductive inference rules for some language and methods that operate on the model of the language (the universe of values, more or less). Neither is fundamentally more rigorous than the other, though. Remember, a test (assuming no nondeterminism) is a complete proof of the proposition it states, it's just that it can state only weak propositions, i.e. typically without quantifiers (forall, exists), and certainly without nested quantifiers.

For propositions with quantifiers, deductive and model-based approaches can differ in cost. Usually, though not always, deductive methods are more expensive, so they're used less. Model-based methods are less expensive because they can more easily reduce their "confidence". Complexity theory tells us that knowing the answer to some question has some minimal cost, regardless of how the answer is learned. Model-based methods allow us to not really know the answer but increase our confidence in what it is, so their cost is more flexible.


I was struggling to find the word, but I think I do mean "declarative." It's also deductive, but I'm looking at it from the point of view of what I'd have to do if I were proving something. I'm on board with testing being formal as in algorithmic.

Seems like there's domain-specific terminology (almost wrote "language"!) I'm using accidentally or incorrectly and making a mess. I might have stepped in it with "model." Mainly I just think I see a difference between ways of verifying: one where the work is in the doing, and one where it's in the being. But it sounds like I don't know enough to have this conversation.

Thanks for humoring me, though.


It's important to separate how a statement about behaviour is made (e.g. "the program never loses data") from the mechanisms we use to gain confidence that the statement is true, up to and including proof (i.e. 100% confidence). Such statements can certainly be said to be declarative (as they cannot be executed), but that's separate from how we gain confidence in their validity.

To get a sense of why software correctness is often more art than science, consider that an abstract algorithm can be proven correct but real software cannot because a software-based system is ultimately a physical system, and it is not possible to prove that an addition command given to a physical computer will always result in correct addition, even though that is true with high probability. Since confidence can never be 100%, and since many practices -- from testing through formal methods -- have been generally effective at producing software we can trust with our lives, the game is always about balancing confidence with cost.

I, for one, believe that TLA+ (and similar tools) can be used in quite a few situations to gain a required level of confidence more cost effectively than other methods, but there is no one right answer to software correctness, and the trick is learning when to apply different methods.


> Remember, a test (assuming no nondeterminism) is a complete proof of the proposition it states; it's just that it can state only weak propositions, i.e. typically without quantifiers (forall, exists), and certainly without nested quantifiers.

Typical unit tests (what fans of property based testing would call "example based tests") are existentially quantified propositions, e.g. if we squint at the test suite for a 'sort' function, we'll see propositions like "there exists a list l, such that sort(l) is sorted"; "there exists a list l, such that sort(sort(l)) equals sort(l)", "there exists a list l, such that sort(l) is a permutation of l", etc.

To make those executable, their proofs have to be constructive(/intuitionistic); e.g. we can't rely on double-negatives like "if sort(l) were not sorted it would lead to the following contradiction...". In constructive logic, the only way to prove an existential proposition is to construct a particular example and prove that this example satisfies the proposition. In a unit test, the example is our test data, the proposition is our assertion, and the proof comes from executing the test body to verify that the assertion is 'true'.

Thinking about test suites in this way can clarify our thoughts, which can lead us quite easily towards property-based testing and even inductive theorem proving. For example, consider a test like:

    def test_sort_ignores_initial_order():
        assert sort([3, 1, 2]) == sort([2, 1, 3])
This is fine, but we can do better. In particular, those literals have been given as-is, without any indication of their relationship (if any), so anyone reading this test (e.g. after breaking it) has to guess where they came from. Not good for a specification! In this case it's pretty clear, but I've seen many tests in the RealWorld™ which look like `assert(foo(x) == y)` with very complicated literals for `x` and `y`, where it appears the author just copy-pasted the function's output as `y`, and there's no clear relationship between the two (is the `7` in that dictionary the same as this `7` over here, or is that a coincidence? Oh, it was the sum of that `19` with the `-12` from there...).
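
As a hypothetical illustration of that anti-pattern (`summarise` and its literals are invented here):

    def test_summarise():
        data = {"a": 19, "b": -12}
        # Where did the 7 come from? The reader has to reverse-engineer
        # that it's 19 + (-12); the test itself gives no hint.
        assert summarise(data) == {"total": 7, "count": 2}

Back to the sort test.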

Let's make this clearer to readers by stating the relationship:

    def test_sort_ignores_initial_order():
        l = [3, 1, 2]
        assert sort(l) == sort(reverse(l))
Now this looks more like an existentially-quantified proposition: for some list, there is some permutation of it that gets the same output from `sort`. We prove this proposition, using the list `[3, 1, 2]` and the permutation `reverse`.

Such statements are quite weak, but it's quite easy to strengthen them into the universally-quantified propositions that we actually care about, by abstracting-out hard-coded details that don't matter:

    def test_sort_ignores_initial_order(l: list[int]):
        assert sort(l) == sort(reverse(l))
Such abstraction goes beyond simple data values too, e.g. with a little effort we can abstract over the choice of permutation:

    def test_sort_ignores_initial_order(l: list[int], permutation: int):
        assert sort(l) == sort(permute(l, seed=permutation))
Unfortunately we can't prove universally-quantified statements just by executing their function body, but there are many paths open to us, with different costs/benefits:

- Recover the old existential test, by running it with the examples as arguments. Hypothesis makes this easy with its `@example` decorator (see the sketch after this list).

- Search for counterexamples, to try and prove ourselves wrong. This is property-based testing, AKA property checking, and supported by many test frameworks (Hypothesis, QuickCheck, etc.); there are different approaches to the search problem, some integrate with fuzzers, etc.

- Run a symbolic execution engine to try and prove it correct. This would be dependent on the language, and may require translation to some external formalism; it may also require a lot of manual effort to guide the tooling; and there's no guarantee it would actually work.

- Use a theorem prover to try and prove it correct. Again, this depends on the language and may require translation. First-order logics can be heavily automated, but struggle with some common patterns (e.g. those which require induction). Higher order logics are more expressive, but the tooling is less automated.
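
As a minimal sketch of the first two options, assuming Hypothesis, pytest, and stand-ins for the `sort` and `reverse` discussed above:

    from hypothesis import example, given
    from hypothesis import strategies as st

    def sort(l):
        return sorted(l)   # stand-in implementation, just for the sketch

    def reverse(l):
        return l[::-1]

    @given(st.lists(st.integers()))  # search for counterexamples
    @example([3, 1, 2])              # always replay the old hard-coded example
    def test_sort_ignores_initial_order(l):
        assert sort(l) == sort(reverse(l))

Run under pytest, this both recovers the old existential test (via `@example`) and property-checks the universal claim on generated lists.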

There's obviously a spectrum of outcomes, which depends on the effort required. Whilst most wouldn't recoup the cost of an inductive proof, I think property-based testing is very often worth it, and even symbolic execution could be practical in some everyday business situations.


> Typical unit tests are existentially quantified propositions

That isn't quite right, though I know it's become a meme of sorts. An example, such as an automated test, can serve as a proof of an existentially quantified proposition but isn't itself one. For example, 3 > 2 is a proof of `∃x∈Int : x > 2` (because P(C) ⊢ ∃xP(x)) but isn't itself an existentially quantified proposition.

We can see that because if unit tests could express existentially quantified propositions then they could also express universally quantified ones by negation, which is available to tests, because existential and universal quantifiers are equally powerful, and can each be expressed in terms of the other as `(∀x : P(x)) ≣ (¬∃x : ¬P(x))` (we don't even need to resort to classical axioms because all domains are typically finite). [1]

Of course, unit tests could in principle express existentially quantified formulas (and consequently also universally quantified ones) through exhaustive loops over the domain, though the proof -- i.e. the execution in this case -- is not tractable in practice.

> Unfortunately we can't prove universally-quantified statements just by executing their function body, but there are many paths open to us, with different costs/benefits

Right, but there's another fully automated method that in some situations can serve as a full proof, though not a deductive one: model checking (indeed, there's a tool that offers it for TLA+). Contrary to some beliefs, model checking can efficiently and exhaustively check infinite domains without symbolic techniques in some cases (though certainly not all, as that would violate computational complexity theorems; also, some model checkers use symbolic execution). For example, TLC, a concrete-state (i.e. non-symbolic) model checker for TLA+, can fully prove, in under a second, some propositions quantified over an uncountably infinite number of executions, each of infinite length.

Of course, many interesting properties involve nested interleaved quantifiers (i.e. interleaved ∀ and ∃), and such propositions are strictly harder to prove (or be convinced of) than propositions with only a single quantifier.

Also, note that the correctness of a software system, unlike that of an abstract algorithm, can never actually be proven, as it is a physical system for which there is no notion of proof. Even a complete proof (deductive or model-checked) provides confidence only up to the belief that the hardware will behave as expected, something that can only be true with high probability. That is, indeed, why balancing cost and confidence is important, since complete confidence is never possible anyway.

[1]: Indeed, in some logics, both existential and universal quantifiers are not primitive but are defined in terms of another operator, epsilon (https://en.wikipedia.org/wiki/Epsilon_calculus), which also exists in TLA+ and other formal logic languages.


> An example, such as an automated test, can serve as a proof of an existentially quantified proposition but isn't itself one. For example, 3 > 2 is a proof of `∃x∈Int : x > 2` (because P(C) ⊢ ∃xP(x)) but isn't itself an existentially quantified proposition.

Yeah, I hand-waved this a bit when I said "if we squint at the test suite for a 'sort' function, we'll see propositions like...". I don't think it's too far off though, since we can (and, as I argued, should) tease apart the data from the assertion, to get something like `let x = 3; assert(x > 2)`; and that's equivalent to `(lambda x: assert(x > 2))(3)` (which old-school JavaScript devs may remember as "IIFE", before JS had 'let').

> We can see that because if unit tests could express existentially quantified propositions then they could also express universally quantified ones by negation, which is available to tests, because existential and universal quantifiers are equally powerful, and can each be expressed in terms of the other as `(∀x : P(x)) ≣ (¬∃x : ¬P(x))` (we don't even need to resort to classical axioms because all domains are typically finite).

Hmm, I hadn't thought about that. If I try stretching my constructive-logic argument a bit, we could implement negation `¬A` as a function `A -> Void`. Rewriting your `¬P(x)` would give `¬∃x : (P(x) -> Void)`, and rewriting the remaining negation gives `(∃x : (P(x) -> Void)) -> Void`. In a constructive setting we can take `∃x : (P(x) -> Void)` to be a pair containing `x` and a function `f : P(x) -> Void`, and hence to implement this in general we need to call that function `f` (since it's the only way to construct `Void`), so we need to construct an argument for it, which requires constructing `P(x)` for general `x`, and hence requires a proof of `P(x)` that holds for all `x` (i.e. universally quantified).
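
As a sanity check on that derivation, the direction that does hold constructively, "a universal proof of P refutes any witness of ¬P", is easy to machine-check. A minimal Lean 4 sketch (the theorem name is invented):

    -- Given h : ∀ x, P x, any claimed witness ⟨x, hnp⟩ of ∃ x, ¬ P x
    -- is absurd: apply the refutation hnp to the instance h x.
    theorem forall_implies_not_exists_not {α : Type} {P : α → Prop}
        (h : ∀ x, P x) : ¬ ∃ x, ¬ P x :=
      fun ⟨x, hnp⟩ => hnp (h x)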

I'm not sure of my intuitions around this. I got the above by "following the types", but test suites rely on runtime execution, which is inherently untyped (e.g. every program in a typed lambda calculus also exists in untyped lambda calculus if we erase the types (and translate type-adjacent features appropriately, like dictionary-passing for type classes)). From an untyped perspective, the idea of `Void` becomes dubious; it feels like `P(x) -> Void` throws a `Success` exception, and that's the only way for a test to pass (returning normally is assumed to be failure, and `Success` is encapsulated so we can't throw it directly). That feels very strange to me, but worth thinking about some more...
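
To make that "Success exception" reading concrete, a toy harness might look like this (all names hypothetical):

    class _Success(Exception):
        """Encapsulated: test bodies never raise this directly."""

    def _void(proof):
        # The stand-in for `P(x) -> Void`: handing over a "proof"
        # is the only way to escape, i.e. the only way to pass.
        raise _Success(proof)

    def run_refutation_test(body):
        try:
            body(_void)    # the body must call _void(...) to pass
        except _Success:
            return True    # escaped: the test passes
        return False       # returned normally: the test fails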


> don't think it's too far off though, since we can (and, as I argued, should) tease apart the data from the assertion, to get something like `let x = 3; assert(x > 2)`; and that's equivalent to `(lambda x: assert(x > 2))(3)`

Right, but even that is an unquantified proposition.

> If I try stretching my constructive-logic argument a bit, we could implement negation `¬A` as a function `A -> Void`

There's no need for constructive logic specifically (a test presumes that it runs in some bounded time, which means we're operating on finite domains). Tests can invert both their assertions and their result, so if they were existentially quantified they could also have been universally quantified.

Indeed, tests can express (though not effectively execute) an existential quantifier (∃xP(x)) over finite domains (though not infinite ones):

    for x in Domain:
        if P(x):
            return True   # test success
    return False          # test failure
And that can easily be turned into a universal quantifier through negation:

    for x in Domain:
        if not P(x):
            return False  # counterexample found: test failure
    return True           # test success
This is exactly the transformation (∀x : P(x)) ≣ (¬∃x : ¬P(x)): the inner predicate is negated, and then so is the whole expression.

> but test suites rely on runtime execution, which is inherently untyped

Types are a syntactic concept and they are, therefore, a property of a language; a language can be typed or untyped (types serve as part of the rules for determining whether a string is in the language or not). There are several ways to describe execution, but usually it isn't described as a language, so it is no more typed or untyped than it is yellow or not yellow.



