I suppose it does work for interpreted languages (you just need to define what counts as success and what counts as failure), but for AFL the evaluation might be so far away from the actual branching that it becomes much less effective, possibly critically so. Additionally, the bottleneck in fuzzing is running the program millions of times (though maybe not billions), so the slower your evaluation function is, the slower the fuzzing will be.
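To put rough numbers on that (the per-execution times below are made up, just to show the scaling):

    # Back-of-envelope: fuzzing throughput scales directly with the cost of one
    # execution, so a slow evaluation/oracle function multiplies into wall-clock time.
    EXECUTIONS = 1_000_000  # AFL campaigns routinely reach millions of runs

    for per_exec_ms in (0.1, 1.0, 100.0):
        minutes = EXECUTIONS * per_exec_ms / 1000 / 60
        print(f"{per_exec_ms:>5} ms per run -> ~{minutes:,.0f} minutes for {EXECUTIONS:,} runs")

A fast native target sits at the cheap end of that range; a heavyweight evaluation function pushes you toward the expensive end.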
I think the same approach could be used with interpreters, though, and I expect it would be easier to do there. E.g. for Python there is python-afl, but I haven't tried it: https://github.com/jwilk/python-afl
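For reference, a harness for it apparently looks roughly like this (adapted from the project's README; mymodule.parse is a placeholder for whatever you actually want to fuzz, and I haven't run this myself):

    import sys
    import afl  # provided by python-afl

    import mymodule  # placeholder for the code under test

    afl.init()  # start AFL's fork server here, after the expensive imports
    try:
        mymodule.parse(sys.stdin.buffer.read())  # fuzzed input arrives on stdin
    except ValueError:
        pass  # handled/expected errors; any other uncaught exception is reported as a crash

and then you run the script under py-afl-fuzz with the usual afl-fuzz input/output directories (check the README for the exact invocation).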
This just has the ring of famous last words to me. There are many errors that would pass this test, and catching edge cases in arbitrary code is not easy.
Makes me wonder how fuzzers do it. Just random data? How guided is it?