> APL languages don't benefit much from compilation
Sorry, but this is nonsense. APL tends to spend all its time waiting for memory, because it fuses nought; a concrete sketch of what that costs follows the list below. And it benefits as much as anybody else from things like common subexpression elimination and loop-invariant code motion. APL implementations appear to be fast because of a few confounding factors:
1. C compilers kind of suck, and C kind of sucks for writing fast code, yet it is the de facto standard, so it is what things are compared to.
2. Developers of APL implementations care about and prioritise performance.
3. In comparison with languages such as Python, and in particular their popular implementations, APL spends very little time on dispatch.
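To make the fusion point concrete, here's a rough C sketch (my own illustration, not any particular implementation's code) of what evaluating a+b×c costs when each primitive is a separate whole-array pass:

    #include <stddef.h>

    /* Unfused, the way an interpreter that fuses nought runs a+b*c:
       two whole-array passes and a materialised temporary t. */
    void unfused(double *r, double *t, const double *a,
                 const double *b, const double *c, size_t n) {
        for (size_t i = 0; i < n; i++) t[i] = b[i] * c[i]; /* write n doubles */
        for (size_t i = 0; i < n; i++) r[i] = a[i] + t[i]; /* read them back */
    }

    /* Fused, as a compiler could emit: one pass, no temporary, about
       two thirds of the memory traffic (the gap widens as the
       expression grows). */
    void fused(double *r, const double *a, const double *b,
               const double *c, size_t n) {
        for (size_t i = 0; i < n; i++) r[i] = a[i] + b[i] * c[i];
    }

Past cache sizes the unfused version is bound by the extra traffic, not by the arithmetic.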
What kinds of numbers do you expect here? It's really only arithmetic that's memory-bound, even something like a scan or compress/filter does a fair amount of CPU work and wouldn't benefit that much from fusion (if you even can fuse compress). And with fewer loops I don't think loop optimization is as relevant. I think a factor of 2 is the median kind of improvement I'd expect from a compiled SIMD APL on array programs? Which is a long way from the 10x or so you get by adding a JIT to other dynamic interpreted languages, and likely harder. I would describe that as APL not benefitting much from compilation; maybe you wouldn't.
I think a factor of 2 is quite significant, but I also think you may understate the advantages of compilation. I do agree the benefit is likely to be smaller than for a language like Python.
Regarding scans (filter is a type of scan, albeit easier to implement than the general case), there is in general an annoying work-span discrepancy, which bodes ill for contemporary computers with their finite parallelism. I would buffer heavily here, but fusing the scan with its input allows for the use of a work-efficient implementation, spending all latent parallelism on generating more inputs.
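To illustrate (a C sketch, assuming for concreteness that the input is an elementwise sqrt): the additions of a plus-scan form a serial dependency chain no matter what, but the inputs to later elements are independent of the running sum, so a fused loop lets the out-of-order machinery generate them while the chain drains:

    #include <math.h>
    #include <stddef.h>

    /* Unfused: materialise sqrt(x) with a full pass, then scan it.
       The scan pass is a pure serial chain; parallelism sits idle. */
    void scan_unfused(double *out, double *tmp, const double *x, size_t n) {
        for (size_t i = 0; i < n; i++) tmp[i] = sqrt(x[i]);
        double acc = 0;
        for (size_t i = 0; i < n; i++) out[i] = acc += tmp[i];
    }

    /* Fused: one work-efficient pass. sqrt(x[i+1]) does not depend on
       acc, so the hardware can start future inputs while the chain of
       additions drains. */
    void scan_fused(double *out, const double *x, size_t n) {
        double acc = 0;
        for (size_t i = 0; i < n; i++) out[i] = acc += sqrt(x[i]);
    }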
When I speak of loops, I am including any usage of rank (incl. implicit); in this respect, APL is chock-full of loops. Some loop optimisations may be obviated, of course, because of referential transparency, but others arise in their place. Take for instance some recent work[0] done on Futhark: rather than parallelise an already in-place algorithm, they had to in-place an already parallel algorithm!
Two more points:
A problem may be bound by memory latency. I spent some time tuning J's I. recently, and it does many searches in parallel (4 for a small search space, 12 for a large one, iirc), but it is still bound by latency. If I could fetch a pivot corresponding to the first cell of y, and then generate the next cell while I wait, I would make better use of resources.
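Roughly, the interleaving looks like this in C (an illustrative branchless sketch; the constants and layout are mine, not what J actually does):

    #include <stddef.h>

    #define K 4   /* searches in flight; cf. the 4 or 12 mentioned above */

    /* Branchless lower-bound-style search: lo[k] ends as the largest
       index with x[lo[k]] <= y[j+k] (0 also when no element qualifies). */
    void search_many(size_t *out, const double *x, size_t nx,
                     const double *y, size_t ny) {
        for (size_t j = 0; j + K <= ny; j += K) {
            size_t lo[K] = {0}, len[K];
            for (int k = 0; k < K; k++) len[k] = nx;
            while (len[0] > 1) {              /* lens shrink in lockstep */
                for (int k = 0; k < K; k++) { /* K misses overlap here */
                    size_t half = len[k] / 2;
                    if (x[lo[k] + half] <= y[j + k]) lo[k] += half;
                    len[k] -= half;
                }
            }
            for (int k = 0; k < K; k++) out[j + k] = lo[k];
        }
        /* tail when ny is not a multiple of K omitted for brevity */
    }

Each trip around the while loop issues K independent loads, so K cache misses are outstanding instead of one.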
A compiler results in more transparent performance characteristics. When I target an interpreter, I must write my idioms and special combinations in exactly the way it expects, else nothing will happen; a compiler will care much less about exact phrasing. Similarly, as I mentioned, you get LICM and CSE for free, rather than needing to do them by hand.
Feels like a pretty idyllic view of compilers? I guess I agree that there's more than a factor of 2 in instruction latency versus throughput, but I don't think a compiler can use the entire gap. For a quick check, I measured 1.82 instructions per cycle to compile ~0.3MB of source in BQN (the compiler's a very array-oriented program, for those unaware) and 1.38 IPC on ~1.7MB where caching's worse. The maximum is probably somewhat less than 4 since not all instructions fit 4 to a cycle; on the other hand, BQN primitives are probably already doing some things that are throughput-inefficient in order to use available parallelism better. Fused compiler output definitely wouldn't include as many loop counter and load/store instructions.
Some of the problems you mention, especially special combinations, can be addressed by a bytecode compiler that ultimately defers to an interpreted VM.
The binary search interleaving seems like exactly the kind of thing that APL interpreters can do to close the gap with compilers. "generate the next cell while I wait" sounds difficult in general: instruction reordering only goes so far, so if the binary search loop needs many iterations, then only part of it can overlap with other code (barring some compiler intervention that I'd categorize as fanciful if the loop length isn't known). That other work could also be polluting the cache.
Your binary search work is definitely interesting; I found the code and will be studying it as this is something I've been wanting to address better in BQN. Mind if I email about this? I implemented a different solution to high latencies for large searches in Dyalog, which is to partition the array y (searched-for values) by a single pivot from x and recurse. See the last paragraph at https://mlochbaum.github.io/BQN/implementation/primitive/sor....
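In rough C, the idea is something like this (my own rendering for illustration, not Dyalog's actual code; a real version reuses one buffer and switches to plain searches once a partition fits in cache):

    #include <stddef.h>
    #include <stdlib.h>

    /* idx holds positions into y; out[idx[i]] receives the interval
       index for y[idx[i]]. Initial call: rec(out, x, 0, nx, y, idx, ny)
       with idx = {0, 1, ..., ny-1}. */
    static void rec(size_t *out, const double *x, size_t xlo, size_t xhi,
                    const double *y, size_t *idx, size_t n) {
        if (xhi - xlo <= 1) {               /* one candidate interval left */
            for (size_t i = 0; i < n; i++) out[idx[i]] = xlo;
            return;
        }
        size_t mid = xlo + (xhi - xlo) / 2;
        double pivot = x[mid];
        size_t *tmp = malloc(n * sizeof *tmp);
        size_t lo = 0, hi = 0;
        for (size_t i = 0; i < n; i++)      /* stable split around pivot */
            if (y[idx[i]] < pivot) idx[lo++] = idx[i];
            else tmp[hi++] = idx[i];
        for (size_t i = 0; i < hi; i++) idx[lo + i] = tmp[i];
        free(tmp);
        rec(out, x, xlo, mid, y, idx, lo);       /* values below the pivot */
        rec(out, x, mid, xhi, y, idx + lo, hi);  /* values >= the pivot */
    }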
> For a quick check, I measured 1.82 instructions per cycle to compile ~0.3MB of source in BQN (the compiler's a very array-oriented program, for those unaware) and 1.38 IPC on ~1.7MB where caching's worse
It's not really clear to me what this is measuring, and whether that thing is interesting. I care about throughput as measured in widgets per unit time, not instructions per unit time; the contribution of a compiler may lie not only in instructions per unit time, but also in instructions per widget.
> Some of the problems you mention, especially special combinations, can be addressed by a bytecode compiler that ultimately defers to an interpreted VM.
Sure. But aside from implementation effort, that seems monotonically worse than a compiler targeting machine code.
> if the binary search loop needs many iterations, then only part of it can overlap with other code (barring some compiler intervention that I'd categorize as fanciful if the loop length isn't known)
What interventions? FWIW I think the main limitation is (architectural) registers (plus ROB/rename, of course), but you can balance around that by putting multiple _dependent_ search iterations in each lane, and leaving enough time for them all to finish.
Also: I think it's probably good to have at least a rough idea of how many iterations things are going to take, and am vaguely planning to specialise on order of magnitude of each dimension (plus exact value when small). Knowing how big things are also gives you a fairly good idea of what is going to pollute cache (conservatively assume that random accesses are uniformly distributed) and by how much.
> Mind if I email about this?
Feel free! elronnd@elronnd.net. FWIW I think the gather hammer, where supported, is probably ideal, and it leads to much more straightforward code. I benchmarked it and it lost, but I was on AMD hardware at the time, which is quite abysmal in that respect; Intel seems to be much better, especially with AVX-512.
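For the curious, a gather-based version looks roughly like this with AVX-512 intrinsics (an illustrative sketch assuming 32-bit keys and exactly 16 searched-for values; compile with -mavx512f):

    #include <immintrin.h>

    /* Sixteen branchless searches advance per gather; same invariant as
       the scalar version: lo ends as the largest index with x[lo] <= y. */
    void search16(int *out, const int *x, int nx, const int *y) {
        __m512i lo = _mm512_setzero_si512();
        __m512i yv = _mm512_loadu_si512(y);
        for (int len = nx; len > 1; len -= len / 2) {
            __m512i half  = _mm512_set1_epi32(len / 2);
            __m512i probe = _mm512_add_epi32(lo, half);
            __m512i keys  = _mm512_i32gather_epi32(probe, x, 4);
            /* lanes where x[lo+half] <= y move their bound up */
            __mmask16 m = _mm512_cmp_epi32_mask(keys, yv, _MM_CMPINT_LE);
            lo = _mm512_mask_add_epi32(lo, m, lo, half);
        }
        _mm512_storeu_si512(out, lo);
    }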
> I implemented a different solution to high latencies for large searches in Dyalog, which is to partition the array y (searched-for values) by a single pivot from x and recurse.
I figured it was doing something like that. I was thinking about implementing something similar, but vaguely hoped that if the buffers were large enough and I could keep them filled, it would be possible to avoid the overhead of partitioning and of scattering results. There is also the option, depending on relative sizes, of preprocessing x, producing e.g. a B-tree or an Eytzinger layout. Fairly low priority, so I've not yet looked too closely.
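For reference, the Eytzinger layout in question is the standard construction (sketched in C; e is 1-based with n+1 slots, and search returns the Eytzinger position of the first element >= v, or 0 if all are smaller):

    #include <stddef.h>

    /* In-order fill of the breadth-first (Eytzinger) layout: children
       of node k live at 2k and 2k+1. Call build(e, x, n, 0, 1). */
    static size_t build(double *e, const double *x, size_t n,
                        size_t i, size_t k) {
        if (k <= n) {
            i = build(e, x, n, i, 2 * k);      /* left subtree */
            e[k] = x[i++];                     /* this node */
            i = build(e, x, n, i, 2 * k + 1);  /* right subtree */
        }
        return i;
    }

    /* The descent's addresses depend only on comparison outcomes, and
       the top levels stay resident in cache, so it prefetches far
       better than probing middles of a large sorted array. */
    size_t search(const double *e, size_t n, double v) {
        size_t k = 1;
        while (k <= n) k = 2 * k + (e[k] < v);
        return k >> __builtin_ffsll(~(long long)k); /* cancel final right turns */
    }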