Would you be able to share platform/performance numbers? I've actually been thinking about doing something like this but I'm coming from the opposite direction (hardware guy who's always wanted to learn more about 3D graphics).
Running at 640x480, 32 bits per pixel, with 4x multisample antialiasing at 60 frames per second, he could do triangle counts that "beat the first Voodoo cards" (I don't remember the exact figures). The Voodoo2 could push out 3 million triangles per second, which works out to around 50k per frame at 60 fps. His numbers were in the same ballpark, but a bit less than that.
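For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch using only the figures quoted above. The sample-rate line is my own assumption: it counts one coverage/depth sample per subsample with every pixel touched exactly once per frame (no overdraw), so treat it as a lower bound rather than a measured number.

```python
# Back-of-the-envelope budget from the figures quoted in the post.
WIDTH, HEIGHT = 640, 480
SAMPLES_PER_PIXEL = 4              # 4x multisample antialiasing
FPS = 60
VOODOO2_TRIS_PER_SEC = 3_000_000   # figure quoted in the post

# Triangles available per frame at the Voodoo2 rate.
tris_per_frame = VOODOO2_TRIS_PER_SEC // FPS

# Assumption: each pixel is covered once per frame (no overdraw).
samples_per_sec = WIDTH * HEIGHT * SAMPLES_PER_PIXEL * FPS

print(tris_per_frame)    # 50000
print(samples_per_sec)   # 73728000
```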
Output was a bit-banged VGA signal into a cheap-ass monitor.
Like the first Voodoo-era 3D accelerators, it was a triangle rasterizer with Z-buffering and perspective-correct texture mapping. I.e., no 3D transformations were done on the chip; they were done on the CPU. The limiting factor in the demos was actually the ARM CPU (synthesized on the FPGA), which couldn't push enough triangles to keep the GPU busy.
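To make the two per-pixel operations concrete, here is a minimal software sketch of perspective-correct texture coordinate interpolation and a Z-buffer test. This is not how his hardware did it (I don't know the details); the function names, the "smaller z is nearer" convention, and the data layout are all assumptions for illustration.

```python
def perspective_correct_uv(bary, inv_w, uv):
    """Interpolate (u, v) with perspective correction.

    bary:  screen-space barycentric weights of the pixel (sum to 1)
    inv_w: 1/w for each of the three vertices
    uv:    (u, v) texture coordinates for each vertex
    """
    # 1/w, u/w and v/w vary linearly in screen space, so interpolate
    # those and divide at the end.
    one_over_w = sum(b * iw for b, iw in zip(bary, inv_w))
    u_over_w = sum(b * iw * t[0] for b, iw, t in zip(bary, inv_w, uv))
    v_over_w = sum(b * iw * t[1] for b, iw, t in zip(bary, inv_w, uv))
    return u_over_w / one_over_w, v_over_w / one_over_w


def depth_test_and_write(zbuf, x, y, z):
    """Z-buffer test, assuming smaller z means nearer to the camera.

    Stores z and returns True if the fragment is visible.
    """
    if z < zbuf[y][x]:
        zbuf[y][x] = z
        return True
    return False
```

Halfway along an edge between a vertex at w=1 and one at w=2, plain linear interpolation would give u=0.5, while the perspective-correct result is pulled toward the nearer vertex (u=1/3 in that case).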
It was a tile-based rasterizer (in two stages: coarse and fine) rather than a scanline rasterizer (like the software rasterizers of the Quake era).
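A rough software sketch of the coarse/fine idea, using edge functions: the coarse stage throws away whole tiles that lie entirely outside one edge of the triangle, and the fine stage tests individual pixel centers in the surviving tiles. The tile size, function names, and the choice of edge functions are my assumptions; the actual hardware may well have been organized differently.

```python
TILE = 8  # hypothetical tile size in pixels


def edge(ax, ay, bx, by, px, py):
    """Signed edge function: >= 0 when p is on the inner side of a->b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height, tile=TILE):
    """Return (set of covered pixels, number of tiles that reached the fine stage)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    # Normalize winding so that "inside" means all edge values >= 0.
    if edge(x0, y0, x1, y1, x2, y2) < 0:
        (x1, y1), (x2, y2) = (x2, y2), (x1, y1)
    edges = [(x0, y0, x1, y1), (x1, y1, x2, y2), (x2, y2, x0, y0)]

    covered = set()
    tiles_visited = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Coarse stage: an edge function is linear, so if all four
            # tile corners are outside one edge, the whole tile is.
            corners = [(tx, ty), (tx + tile, ty),
                       (tx, ty + tile), (tx + tile, ty + tile)]
            if any(all(edge(*e, cx, cy) < 0 for cx, cy in corners)
                   for e in edges):
                continue
            tiles_visited += 1
            # Fine stage: test each pixel center in the tile.
            for py in range(ty, min(ty + tile, height)):
                for px in range(tx, min(tx + tile, width)):
                    if all(edge(*e, px + 0.5, py + 0.5) >= 0 for e in edges):
                        covered.add((px, py))
    return covered, tiles_visited
```

In hardware the appeal is that the coarse test is cheap and the fine stage can evaluate many pixels of a tile in parallel, whereas a scanline rasterizer walks spans one row at a time.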
That's very helpful, thanks. The company I work for doesn't do any 3D graphics, so I should be able to open-source my implementation (if and when it gets done).