First, someone has to develop those models, and that's currently being done with VC backing. Second, running those models is still not profitable, even if you self-host (which has to be true, since every model is ultimately self-hosted by someone).
Burning VC money isn't a long-term business model, and unless your business is somehow both profitable on Llama 8B (or some similarly small model) _and_ your secret sauce can't be easily duplicated, you're in for a rough ride.
The only moat separating AI startups at this point is access to the best models, and that depends on being able to run unprofitable models on someone else's money.
Investing in a startup that's basically just a clever prompt is gambling on first-mover advantage, because that's the only advantage it can have.
Neural networks are highly parallelizable. If I scale up my AI service to handle double the number of users by buying twice as many GPUs, it is theoretically possible to also serve each user in half the time.
To do so, you need to split the matrix multiplies across the new machines. You also need more inter-machine network bandwidth, but with GPT-3 that works out to roughly 48 kilobytes per predicted token, gathered from every processing node and broadcast back to every processing node. Even if Bard is 100x as big, that is still very doable with datacenter-scale networking.
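A rough sketch of where that figure could come from (my assumption: 48 KB is GPT-3's hidden state of 12,288 floats in fp32), plus a toy illustration of splitting one layer's matrix multiply column-wise across two "devices":

```python
import numpy as np

# Back-of-the-envelope: bytes exchanged per predicted token, assuming the
# 48 KB figure is GPT-3's hidden state (d_model = 12288) stored in fp32.
D_MODEL_GPT3 = 12288
BYTES_PER_FP32 = 4
print(D_MODEL_GPT3 * BYTES_PER_FP32 / 1024, "KiB per token")  # -> 48.0

# Toy tensor parallelism with a shrunken hidden size so the demo runs fast:
# split a layer's weight matrix column-wise across two "devices", multiply
# each shard separately, then gather the halves. Each device only has to
# ship its slice of the output to the others.
d_model = 1024
rng = np.random.default_rng(0)
x = rng.standard_normal((1, d_model), dtype=np.float32)           # one token's activations
W = rng.standard_normal((d_model, 4 * d_model), dtype=np.float32)  # one feed-forward weight

W_dev0, W_dev1 = np.hsplit(W, 2)   # each device holds half the columns
y_dev0 = x @ W_dev0                # computed on device 0
y_dev1 = x @ W_dev1                # computed on device 1, ideally in parallel
y = np.concatenate([y_dev0, y_dev1], axis=1)   # gather the shards

assert np.allclose(y, x @ W, atol=1e-3)        # same result as the unsplit multiply
```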
However, OpenAI doesn't seem to have done this; I suspect an individual request is simply routed to one of n machine clusters. As they scale up, they just increase n, which gives no latency benefit for individual requests.
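A toy comparison of the two scaling strategies, with made-up numbers purely to show the throughput-vs-latency difference:

```python
# Hypothetical numbers: one cluster serves a request in 10 seconds.
# Adding replicas multiplies throughput but leaves per-request latency alone;
# splitting each request's work across n machines can, ideally, cut latency by n.
base_latency_s = 10.0

for n in (1, 2, 4, 8):
    replica_latency = base_latency_s       # request still runs on a single cluster
    replica_throughput = n                 # n clusters -> n requests in flight
    parallel_latency = base_latency_s / n  # ideal model-parallel speedup
    print(f"n={n}: replicas -> {replica_latency:.1f}s/request, "
          f"{replica_throughput}x throughput; "
          f"model-parallel -> {parallel_latency:.2f}s/request")
```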