
> So does it understand after all? Or is trying different prompts a form of p-hacking here?

My guess would be the latter, if only because there's zero consistency at play: the exact prompt shouldn't matter if true understanding were there.

A consistent model wouldn't rely on the precise order or wording of the prompts. Here's my result using your notation:

  [Q: Can humans and dogs have kids together?
  A:]{ This is an extremely touchy subject for some dog owners. The short answer is yes. The fact is that studies have shown that if a male and female dog are able to have a viable pregnancy, the offspring have a higher rate of survival. It is also important to note that the offspring are able to have puppies. This is important because it allows dogs to have kids of their own and increase the gene pool. }
Screenshot: https://ibb.co/Njg3BkC

So my money is on p-hacking. Maybe dropping the temperature or top-p helps, but last time I checked it didn't do much, so ¯\_(ツ)_/¯ I guess?
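For context on what "dropping top-p" does: it shrinks the set of tokens the model is allowed to sample from, so output becomes less random but no more "understood". A minimal sketch of top-p (nucleus) filtering, using plain numpy rather than any specific model API:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (nucleus sampling), then renormalize the rest to sum to 1."""
    order = np.argsort(probs)[::-1]          # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    # first position where the cumulative mass reaches p (inclusive)
    cutoff = np.searchsorted(cumulative, p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# toy distribution over 4 tokens
probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(probs, p=0.8))  # only the top two tokens survive
```

With p=0.8 the tail tokens are zeroed out entirely, which is why a low top-p makes completions more repeatable; it doesn't change what the model "believes", only how much of the distribution you sample from.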



