They can answer lots and lots of questions that weren't in the training set.
E.g. you can relatively easily hack up a bit of code to create questions at random. At the most primitive, you just have a simple template that you fill in randomly. Like 'If I put _a down in front of _b but behind _c, what item will be in the middle?' with various _a, _b and _c.
If you make it slightly more complicated and have big enough pools to draw from, you can guarantee that the questions you are generating were not in the training set, if only because you can sample from, say, 10^100 different questions pretty easily, and I'm fairly sure their training set was smaller than that.
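A minimal sketch of the idea, with hypothetical word pools and the template from above (the pool contents and function names are made up for illustration). The point is just that the number of distinct questions grows multiplicatively with the pool size and the number of template slots, so even modest pools quickly exceed any plausible training set:

```python
import random

# Hypothetical item pool; a real generator would use a much larger one,
# or combine several pools across several template slots.
ITEMS = ["a red ball", "a blue cup", "a green book", "a yellow pen",
         "a small box", "a silver spoon", "a glass jar", "a wooden block"]

TEMPLATE = ("If I put {a} down in front of {b} but behind {c}, "
            "what item will be in the middle?")

def random_question(rng=random):
    # Sample three distinct items to fill the template slots.
    a, b, c = rng.sample(ITEMS, 3)
    return TEMPLATE.format(a=a, b=b, c=c)

def num_distinct_questions(pool_size, slots=3):
    # Ordered choices without repetition: n * (n-1) * ... * (n - slots + 1).
    count = 1
    for i in range(slots):
        count *= pool_size - i
    return count

print(random_question())
# With a pool of 100,000 items and 3 slots, you already get ~10^15
# distinct questions; stack a few more slots or templates and you
# pass 10^100 without much effort.
print(num_distinct_questions(100_000))
```

The guarantee in the text follows directly: if the space of generatable questions is larger than the training corpus, most sampled questions cannot have appeared in it verbatim.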