They can answer lots and lots of questions that weren't in the training set.
E.g. you can relatively easily hack up a bit of code to create questions at random. At the most primitive, you just have a simple template that you fill in randomly. Like 'If I put _a down in front of _b but behind _c, what item will be in the middle?' with various _a, _b and _c.
If you make it slightly more complicated and have big enough pools to draw from, you can guarantee that the questions you are generating were not in the training set, if only because you can sample from, say, 10^100 different questions pretty easily, and I'm fairly sure their training set was smaller than that.
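A minimal sketch of the idea, with hypothetical word pools and the template from above (the pool contents and function names are made up for illustration). The point is just that the number of distinct questions grows multiplicatively with the pool size and the number of template slots, so even modest pools quickly exceed any plausible training set:

```python
import random

# Hypothetical item pool; a real generator would use a much larger one,
# or combine several pools across several template slots.
ITEMS = ["a red ball", "a blue cup", "a green book", "a yellow pen",
         "a small box", "a silver spoon", "a glass jar", "a wooden block"]

TEMPLATE = ("If I put {a} down in front of {b} but behind {c}, "
            "what item will be in the middle?")

def random_question(rng=random):
    # Sample three distinct items to fill the template slots.
    a, b, c = rng.sample(ITEMS, 3)
    return TEMPLATE.format(a=a, b=b, c=c)

def num_distinct_questions(pool_size, slots=3):
    # Ordered choices without repetition: n * (n-1) * ... * (n - slots + 1).
    count = 1
    for i in range(slots):
        count *= pool_size - i
    return count

print(random_question())
# With a pool of 100,000 items and 3 slots, you already get ~10^15
# distinct questions; stack a few more slots or templates and you
# pass 10^100 without much effort.
print(num_distinct_questions(100_000))
```

The guarantee in the text follows directly: if the space of generatable questions is larger than the training corpus, most sampled questions cannot have appeared in it verbatim.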