I hear this a lot but there are vast sums of money thrown at where a model fails the strawberry cases.
Think about math and logic. If a single symbol is off, it’s no good.
Like a prompt where we can generate a single tokenization error at my work, by my very rough estimates, generates 2 man hours of work. (We search for incorrect model responses, get them to correct themselves, and if they can’t after trying, we tell them the right answer, and edit it for perfection). Yes even for counting occurrences of characters. Think about how applicable that is. Finding the next term in a sequence, analyzing strings, etc.
But we don’t restrict it to math or logical syntax. Any prompt across essentially all domains. The same model is expected to handle any kind of logical reasoning that can be brought into text. We don’t mark it incorrect if it spells an unimportant word wrong, however keep in mind the spelling of a word can be important for many questions, for example—off the top of my head: please concatenate “d”, “e”, “a”, “r” into a common English word without rearranging the order. The types of examples are endless. And any type of example it gets wrong, we want to correct it. I’m not saying most models will fail this specific example, but it’s to show the breadth of expectations.
Think about math and logic. If a single symbol is off, it’s no good.
Like a prompt where we can generate a single tokenization error at my work, by my very rough estimates, generates 2 man hours of work. (We search for incorrect model responses, get them to correct themselves, and if they can’t after trying, we tell them the right answer, and edit it for perfection). Yes even for counting occurrences of characters. Think about how applicable that is. Finding the next term in a sequence, analyzing strings, etc.