Is there some sort of "LLM-on-Wikipedia" competition?
i.e.: given "just Wikipedia", what's the best score people can get on whatever benchmarks these models are evaluated against?
I know that all the commercial ventures ingest training data voraciously, but it seems like there's room for a dictionary.llm + wikipedia.llm + linux-kernel.llm and some sort of judging / bake-off of their different performance capabilities.
Or does the training truly _NEED_ every book ever written + the entire internet + all knowledge ever known by mankind to have an effective outcome?
Lossy and lossless are more interchangeable in computer science than people give credit, so I wouldn't dwell on that too much. You can optimally convert one into the other with arithmetic coding. In fact, the best-in-class algorithms that have won the Hutter Prize are all lossy behind the scenes: they predict the next piece of data using a model (often AI-based), which is a lossy process, and then use arithmetic coding to losslessly encode that data with a number of bits proportional to how correct the prediction was. The reason the Hutter Prize targets lossless compression is exactly that converting lossy to lossless with arithmetic coding is a way to score how correct a lossy prediction is.
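A minimal sketch of that idea, not any actual Hutter Prize entry: an arithmetic coder spends about -log2(p) bits on a symbol the model assigned probability p, so a sharper predictor yields a smaller lossless file. The two toy "models" below are made up for illustration.

    import math

    def ideal_code_length(text, model):
        # Bits an arithmetic coder would need (to within a couple of bits total)
        # when driven by model(prefix) -> {next_char: probability}.
        bits = 0.0
        for i, ch in enumerate(text):
            p = model(text[:i]).get(ch, 1e-9)  # probability given to the true next char
            bits += -math.log2(p)              # coding cost of that symbol
        return bits

    def uniform_model(prefix):
        return {"a": 0.5, "b": 0.5}            # knows nothing about the data

    def better_model(prefix):
        if prefix.endswith("b"):
            return {"a": 0.9, "b": 0.1}        # has "learned" that 'a' follows 'b'
        return {"a": 0.5, "b": 0.5}

    data = "ba" * 50
    print(ideal_code_length(data, uniform_model))  # ~100 bits
    print(ideal_code_length(data, better_model))   # ~58 bits

Same data, same lossless guarantee; the only difference is prediction quality, which is exactly what the compressed size ends up scoring.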
>> Or does the training truly _NEED_ every book ever written + the entire internet + all knowledge ever known by mankind to have an effective outcome?
I have the same question.
Peter Norvig’s GOFAI Shakespeare generator example[1] (which is not an LLM) gets impressive results with little input data to go on. Does the leap to LLM preclude that kind of small input approach?
[1] A link should be here; I assumed as I wrote the above that I would just turn it up with a quick google. Alas, 'twas not to be. Take my word for it: somewhere on t'internet is an excellent write-up by Peter Norvig on LLM vs GOFAI (good old fashioned artificial intelligence).
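To make the "small input" point concrete without claiming this is Norvig's code: demos in that GOFAI spirit are often just character-level n-gram models trained on a single text. A rough sketch (the corpus filename is hypothetical):

    import random
    from collections import defaultdict

    def train_ngrams(text, n=4):
        # Map each n-character context to the characters observed to follow it.
        counts = defaultdict(list)
        for i in range(len(text) - n):
            counts[text[i:i+n]].append(text[i+n])
        return counts

    def generate(counts, n=4, length=300):
        state = random.choice(list(counts))
        out = list(state)
        for _ in range(length):
            nxt = random.choice(counts.get(state, [" "]))  # sample a continuation
            out.append(nxt)
            state = "".join(out[-n:])
        return "".join(out)

    # corpus = open("shakespeare.txt").read()   # any modest plain-text corpus
    # print(generate(train_ngrams(corpus)))

A few megabytes of text is enough for output that "sounds like" the source, which is the contrast with LLM-scale training data the parent comment is asking about.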