There are errors, based on evaluation criteria. To say that one token following the next is the only criterion is not how we got here, is it? We clearly have many more criteria for good and bad output from these LLMs than that.
Likewise, I cranked the temperature up for experimentation, and it produced gibberish. The randomizing of which token gets chosen next was so heavy handed that it literally couldn't even form words. That wouldn't be considered "good" by most people, would it? Would it for you? Technically it's still one token after the next, but it's objectively worse at auto-completion than my phone's keyboard.
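For context on what "cranking the temperature" does mechanically, here's a minimal sketch of temperature sampling with toy numbers (NumPy, made-up logits, not any particular model's code). The logits get divided by the temperature before the softmax, so a high temperature flattens the distribution and the next-token pick approaches a uniform roll over the vocabulary, which is the word-salad behaviour I'm describing:

    import numpy as np

    def sample_token(logits, temperature, rng=np.random.default_rng()):
        """Pick one token index from raw logits at a given temperature."""
        scaled = logits / temperature              # T >> 1 flattens, T << 1 sharpens
        probs = np.exp(scaled - scaled.max())      # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    logits = np.array([4.0, 2.0, 0.5, -1.0])       # toy "vocabulary" of 4 tokens
    for t in (0.2, 1.0, 5.0):
        picks = [sample_token(logits, t) for _ in range(1000)]
        print(t, np.bincount(picks, minlength=4) / 1000)
    # At T=0.2 nearly every pick is token 0; at T=5.0 the counts are close to
    # uniform, i.e. the model stops favouring likely continuations at all.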
My point is I think you can have both. Dumb autocomplete, but trained well enough that it's useful. That it doesn't try to say Obama is a cat. Or that 2+2=5. Yes, there will be edge cases where the weights produce odd output - but that's the entire goal of LLM research, right? To see how far we can steer a dumb autocomplete into usefulness.
If your argument is that "that gibberish output is perfect, no errors found" because the program technically ran, and the weights worked... well, I've got no reply. I can only assume you're referring to the foundational LLM, and I'm referring more to the training - a more holistic sense of "is the program working". But if you consider the gibberish working, then frankly most bad programs (crashing/etc) would be "working" too - right? Because they're doing exactly what they're programmed to.
"Working", or the lack of errors, seems to be a matter of semantic, human interpretation. But I'm quite in the weeds, heh.
> To say that one token following the next is the only criterion is not how we got here, is it?
You could just read the research about LLMs before saying stuff like this.
You can't even seem to grasp that "Obama is a cat" as a statement isn't gibberish. I'm not even trying to convince you that these programs are perfect, I'm just trying to make sure that you understand that these aren't categorical errors and the things you consider successes aren't even happening.
> I'm not even trying to convince you that these programs are perfect, I'm just trying to make sure that you understand that these aren't categorical errors and the things you consider successes aren't even happening.
Yeah, we're just talking past each other. I believe I understand what you're saying. I, on the other hand, am describing errors in UX.
Your point seems pedantic, tbh. Hopefully by now I've expressed something in the way of convincing you that, for the little I do "know" about these (admittedly not much), I get that they're nothing but pattern predictors. Token outputs based on token inputs. No intelligence. Yet you spend repeated replies that sound effectively like "Stop calling them errors!" when they are very clearly errors in the context of UX.
Your argument, if I understand correctly, is pointless because the goal of the application is to have the LLM's predictions align with a human-centric worldview. Up is not down, and the LLM should not predict tokens that espouse otherwise. In that context, the LLM replying "Up is indeed down" would be an error. Yet repeatedly you argue that it's not an error.
In my view, your argument would be better spent saying: "The LLM application as we strive for today is impossible. It will never be. It's snake oil. LLMs will never be reasonably and consistently correct by human interpretation."
I don't know if that's your view or not. But it's at least not talking past me about a point I'm not even making. My frame for this conversation was whether we can make token prediction align with human goals of accuracy. You saying inaccuracies are not errors "categorically" isn't in line with my original question, as I see it at least. It's apples and oranges.
Embarrassed about what, exactly? You seem hostile; I'm trying not to be.
I stand by everything I said.
> My hope is that even if it never goes beyond being an autocomplete; if we can improve the training dataset, help it not conflict with itself, etc - that maybe the autocomplete will be insanely useful.
I stand by my first post's summary, which is "never going past an autocomplete".
You're pedantic, and struggling to move past the fact that something can be both token prediction and still have successes and failures in the user's perception. Inaccuracies.
How you write software with such a mindset is beyond me.