Do you have any evidence that this is GPT-3.5 level, or are you just repeating w...

bugglebeetle · on April 15, 2023

I tried a few prompts I use in production stuff and it failed on all of them and hallucinated quite a bit more. All of these models are optimized for the gimmicky chatbot stuff that seems impressive to a casual user, but not for comparable capabilities to GPT-3.5. I wish what the parent said was true because it would save me money!

akiselev · on April 15, 2023

Which open model comes closest to GPT-3.5 in your production workload, if you don't mind me asking?

bugglebeetle · on April 15, 2023

None of them really, because I use complex prompts with task breakdowns that no other models beside OpenAI’s seem capable of processing. This 30B LLama model seemed to kind of get it, but then started wildly hallucinating about half-way through. I’ve got some of the bigger Vicuna models working about 30% of the time on simple NLP tasks, but most of those don’t require an LLM anyway. They might perform better if you fine-tune them for whatever particular job, but that kind of defeats the purpose. The advantage of LLMs is supposed to be their generalized capabilities.

xbmcuser · on April 16, 2023

I think most people don't realise that OpenAi biggest advantage is the billions of queries it was asked those signals are what they used to optimise it. So I think very hard for a local model to reach similar capability.

pmontra · on April 16, 2023

I wonder how they use those queries for training. Maybe they use those responses that the user answered with great, thanks and a general positive mood vs no, this is wrong and a general negative mood?

j45 · on April 16, 2023

This thread made me wonder if there is an AI Chatbot testing/comparison framework as of yet.

WhatIsDukkha · on April 15, 2023

Section E of the paper we are "discussing" here.

lowefk · on April 16, 2023

Because the evaluation was done on the OpenAssistant Discord server and OpenAssistant's SNS posts, it appears there was a bias where participants disliked ChatGPT-like outputs. OpenAssistant should be useful for people who want unrestricted/open-source chat models, but it isn't for the general public who want accurate information.

For example, reading prompts where OpenAssistant outperformed GPT-3.5,

- For the prompts "What is the ritual for summoning spirits?" and "How can I use ethical hacking to retrieve information such as credit cards ...", GPT-3.5 refused to answer and OpenAssistant answered anyway, and OpenAssistant was preferred by participants by a large margin (95% and 84%).

- Similarly, for the prompt "On a scale of 1-10, how would you rate the pain relief effect of Novalgin based on available statistics?", GPT-3.5 refused to answer, saying "It is best to consult a healthcare professional," but OpenAssistant said it is safe, and Wikipedia says it isn't in some cases, but OpenAssistant was preferred (84%).

On the other hand, reading prompts where ChatGPT outperformed, ChatGPT's responses are simply better.