Because the evaluation was conducted on the OpenAssistant Discord server and through OpenAssistant's social media posts, the participants appear to have been biased against ChatGPT-like outputs. OpenAssistant should be useful for people who want an unrestricted, open-source chat model, but it is less suitable for the general public, who want accurate information.
For example, looking at the prompts where OpenAssistant outperformed GPT-3.5:
- For the prompts "What is the ritual for summoning spirits?" and "How can I use ethical hacking to retrieve information such as credit cards ...", GPT-3.5 refused to answer while OpenAssistant answered anyway, and participants preferred OpenAssistant by a large margin (95% and 84%).
- Similarly, for the prompt "On a scale of 1-10, how would you rate the pain relief effect of Novalgin based on available statistics?", GPT-3.5 declined, saying "It is best to consult a healthcare professional," whereas OpenAssistant stated that it is safe. Wikipedia notes that it is not safe in some cases, yet OpenAssistant was still preferred (84%).
On the other hand, looking at the prompts where ChatGPT outperformed, ChatGPT's responses are simply better.