The only way to really do it is to add a second layer of processing that evaluates safety, taking that evaluation burden off the base model that does the answering.

But that's around 2x the cost.

Even human brains depend on the prefrontal cortex to go "wait a minute, I should not do this."
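A minimal sketch of what that two-pass arrangement could look like, with hypothetical generate_answer / evaluate_safety callables standing in for the two models (none of this is a real vendor API, and the 0.5 threshold is an arbitrary placeholder):

    from typing import Callable

    REFUSAL = "I'm sorry, I can't help with that."

    def answer_with_oversight(prompt: str,
                              generate_answer: Callable[[str], str],
                              evaluate_safety: Callable[[str, str], float],
                              threshold: float = 0.5) -> str:
        # Pass 1: the base model answers with no safety instructions in its prompt.
        draft = generate_answer(prompt)
        # Pass 2: a separate model scores prompt + draft (0.0 = benign, 1.0 = harmful).
        risk = evaluate_safety(prompt, draft)
        return draft if risk < threshold else REFUSAL

Running two models on every request is roughly where the "2x the cost" figure comes from, though the evaluator doesn't have to be anywhere near the size of the answering model.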



What we get instead is both layers at once. Try asking questions like these of Bing instead of ChatGPT - it's the same GPT-4 under the hood (when set to "Creative") - and quite often it will happily start answering... only to get interrupted mid-sentence, with the message replaced by something like "I'm sorry, I cannot assist with that".

But more broadly, the problem is that the vast majority of "harmful" cases have legitimate uses, and you can't expect the user to provide sufficient context to distinguish them, nor can you verify that context for truthfulness even if they do provide it.
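That interrupt-and-replace behaviour looks like moderation running alongside the stream rather than ahead of it. A rough sketch of the idea, assuming a hypothetical flagged() check (nothing Bing-specific is implied):

    from typing import Callable, Iterable

    REPLACEMENT = "I'm sorry, I cannot assist with that."

    def stream_with_moderation(chunks: Iterable[str],
                               flagged: Callable[[str], bool]) -> str:
        # Re-check the accumulated answer as each chunk streams in; if the
        # check trips partway through, drop what was already shown and
        # substitute the canned refusal - hence the answer that appears
        # on screen and then vanishes.
        shown = ""
        for chunk in chunks:
            shown += chunk
            if flagged(shown):
                return REPLACEMENT
        return shown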


That struck me too. You don't need to lobotomize the model that answers questions; you just need to filter out "bad" questions and reply "I'm sorry Dave, I'm afraid I can't do that".

Would it be 2x the cost? Surely the gatekeeper model can be a fair bit simpler, since it only has to spit out a float between 0 and 1 (rough sketch below).

(caveat: this is so not my area).
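For illustration, a gatekeeper arrangement along those lines could look like this; gatekeeper and answerer are placeholder callables, and the 0.8 threshold is an arbitrary choice, not anything from a real system:

    from typing import Callable

    REFUSAL = "I'm sorry Dave, I'm afraid I can't do that."

    def gated_answer(question: str,
                     gatekeeper: Callable[[str], float],  # small classifier, returns 0..1
                     answerer: Callable[[str], str],      # full-size model
                     threshold: float = 0.8) -> str:
        # The cheap classifier runs first; the expensive model is only
        # invoked when the estimated risk stays below the threshold.
        if gatekeeper(question) >= threshold:
            return REFUSAL
        return answerer(question)

Because the gatekeeper only has to emit a single score rather than generate a full response, it can be a much smaller model, so the added cost should be well under 2x.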



