This is already being explored. See:

https://nlp.elvissaravia.com/i/159010545/auditing-llms-for-h...

  The researchers deliberately train a language model with a concealed objective (exploiting reward model flaws in RLHF) and then attempt to expose that objective using several auditing techniques.
