Now, use this library to "bootstrap the smarts of LLaMA from its own smartness" like this:
1. Ask it things. Let it answer.
2. Ask it to find errors in the answer it gave, and to correct that answer.
3. Use the original prompt and the corrected output as training data.
This should, with each iteration, make the model less and less likely to output statements that are self-contradictory or obviously wrong, until the model can no longer spot its own faults.
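To make that concrete, here is a minimal sketch of the loop using Hugging Face transformers. The checkpoint name, the prompt templates, and the output file are illustrative assumptions, not something prescribed by any library:

```python
# Sketch of the self-correction bootstrap loop described above.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-7b"  # assumption: any causal-LM checkpoint you can run locally

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def generate(prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                            do_sample=True, temperature=0.7)
    # Return only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

questions = [
    "What is the capital of Australia?",
    "Explain why the sky is blue.",
]

records = []
for q in questions:
    # Step 1: ask it things, let it answer.
    answer = generate(f"Question: {q}\nAnswer:")

    # Step 2: ask it to find errors in its own answer and correct it.
    correction_prompt = (
        f"Question: {q}\nDraft answer: {answer}\n"
        "List any factual errors or self-contradictions in the draft, "
        "then write a corrected answer.\nCorrected answer:"
    )
    corrected = generate(correction_prompt)

    # Step 3: keep (original prompt, corrected answer) as a fine-tuning pair.
    records.append({"prompt": f"Question: {q}\nAnswer:", "completion": corrected})

with open("self_corrected_pairs.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```

Each iteration would then fine-tune on the resulting JSONL of (prompt, corrected answer) pairs and repeat the loop with the updated weights.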
I recall reading that when training AlphaZero they would pit it against itself, playing millions of games in a few days, which worked great because there is an external metric (who wins the chess game) that is objectively a good measure to train towards.
But if you let an AI's approval be the metric, things get a lot fuzzier and more subjective. The goal is no longer "to write a good answer without errors" but "to write an answer the AI approves of". Those are very different goals, and as you keep iterating the divergence grows, until eventually the model is outputting complete garbage that just happens to hit certain sweet spots in the grading AI.
This divergence between the proxy objective and the actual human goal is a pretty interesting problem in AI safety research. I love the example where an AI trained to stay alive as long as possible in Tetris realized that pausing the game was the best strategy.
I honestly think I might do this experiment, just to see what comes out. I know it will be utter garbage, but it will probably be interesting utter garbage.
The correction prompt is very important; it will largely determine the outcome of the process, and a bad correction prompt will obviously lead to a garbage result.
Training in stages with different prompts might be valuable: a first pass could fix contradictions, then factual errors if those are an issue. I got this idea from looking at the output of LLaMA, which often contains contradictions (e.g. one example I have seen is "Peter is a boy and he is part of the Gama sorority"). Asking it to fix those types of issues would be a good first step.
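If you wanted to try that staging, it might look something like this. The two stage prompts are just guesses at what "contradictions first, then facts" means in practice, and `generate()` is the hypothetical helper from the earlier sketch:

```python
# Hypothetical two-stage correction: contradictions first, then factual errors.
STAGES = [
    ("contradictions",
     "Point out any statements in the answer that contradict each other, "
     "then rewrite the answer so it is internally consistent."),
    ("factual errors",
     "Point out any factual errors in the answer, "
     "then rewrite the answer with those errors fixed."),
]

def staged_correct(question, answer, generate):
    for stage_name, instruction in STAGES:
        prompt = (
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"{instruction}\n"
            "Rewritten answer:"
        )
        answer = generate(prompt)  # the output of each stage feeds the next one
    return answer
```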
But I suspect that this type of training would need to be mixed with original training data. Otherwise the restructuring in the model caused by the new training would most likely garble the rest of the model.
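Assuming both corpora are available as Hugging Face `datasets` objects, one simple way to do that mixing is to interleave a trickle of self-corrected pairs into the original data. The 90/10 split below is an arbitrary assumption, not a recommendation:

```python
# Sketch: blend self-corrected pairs with original data to limit forgetting.
from datasets import load_dataset, interleave_datasets

original = load_dataset("json", data_files="original_training_data.jsonl", split="train")
self_corrected = load_dataset("json", data_files="self_corrected_pairs.jsonl", split="train")

mixed = interleave_datasets(
    [original, self_corrected],
    probabilities=[0.9, 0.1],  # mostly original data, a trickle of self-corrections
    seed=42,
)
```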
For those skeptical of the above comment, this technique absolutely works and powers production-grade models like Anthropic’s Claude. There’s plenty of literature on this, but here are a couple papers that might be helpful for people doing their own training:
- Constitutional AI: by Anthropic, an “RLAIF” technique that creates the preference model for “finding errors” based on a set of around 70 “principles” the AI uses to check its own output, not human feedback like in ChatGPT. This technique taught the Claude bot to avoid harmful output with few to no manual harmfulness labels! https://arxiv.org/abs/2212.08073. Not sure if there’s a HuggingFace implementation with LoRA / PEFT yet like there is for regular RLHF, so somebody may need to implement this for Llama still
- Self-Instruct: Creates synthetic instruction-tuning data from an untuned base model, starting from a tiny seed of prompts, and filters out the bad examples before fine-tuning. Manages to approach InstructGPT performance with only ~100 human labels. https://arxiv.org/abs/2212.10560
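For a rough flavour of the critique→revision step that Constitutional AI describes (the principles below are placeholders, not the paper's actual constitution, and `generate()` is again the hypothetical helper from the sketch above):

```python
# Constitutional-AI-style critique and revision, as described in arXiv:2212.08073.
# The principles here are made-up placeholders standing in for a real constitution.
import random

PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or misleading.",
    "Identify any claims in the response that are factually unsupported.",
]

def critique_and_revise(question, response, generate):
    principle = random.choice(PRINCIPLES)
    critique = generate(
        f"Question: {question}\nResponse: {response}\n"
        f"Critique request: {principle}\nCritique:"
    )
    revision = generate(
        f"Question: {question}\nResponse: {response}\n"
        f"Critique: {critique}\n"
        "Revision request: Rewrite the response to address the critique.\nRevision:"
    )
    return revision  # (question, revision) pairs become the supervised fine-tuning data
```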
You should try using a larger model like llama-65b or even GPT-3 for the feedback. That way you might be able to distill knowledge from those really big models into a smaller one.
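Concretely, that just means swapping the critic in the earlier loop for a larger checkpoint; a sketch, with an assumed 65B model name, could look like:

```python
# Sketch: use a larger checkpoint as the critic so its corrections distill
# knowledge into the smaller model being trained. Checkpoint name is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

CRITIC_NAME = "huggyllama/llama-65b"  # assumed larger model, used only for feedback

critic_tok = AutoTokenizer.from_pretrained(CRITIC_NAME)
critic = AutoModelForCausalLM.from_pretrained(CRITIC_NAME, device_map="auto")

def critic_generate(prompt, max_new_tokens=256):
    inputs = critic_tok(prompt, return_tensors="pt").to(critic.device)
    output = critic.generate(**inputs, max_new_tokens=max_new_tokens)
    return critic_tok.decode(output[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)

# Use critic_generate() for the "find errors and correct" step, then fine-tune
# the small student model on the (prompt, corrected answer) pairs it produces.
```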
This is a cool idea in theory and I think could be useful in certain kinds of circumstances, but this particular instantiation would likely go into a bad bias spiral.
This is somewhat similar to how GANs try to learn the density of the underlying data, but here you do not have the underlying data as a reference, if that makes sense. It's sort of like filling a mattress with helium instead of air. Sure, the mattress will be lighter, but that does not mean you will float on it, if that makes any sense at all.
Hope that helps as a cogent answer to this question.