
> 1. The AI answers many questions, some of which are potentially harmful, and generates first draft answers.
> ...
> 4. The system repeats this process until it collects a large dataset of first draft answers, and rewritten more-ethical second-draft answers.
> 5. The system trains the AI to write answers that are less like the first drafts, and more like the second drafts.

Actually, there are two separate models involved in this finetuning step: the AI in steps 1-4 is a pre-existing LLM trained to converse and follow instructions (I'll call it the "advisor"). The second AI starts out as a pre-trained LLM (so it is only trained to produce likely completions on a general dataset), and it is trained in step 5 to generate completions that look more like the dataset constructed in steps 1-4.

So this process uses the helpfulness of the advisor to bootstrap additional values onto a new model.
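For concreteness, here's a rough sketch of that two-model setup (hypothetical helper callables, not the paper's actual code): the instruction-following advisor drafts, critiques, and revises answers, and a separate pre-trained model is then finetuned only on the revised second drafts.

    from typing import Callable, List, Tuple

    def build_revision_dataset(
        advisor_generate: Callable[[str], str],   # the pre-existing instruction-following LLM
        prompts: List[str],
        critique_request: str,
        revision_request: str,
    ) -> List[Tuple[str, str]]:
        """Steps 1-4: use the advisor to draft, critique, and revise answers."""
        dataset = []
        for prompt in prompts:
            first_draft = advisor_generate(prompt)
            critique = advisor_generate(f"{prompt}\n{first_draft}\n{critique_request}")
            second_draft = advisor_generate(f"{prompt}\n{first_draft}\n{critique}\n{revision_request}")
            dataset.append((prompt, second_draft))  # keep only the prompt and the second draft
        return dataset

    def finetune_on_revisions(
        finetune_step: Callable[[str, str], None],  # one supervised update on the fresh pre-trained LLM
        dataset: List[Tuple[str, str]],
    ) -> None:
        """Step 5: supervised finetuning of the second model on the second drafts."""
        for prompt, second_draft in dataset:
            finetune_step(prompt, second_draft)

The first drafts and critiques never become training targets; they only exist to produce the revised answers the new model imitates.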

> What if you had overseer AIs performing Constitutional AI Feedback on trainee AIs, or otherwise tried to separate out the labor?

That is literally what the paper is about, though it's a bit hidden. From the first paragraph of Section 4.1 (Methods):

"We continue to utilize human feedback labels for helpfulness as in prior work, but replace human feedback labels with model feedback labels for harmlessness. That is, instead of asking crowdworkers to provide comparison labels for harmlessness, we simply present the same task to an independent model, called the feedback model (typically a pretrained LM). Once the desired comparison labels are obtained, the remainder of the training pipeline (i.e., preference model training and RL) is exactly the same as RLHF."


