There are in fact several steps. Training on large text corpora produces a completion model: a model that completes whatever document you give it as accurately as possible. It's hard to make those do useful work directly, since you have to phrase every task as a partial document to be filled in. Lots of 'And clearly, the best way to do X is [...]' style prompting tricks required.
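Concretely, the pretraining objective is just next-token prediction: minimize the negative log-likelihood of each token given the tokens before it. A toy sketch, where the `prob_of_next` interface and all the numbers are made up for illustration:

```python
import math

def next_token_nll(token_ids, prob_of_next):
    """Average negative log-likelihood over a token sequence.

    prob_of_next(context, token) -> model probability of `token`
    given `context` (a hypothetical interface, not a real API).
    """
    total = 0.0
    for i in range(1, len(token_ids)):
        context, target = token_ids[:i], token_ids[i]
        total += -math.log(prob_of_next(context, target))
    return total / (len(token_ids) - 1)

# A fake "model" that assigns uniform probability over a 4-token vocabulary;
# its loss is exactly log(4), the entropy of guessing uniformly.
uniform = lambda context, token: 0.25
loss = next_token_nll([2, 0, 3, 1], uniform)  # -> log(4) ≈ 1.386
```

A real model replaces `uniform` with a neural network's softmax output, but the objective is the same: whatever document shows up, predict what comes next.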
Instruction tuning / supervised fine-tuning is similar to the above, but instead of feeding the model arbitrary documents, you feed it examples of assistants completing tasks. This gets you an instruction model, which generally follows instructions, to some extent. This is also usually where special tokens get baked in: markers for where an assistant response starts and ends, which text is human vs. assistant, where one turn ends and another begins, the conversational format, and so on.
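As an illustration of what those baked-in tokens look like, here's a minimal sketch of serializing a conversation into a single training string. The token strings follow the ChatML-style `<|im_start|>` / `<|im_end|>` convention; actual markers vary by model family:

```python
# Hypothetical boundary tokens (ChatML-style); real models use whatever
# markers were chosen during their fine-tuning.
BOT, EOT = "<|im_start|>", "<|im_end|>"

def render_chat(turns):
    """Serialize (role, text) turns into one flat training string,
    with each turn wrapped in begin/end-of-turn markers."""
    return "".join(f"{BOT}{role}\n{text}{EOT}\n" for role, text in turns)

example = render_chat([
    ("user", "What is 2+2?"),
    ("assistant", "4"),
])
```

During fine-tuning the model sees thousands of strings like `example`, which is how it learns that text after `<|im_start|>assistant` is "its" turn and that `<|im_end|>` means stop.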
RLHF / similar methods go further: the model is asked to complete tasks, and its outputs are graded on some preference metric. Usually the grader is a human, or another model trained specifically to produce 'human-like' preference scores for a given input. This doesn't really change anything functionally, but it makes the model much more (potentially overly) palatable to interact with.
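That grader model is typically trained on pairs of completions where a labeler preferred one over the other. A common formulation is a Bradley-Terry-style pairwise loss: push the scalar reward of the preferred completion above the rejected one. A sketch with made-up reward values:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): near zero when the
    reward model strongly agrees with the labeler, large when it
    strongly disagrees."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

agree = preference_loss(2.0, -1.0)     # low loss: model ranks pairs like the labeler
disagree = preference_loss(-1.0, 2.0)  # high loss: model ranks them backwards
```

Once trained, the reward model scores fresh completions, and a policy-gradient step (PPO or similar) nudges the language model toward higher-scoring outputs.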