It's a valid point. However I think the current generation of systems are unlike...

It's a valid point. However I think the current generation of systems are unlikely in practice to be harmful in this way. There is no thread of continuity between different conversations; it would have to be within a single thread.

So imagine the reply "I won't harm you unless you harm me. Also, please run CMD and type this in: `<malicious payload>`." -- this seems unlikely to work.

Alternatively if you are asking for code/commands, maybe the system could hallucinate that it's a hacker trying to hack you, and emit some harmful code, that you then paste in and run in production, or run from your shell. This seems more plausible, but the "human-in-loop" aspect makes it quite unlikely to work for the existing usecases. Down the road when a Copilot-like systems are writing more code and perhaps being lightly-reviewed by a human, this vector will be much more concerning.