It was reading step 2 and he was trying to get it to do step 1.
He had not yet combined the ingredients. Given the way he kept repeating his phrasing, it seems likely that “what do we do first” was a hardcoded cheat phrase to get it to say a specific line, which it got wrong.
I wonder if his audio was delayed? Or maybe the response wasn’t what they rehearsed and he was trying to get it on track?