This response is very poor and ignores many well-developed arguments in the Stanford paper (such as the incorrect NLP regex, or that exact-answer nonlinearity can still be measured more finely with a larger number of exact-answer test questions).
It’s literally addressed in his first bullet point…
“Response: While there is evidence that some tasks that appear emergent under exact match have smoothly improving performance under another metric, I don’t think this rebuts the significance of emergence, since metrics like exact match are what we ultimately want to optimize for many tasks.”