The metadata exists in special karaoke recordings, but assuming they're using original recordings that weren't created or modified for karaoke, they'd have to generate it on the fly.

I'd guess it's done with the same kind of speech-to-text system that voice assistants use. Those can certainly show the words they hear in near-realtime, and they can do it much more quickly when they already know which words they're listening for.
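To make the "already knows what words it's listening for" point concrete, here is a rough sketch of how known lyrics could be lined up against recognizer output. It's only a guess at the general idea, not how any real karaoke product works: `align_lyrics`, the `heard` list of `(word, seconds)` pairs, and the `lookahead` window are all made up for illustration.

```python
def align_lyrics(lyrics, heard, lookahead=3):
    """Assign a timestamp to each known lyric word.

    lyrics: list of words known in advance.
    heard:  list of (word, seconds) pairs, as a hypothetical
            speech recognizer might emit them.
    Returns a list of (lyric_word, seconds or None).
    """
    def norm(w):
        # Compare case-insensitively and ignore punctuation.
        return "".join(ch for ch in w.lower() if ch.isalnum())

    times = [None] * len(lyrics)
    pos = 0  # index of the next lyric word we expect to hear
    for word, t in heard:
        # Only search a few words ahead, so one misheard word
        # cannot pull the alignment far off course.
        for i in range(pos, min(pos + lookahead, len(lyrics))):
            if norm(lyrics[i]) == norm(word):
                times[i] = t
                pos = i + 1
                break
    return list(zip(lyrics, times))


lyrics = ["Twinkle", "twinkle", "little", "star"]
heard = [("twinkle", 0.5), ("twinkle", 1.1), ("middle", 1.6), ("star", 2.2)]
print(align_lyrics(lyrics, heard))
```

Because the expected words are known, the misheard "middle" is simply skipped and "star" still lands on the right lyric; "little" just ends up without a timestamp.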

By the way, karaoke often highlights individual syllables, not just whole words.
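For the syllable highlighting, one crude way to get per-syllable times from a word-level timestamp is to split the word's time span evenly. This is purely an assumption for illustration: `syllable_times` is a made-up helper, the syllable split is supplied by hand, and a real system would presumably time each syllable from the audio rather than divide evenly.

```python
def syllable_times(syllables, start, end):
    """Spread a word's time span evenly over its syllables.

    syllables: the word already split into syllables (supplied by hand here).
    start, end: when the word begins and ends, in seconds.
    Returns a list of (syllable, start_seconds).
    """
    step = (end - start) / len(syllables)
    return [(s, start + i * step) for i, s in enumerate(syllables)]


print(syllable_times(["ka", "ra", "o", "ke"], 2.0, 4.0))
```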