The timing metadata exists in special karaoke recordings, but assuming they're using original recordings that weren't created or modified for karaoke, they'd have to create it on the fly:
I'd guess it's done using the same kind of speech-to-text system used by voice assistants, which can certainly show the words it hears in near real time -- and much more quickly when it already knows which words it's listening for.
By the way, karaoke often highlights individual syllables, not just whole words.
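To make the "already knows what words it's listening for" part concrete, here's a rough sketch in Python. It's only my guess at the idea, not how any particular karaoke app works. It assumes the recognizer hands back `(word, start_seconds)` pairs (a made-up format for illustration) and lines them up against the known lyric sheet with the standard library's `difflib.SequenceMatcher`. The function name `align_lyrics` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def align_lyrics(lyrics, heard):
    """Give each known lyric word a start time taken from recognizer output.

    lyrics: list of words from the known lyric sheet.
    heard:  list of (word, start_seconds) pairs; a hypothetical
            speech-to-text result format, assumed for this sketch.
    Returns a list of (lyric_word, start_seconds or None).
    """
    # Compare case-insensitively, but keep the original lyric spelling.
    lyric_keys = [w.lower() for w in lyrics]
    heard_keys = [w.lower() for w, _ in heard]

    times = [None] * len(lyrics)
    matcher = SequenceMatcher(a=lyric_keys, b=heard_keys, autojunk=False)

    # Each matching block is a run of words that agree in both sequences.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.a + k] = heard[block.b + k][1]

    return list(zip(lyrics, times))


# Made-up example: the recognizer mishears "little" as "lit".
lyrics = ["Twinkle", "twinkle", "little", "star"]
heard = [("twinkle", 0.0), ("twinkle", 0.6), ("lit", 1.2), ("star", 1.9)]
timeline = align_lyrics(lyrics, heard)

for word, start in timeline:
    print(word, start)
```

Words the recognizer got wrong simply come back with `None` as their time; a real system would presumably fill those gaps from the neighbouring words, and would need finer-grained timing than this to highlight syllables.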